
Troubleshooting: Instance Eviction and Growing "IP packet reassembles failed" on Linux 6.6

A few days ago my friend Zhao Bin shared a case with me that was later contributed to the ANBOB public account: the second node of a database was evicted from the cluster, and every automatic restart attempt failed. The GI alert log showed the eviction was caused by loss of private-network (interconnect) communication, and the restarts failed because ASM could not start; the ASM alert log showed IPC Send timeout. At the time, ping and traceroute revealed nothing unusual, but the netstat -s samples collected by OSWatcher showed the "IP packet reassembles failed" counter growing rapidly during the problem window. The environment was Oracle 11.2.0.4 two-node RAC on RHEL 6.6, and each database server had more than one hundred CPU cores. Several coincidences converged in this case: with only a few CPUs the problem might never have occurred, and on RHEL 7 it would not have occurred at all. The issue was ultimately resolved by tuning OS kernel parameters, but the root cause is worth a closer look.

Node1 GI ALERT LOG

2017-03-02 11:46:26.607: 
[/u01/11.2.0/grid/bin/oraagent.bin(121496)]CRS-5011:Check of resource "testdb" failed: details at "(:CLSN00007:)" in "/u01/11.2.0/grid/log/anbob/agent/crsd/oraagent_oracle//oraagent_oracle.log"
2017-03-02 11:46:26.612: 
[crsd(175117)]CRS-2765:Resource 'ora.testdb.db' has failed on server 'anbob'.
2017-03-02 11:46:42.866: 
[cssd(172139)]CRS-1612:Network communication with node db2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.620 seconds
2017-03-02 11:46:50.869: 
[cssd(172139)]CRS-1611:Network communication with node db2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds
2017-03-02 11:46:54.870: 
[cssd(172139)]CRS-1610:Network communication with node db2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.620 seconds
2017-03-02 11:46:57.493: 
[cssd(172139)]CRS-1607:Node db2 is being evicted in cluster incarnation 351512591; details at (:CSSNM00007:) in /u01/11.2.0/grid/log/anbob/cssd/ocssd.log.
2017-03-02 11:46:58.626: 
[cssd(172139)]CRS-1662:Member kill requested by node db2 for member number 0, group DBHBCRM

Node2 GI ALERT LOG

2017-03-02 11:46:45.378: 
[cssd(177450)]CRS-1663:Member kill issued by PID 84816 for 1 members, group DBCRM. Details at (:CSSGM00044:) in /u01/11.2.0/grid/log/db2/cssd/ocssd.log.
2017-03-02 11:46:58.982: 
[cssd(177450)]CRS-1608:This node was evicted by node 1, anbob; details at (:CSSNM00005:) in /u01/11.2.0/grid/log/db2/cssd/ocssd.log.
2017-03-02 11:46:58.983: 
[cssd(177450)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/11.2.0/grid/log/db2/cssd/ocssd.log  

Note:
Node2 killed itself (member kill) after the private-network communication failure.

Node2 ASM ALERT LOG

* allocate domain 7, invalid = TRUE 
 * domain 7 valid = 1 according to instance 1 
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
Thu Mar 02 11:54:14 2017
IPC Send timeout detected. Sender: ospid 153935 [oracle@db2 (LMD0)]
Receiver: inst 1 binc 430132771 ospid 174276
IPC Send timeout to 1.0 inc 92 for msg type 65521 from opid 10
Thu Mar 02 11:54:16 2017
Communications reconfiguration: instance_number 1
Thu Mar 02 11:54:16 2017
Dumping diagnostic data in directory=[cdmp_20170302115416], requested by (instance=1, osid=174274 (LMON)), summary=[abnormal instance termination].
Reconfiguration started (old inc 92, new inc 96)
List of instances:
 2 (myinst: 2) 
 Nested reconfiguration detected. 
 Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE 
freeing rdom 1
* dead instance detected - domain 2 invalid = TRUE 
freeing rdom 2

Note:
Node2's ASM failed with IPC send timeout and never came back up.

Node2 OSW netstat output

zzz ***Thu Mar 2 11:40:31 CST 2017
75646 packet reassembles failed
zzz ***Thu Mar 2 11:40:52 CST 2017
77043 packet reassembles failed
zzz ***Thu Mar 2 11:46:33 CST 2017
131134 packet reassembles failed

Note:
A large number of IP packet reassembly failures were produced during this window; aggregating the earlier samples showed the counter was not growing before the incident. The counter can be sampled as follows:

-- FOR LINUX 6
netstat -s | grep reassembles
sleep 5
netstat -s | grep reassembles

-- FOR LINUX 7
nstat -az | grep IpReasmFails
sleep 5
nstat -az | grep IpReasmFails
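
For a continuous view, here is a minimal sketch (assuming Linux 6 and an arbitrary 5-second interval) that prints how much the counter grows per interval:

#!/bin/bash
# Print the per-interval growth of the cumulative
# "packet reassembles failed" counter from netstat -s.
prev=$(netstat -s | awk '/packet reassembles failed/ {print $1}')
while true; do
    sleep 5
    cur=$(netstat -s | awk '/packet reassembles failed/ {print $1}')
    echo "$(date '+%F %T') reassembles failed: $cur (+$((cur - prev)))"
    prev=$cur
done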

At this point, a MOS search on the above keywords should already turn up the solution: RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1). The note attributes the problem to excessive IP packet reassembly failures at the OS level. The fix is to enlarge the reassembly buffers or to enable jumbo frames; since jumbo frames require hardware support, the simpler approach is to increase the size of the network IP reassembly buffers. For example, on RHEL 6.6:

$ vi /etc/sysctl.conf
# append
net.ipv4.ipfrag_high_thresh = 16777216   # default 4194304 (4 MB)
net.ipv4.ipfrag_low_thresh = 15728640    # default 3145728 (3 MB)
net.ipv4.ipfrag_time = 120               # default 30

Apply the changes immediately:
$ sysctl -p

Per RHEL's official recommendation, the following two parameters can also be tuned:
net.core.netdev_max_backlog = 2000
net.core.netdev_budget = 600
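
These two parameters control how many packets the kernel will queue per CPU and process per softirq cycle. To tell whether the backlog is actually overflowing, a quick check (a sketch; on these kernels the second hexadecimal column of /proc/net/softnet_stat is the per-CPU drop count):

# One line per CPU; the 2nd hex column counts packets dropped because
# the net.core.netdev_max_backlog queue overflowed.
cat /proc/net/softnet_stat
# Non-zero, growing values in that column suggest raising the two
# parameters above.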

The defaults are normally sufficient, and the tuning above is not something Oracle's RHEL best practices recommend at installation time. So should every later Linux platform have these parameters adjusted? With that question in mind, I checked the kernel parameters on a Linux 7.2 system:

[root@MiWiFi-srv ~]# sysctl -a|grep ipfrag
net.ipv4.ipfrag_high_thresh = 4194304
net.ipv4.ipfrag_low_thresh = 3145728
net.ipv4.ipfrag_max_dist = 64
net.ipv4.ipfrag_secret_interval = 600
net.ipv4.ipfrag_time = 30

If the default settings were not optimal, why hasn't the latest Linux release changed them, and why doesn't Oracle recommend changing them either? Let's dig a little deeper.

What is "IP packet reassembles failed"?

On Linux, netstat -s reports a "packet reassembles failed" item: a cumulative count of IP packets that could not be reassembled. When does reassembly happen? Whenever IP traffic is fragmented. The MTU (Maximum Transmission Unit) limits the size of each IP packet on the wire. Fragments can arise when the source and destination use different MTUs, so first verify that the same MTU is configured along the whole path. Fragmentation also occurs whenever a datagram larger than the MTU is sent; it is then split into multiple fragments. Tuning the MTU alone therefore cannot eliminate all IP fragmentation. What you can do is enlarge the reassembly buffers, keeping as much data as possible buffered at the receiver until all fragments of a datagram have arrived and it can be reassembled and verified. On Linux this buffer is controlled by ipfrag_low_thresh and ipfrag_high_thresh; if reassembly failures remain high after raising them, you can also increase ipfrag_time, which controls how many seconds IP fragments are kept in memory.
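
A quick sketch for checking MTU consistency on the interconnect; the interface name eth1 and peer address 192.0.2.2 are placeholders for your own environment:

# Local interface MTU
ip link show eth1 | grep -o 'mtu [0-9]*'
# tracepath reports the discovered path MTU (pmtu) hop by hop
tracepath 192.0.2.2
# Probe with "don't fragment" set: 1472 = 1500 - 20 (IP) - 8 (ICMP) headers;
# failures here mean a smaller MTU somewhere on the path.
ping -M do -s 1472 -c 3 192.0.2.2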

In an Oracle RAC environment, one node may suddenly push a large volume of data to another node, for example because of application design, cache-fusion traffic on data files, or an archive destination that only one node can access; all of these increase interconnect traffic. In that case check the network load and look for packet loss or inconsistent packets. Oracle uses both the UDP and IP protocols on the interconnect, so both sets of statistics deserve attention.
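
A minimal set of checks for the above, using stock tools only (eth1 again stands in for the private interconnect interface):

# IP-level fragmentation/reassembly statistics
netstat -s | grep -iE 'fragment|reassembl'
# UDP-level statistics: watch "packet receive errors" and buffer errors
netstat -su
# Interface-level drops/errors on the interconnect
ip -s link show eth1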

So do all Linux systems need these parameters adjusted?
No. According to RHEL, the problem essentially appears only on Linux 6.6 and some 6.7 systems. The root cause is that per-CPU counters were introduced for fragmentation memory accounting in the RHEL 6.6 kernel (kernel-2.6.32-477.el6); the bug is fixed in Linux 6.8 (kernel-2.6.32-642.el6) and in Linux 6.7.x (kernel-2.6.32-573.8.1.el6), so other configurations are not affected.
There is one such counter per CPU, and with many CPUs the global value can easily exceed the default 4 MB ipfrag_high_thresh. This is the coincidence I mentioned at the start: the more CPUs you have, and the heavier the network traffic, the more easily the threshold is exceeded and reassembly failures are produced.
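
A back-of-the-envelope check, assuming the ~130K per-CPU sync batch described in the RHEL note quoted below: the worst-case drift of the global counter is roughly ncpus x 130 KB, so a server with 100+ cores can "exceed" the 4 MB default threshold without any real fragmentation memory in use:

# Worst-case accounting drift before per-CPU counters are folded into the
# global count (assumption: ~130 KB batch per CPU, per the RHEL note).
ncpus=$(nproc)
echo "worst-case drift: $((ncpus * 130 * 1024)) bytes"
cat /proc/sys/net/ipv4/ipfrag_high_thresh   # default is 4194304 (4 MB)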

Here is the official RHEL explanation:

A bug has been added to address this for RHEL6.6, where IP fragmentation memory accounting fails under some conditions. The new percpu counters are implemented to increase performance. There is one pcounter per CPU. Therefore when the routines that increment these counters run there is no need for locking. However this comes at the expense of accuracy. There is a global 64 bit count variable in the percpu_counter structure. This is only updated under lock when the lockless percpu counter is > 130K or < -130K. Therefore it is possible on machines with many CPUs that this global value can exceed the default 4MB ipfrag_high_thresh threshold.

When this happens it will trigger false evictions and further compromise the accuracy of the global count variable, which is used by the IP stack to determine whether to evict incoming fragments or not. This will cause subsequent IP fragmentation to fail. It should be noted that actual overall value of the counter is correct when the global count variable is sync'd with the per cpu counters. This is apparent when checking the frag memory via cat /proc/net/sockstat. This invokes the sum_frag_mem_limit() routine which will tally up all the per cpu counters and return the real fragmentation memory that is being consumed.

As this is not a real memory leak it can be mitigated by at least doubling the IP fragmentation thresholds as per the resolution section and no further action is required. Engineering are also looking at using the more reliable sum_frag_mem_limit() routine to calculate the real fragmentation memory when the threshold is exceeded as a precaution. This will prevent IP fragmentation breaking if the threshold is hit. However simply doubling the IP fragmentation thresholds should prevent the threshold from being exceeded in the 1st place, even on large CPU configurations.
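
As the note says, /proc/net/sockstat goes through sum_frag_mem_limit() and reports the real fragmentation memory, so comparing it with the threshold distinguishes genuine fragment pressure from counter drift:

# Real frag memory, with all per-CPU counters summed,
# e.g. "FRAG: inuse 0 memory 0"
grep FRAG /proc/net/sockstat
# The threshold the (possibly drifted) global counter is checked against
cat /proc/sys/net/ipv4/ipfrag_high_thresh
# Reassembly failures growing while FRAG memory stays far below the
# threshold point to the drifted global counter, i.e. this bug.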

Summary:
Some database workload triggered a large volume of data transfer between the nodes; this landed on a Linux 6.6 kernel carrying this bug, on a server with a very large number of CPUs. Together these triggered the IPC timeouts and IP packet reassembly failures, the resulting private-network packet loss, and the eviction of node 2. The fix is to adjust the OS kernel parameters, enlarging the IP reassembly buffers to reduce "IP packet reassembles failed" as much as possible. No system runs without a network, and a DBA should understand some of it too. I hope this helps.

Tip: if a similar problem shows up on a future Linux 7 or 8 system, the same kernel parameters above can be adjusted.
-- over --
