Troubleshooting 11gR2 Grid Infrastructure: Node Cannot Rejoin the Cluster After Eviction, with Disk and Network Heartbeats Both Failing
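A while ago I analyzed a case where node 2 was evicted and then could not rejoin the cluster. The log pointed to a network communication problem. The original eviction showed both voting disk errors (CRS-1615: No I/O has completed) and "Network communication missing", so the disk heartbeat and the network heartbeat failed at the same time. Yet both the storage and the private network are dual-path, and they do not share a switch. What could make both fail together? A short record follows.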
# node2 GI alert log
2019-09-17 02:10:06.619: [cssd(20050)]CRS-1612:Network communication with node node1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.870 seconds
2019-09-17 02:10:06.940: [cssd(20050)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/asm-diskj will be considered not functional in 7990 milliseconds
2019-09-17 02:10:06.940: [cssd(20050)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/asm-diskk will be considered not functional in 8110 milliseconds
2019-09-17 02:10:08.445: [cssd(20050)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/asm-diskj will be considered not functional in 6480 milliseconds
2019-09-17 02:10:08.445: [cssd(20050)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/asm-diskk will be considered not functional in 6600 milliseconds
2019-09-17 02:10:14.623: [cssd(20050)]CRS-1662:Member kill requested by node node1 for member number 1, group DBORCL
2019-09-17 02:11:41.644: [cssd(20050)]CRS-1611:Network communication with node node1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.620 seconds
2019-09-17 02:11:45.645: [cssd(20050)]CRS-1610:Network communication with node node1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.620 seconds
2019-09-17 02:11:48.268: [cssd(20050)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:11:48.268: [cssd(20050)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log
2019-09-17 02:11:48.315:
2019-09-17 02:13:57.761: [cssd(1248047)]CRS-1713:CSSD daemon is started in clustered mode
2019-09-17 02:14:13.489: [cssd(1248047)]CRS-1707:Lease acquisition for node node2 number 2 completed
2019-09-17 02:14:14.761: [cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diski; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:14.770: [cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diskj; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:14.779: [cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diskk; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:33.309: [cssd(1248047)]CRS-1612:Network communication with node node1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.040 seconds
2019-09-17 02:14:40.310: [cssd(1248047)]CRS-1611:Network communication with node node1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.040 seconds
2019-09-17 02:14:45.312: [cssd(1248047)]CRS-1610:Network communication with node node1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.040 seconds
2019-09-17 02:14:47.355: [cssd(1248047)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:47.355: [cssd(1248047)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log
2019-09-17 02:14:47.387: [cssd(1248047)]CRS-1603:CSSD on node node2 shutdown by user.
2019-09-17 02:14:47.589: [cssd(1248047)]CRS-1660:The CSS daemon shutdown has completed
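When the alert log is long, a per-code count of the CSSD messages can make it easier to see whether network heartbeat warnings (CRS-1612/1611/1610) and voting file warnings (CRS-1615/1614) show up together. A minimal sketch, assuming GNU grep (for `-o`) and that the log path is passed in; the function name `summarize_crs` is made up for illustration:

```shell
#!/bin/sh
# Count CSSD messages per CRS code in a Grid Infrastructure alert log.
# summarize_crs is a hypothetical helper name, not an Oracle tool.
summarize_crs() {
    # Extract each "[cssd(pid)]CRS-NNNN" token, drop the "[cssd(pid)]" prefix,
    # then print "<code> <count>" sorted by code.
    grep -o '\[cssd([0-9]*)]CRS-[0-9]*' "$1" \
        | sed 's/.*]//' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

It would be called with the alert log of the node in question, for example `summarize_crs alertnode2.log` (the file name here is an assumption).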
Diagnosis
# on node1
ping node2-priv-ip
traceroute node2-priv-ip
# on node2
ping node1-priv-ip
traceroute node1-priv-ip
-- The network tests showed no problems either.
cat /etc/*release*|head -n 3
-- RHEL 6.6
Seeing this trouble-prone OS version again reminded me of an old problem: https://www.anbob.com/archives/2851.html
ping 11.11.11.11 -s 8192
-- As expected, no response. Even without jumbo frames, the packet should simply be fragmented and sent.
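To see why an 8192-byte ping exercises IP reassembly, it helps to count the fragments. A rough sketch, assuming IPv4 without options (20-byte header), an 8-byte ICMP header, and an MTU of 1500 unless another value is given; `fragment_count` is a hypothetical helper name:

```shell
#!/bin/sh
# Estimate how many IP fragments an ICMP echo of a given payload size needs.
# Assumptions: 20-byte IPv4 header (no options), 8-byte ICMP header.
fragment_count() {
    payload=$1                            # the value given to ping -s
    mtu=${2:-1500}                        # assumed interface MTU
    per_frag=$(( (mtu - 20) / 8 * 8 ))    # data per fragment, a multiple of 8
    total=$(( payload + 8 ))              # ICMP header is part of the IP payload
    echo $(( (total + per_frag - 1) / per_frag ))
}
```

`fragment_count 8192 1500` prints 6, while `fragment_count 8192 9000` prints 1, so without jumbo frames the receiver has to reassemble six fragments for every echo request.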
netstat -s |grep reass
sleep 5
netstat -s |grep reass
-- The reassembly counters keep increasing, so the cause is probably here.
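The before/after comparison can be reduced to a single number. A small sketch, assuming the net-tools wording "packet reassembles failed" in the `netstat -s` output (other versions may word it differently); the function names are made up for illustration:

```shell
#!/bin/sh
# Read `netstat -s` text on stdin and print the "packet reassembles failed"
# counter (0 if the line is absent). The wording is assumed from net-tools.
reasm_failed() {
    awk '/packet reassembles failed/ { n = $1 } END { print n + 0 }'
}

# Print how much the failure counter grew over an interval (default 5 seconds).
reasm_delta() {
    before=$(netstat -s | reasm_failed)
    sleep "${1:-5}"
    after=$(netstat -s | reasm_failed)
    echo $(( after - before ))
}
```

A non-zero result from `reasm_delta 5` while the large ping is running would point at fragments being dropped during reassembly.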
echo 16777216 > /proc/sys/net/ipv4/ipfrag_high_thresh
echo 15728640 > /proc/sys/net/ipv4/ipfrag_low_thresh
echo 60 > /proc/sys/net/ipv4/ipfrag_time
Everything returned to normal.
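Values written under /proc are lost at reboot, so the settings also need to go into /etc/sysctl.conf to survive one. A minimal sketch that prints the corresponding lines; it assumes the usual kernel expectation that the low threshold does not exceed the high one, and `emit_ipfrag_sysctl` is a hypothetical helper name:

```shell
#!/bin/sh
# Print sysctl.conf lines for the IP fragment reassembly settings.
# Arguments: high threshold (bytes), low threshold (bytes), timeout (seconds).
emit_ipfrag_sysctl() {
    high=$1; low=$2; frag_time=$3
    # Refuse a low threshold above the high one (assumed kernel constraint).
    if [ "$low" -gt "$high" ]; then
        echo "ipfrag_low_thresh ($low) must not exceed ipfrag_high_thresh ($high)" >&2
        return 1
    fi
    printf 'net.ipv4.ipfrag_high_thresh = %s\n' "$high"
    printf 'net.ipv4.ipfrag_low_thresh = %s\n' "$low"
    printf 'net.ipv4.ipfrag_time = %s\n' "$frag_time"
}
```

As root, something like `emit_ipfrag_sysctl 16777216 15728640 60 >> /etc/sysctl.conf` followed by `sysctl -p` should make the change persistent; check the file for existing entries of the same keys first.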