Troubleshooting 11gR2 Grid Infrastructure: Node Cannot Rejoin the Cluster After Eviction, with Disk and Network Heartbeats Both Failing

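A while ago I analyzed a case in which node 2 was evicted and then could not rejoin the cluster. The logs from the rejoin attempts pointed to a network communication problem. The original eviction showed the same thing plus voting disk (VD) errors: CRS-1615 "No I/O has completed" alongside "Network communication missing", so the disk heartbeat and the network heartbeat failed at the same time. Yet the storage and the private network each have dual links, and they do not share a switch. What could make both fail together? A brief record follows.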

# node2 GI alert log

2019-09-17 02:10:06.619: 
[cssd(20050)]CRS-1612:Network communication with node node1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.870 seconds
2019-09-17 02:10:06.940: 
[cssd(20050)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/asm-diskj will be considered not functional in 7990 milliseconds
2019-09-17 02:10:06.940: 
[cssd(20050)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/asm-diskk will be considered not functional in 8110 milliseconds
2019-09-17 02:10:08.445: 
[cssd(20050)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/asm-diskj will be considered not functional in 6480 milliseconds
2019-09-17 02:10:08.445: 
[cssd(20050)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/asm-diskk will be considered not functional in 6600 milliseconds
2019-09-17 02:10:14.623: 
[cssd(20050)]CRS-1662:Member kill requested by node node1 for member number 1, group DBORCL
2019-09-17 02:11:41.644: 
[cssd(20050)]CRS-1611:Network communication with node node1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds
2019-09-17 02:11:45.645: 
[cssd(20050)]CRS-1610:Network communication with node node1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.620 seconds
2019-09-17 02:11:48.268: 
[cssd(20050)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:11:48.268: 
[cssd(20050)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log
2019-09-17 02:11:48.315: 

2019-09-17 02:13:57.761: 
[cssd(1248047)]CRS-1713:CSSD daemon is started in clustered mode
2019-09-17 02:14:13.489: 
[cssd(1248047)]CRS-1707:Lease acquisition for node node2 number 2 completed
2019-09-17 02:14:14.761: 
[cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diski; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:14.770: 
[cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diskj; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:14.779: 
[cssd(1248047)]CRS-1605:CSSD voting file is online: /dev/asm-diskk; details in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:33.309: 
[cssd(1248047)]CRS-1612:Network communication with node node1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.040 seconds
2019-09-17 02:14:40.310: 
[cssd(1248047)]CRS-1611:Network communication with node node1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.040 seconds
2019-09-17 02:14:45.312: 
[cssd(1248047)]CRS-1610:Network communication with node node1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.040 seconds
2019-09-17 02:14:47.355: 
[cssd(1248047)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log.
2019-09-17 02:14:47.355: 
[cssd(1248047)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/node2/cssd/ocssd.log
2019-09-17 02:14:47.387: 
[cssd(1248047)]CRS-1603:CSSD on node node2 shutdown by user.
2019-09-17 02:14:47.589: 
[cssd(1248047)]CRS-1660:The CSS daemon shutdown has completed
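
To read the heartbeat timeline without scrolling through the whole alert log, the message codes can be pulled out together with their timestamps. The sketch below is a hypothetical helper, not an Oracle tool. It assumes the log layout shown above (a timestamp line followed by the message line) and takes the alert log path as its argument; on 11.2 that is typically `$GRID_HOME/log/<hostname>/alert<hostname>.log`, which is an assumption to verify on your system.

```shell
#!/bin/bash
# Hypothetical helper: list network/disk heartbeat messages from a GI alert log.
# Prints "<date> <time> <CRS code>" for CRS-1609/1610/1611/1612/1614/1615.
css_hb_events() {
    awk '
        # A timestamp line such as "2019-09-17 02:10:06.619: " precedes each message.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            ts = $1 " " $2
            sub(/:$/, "", ts)      # drop the trailing colon
            next
        }
        # Keep only the heartbeat-related message codes.
        /CRS-16(09|10|11|12|14|15):/ {
            match($0, /CRS-[0-9]+/)
            print ts, substr($0, RSTART, RLENGTH)
        }
    ' "$1"
}
```

Calling `css_hb_events <alert log path>` on each node and comparing the two outputs side by side makes it easier to see whether the network and disk warnings start at the same moment.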

Diagnosis
# on node1
ping node2-priv-ip
traceroute node2-priv-ip

# on node2
ping node1-priv-ip
traceroute node1-priv-ip

-- The network tests showed no problems either.

cat /etc/*release*|head -n 3
-- RHEL 6.6

This trouble-prone OS version again. It reminded me of an old problem: https://www.anbob.com/archives/2851.html

ping 11.11.11.11 -s 8192
-- Sure enough, no response. Even without jumbo frames, the packet should simply be fragmented and sent.
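
Why an 8192-byte ping exercises fragmentation can be shown with a little arithmetic. The sketch below is illustrative only: `frag_count` is a hypothetical helper, and it assumes a standard 1500-byte MTU by default, a 20-byte IPv4 header without options, and the 8-byte ICMP header.

```shell
#!/bin/bash
# Hypothetical helper: how many IPv4 fragments one ICMP echo request needs.
# $1 = ping payload size in bytes (the -s value)
# $2 = interface MTU in bytes (assumed 1500 when omitted)
frag_count() {
    local payload=$1 mtu=${2:-1500}
    local ip_payload=$((payload + 8))          # ICMP header adds 8 bytes
    local per_frag=$(( (mtu - 20) / 8 * 8 ))   # minus 20-byte IP header, rounded down to a multiple of 8
    echo $(( (ip_payload + per_frag - 1) / per_frag ))   # ceiling division
}

frag_count 1472        # fits in a single packet at MTU 1500
frag_count 8192        # must be split into several fragments at MTU 1500
frag_count 8192 9000   # fits in a single packet with jumbo frames
```

Under these assumptions a normal-sized ping never touches the reassembly path on the receiver, which would explain why the plain `ping` test passes while the large one gets no reply.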

netstat -s |grep reass
sleep 5
netstat -s |grep reass

-- The reassembly counters keep increasing between the two samples, so this is probably where the cause lies.
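
The same comparison can be scripted so that the growth is printed as a number. This is a minimal sketch, assuming the counter of interest is `ReasmFails` in the `Ip:` section of `/proc/net/snmp` (the kernel counter for failed IP reassemblies); the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the current ReasmFails value.
# $1 = snmp statistics file (defaults to /proc/net/snmp)
reasm_fails() {
    awk '
        # The first "Ip:" line holds the column names; remember where ReasmFails is.
        $1 == "Ip:" && !seen {
            for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i
            seen = 1
            next
        }
        # The second "Ip:" line holds the values.
        $1 == "Ip:" && col { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# Hypothetical helper: print how much ReasmFails grew over an interval.
# $1 = interval in seconds (default 5), $2 = statistics file (optional)
reasm_fails_delta() {
    local before after
    before=$(reasm_fails "$2")
    sleep "${1:-5}"
    after=$(reasm_fails "$2")
    echo $((after - before))
}
```

Running `reasm_fails_delta 5` on both nodes while the large ping is in progress gives a quick indication: a result that stays above zero suggests fragments are still being dropped during reassembly.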

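-- Enlarge the IP fragment reassembly buffer (the high threshold must be above the low one) and set the fragment timeout
echo 16777216 > /proc/sys/net/ipv4/ipfrag_high_thresh
echo 15728640 > /proc/sys/net/ipv4/ipfrag_low_thresh
echo 60 > /proc/sys/net/ipv4/ipfrag_time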
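
Values written to `/proc/sys` are lost at reboot, so they also need to go into the sysctl configuration. The sketch below is one possible way to do that. `persist_ipfrag_settings` is a hypothetical helper, and the values keep the high threshold (16777216) above the low threshold (15728640), as the kernel expects.

```shell
#!/bin/bash
# Hypothetical helper: append the fragment reassembly settings to a sysctl
# configuration file, skipping any key that is already set there.
# $1 = configuration file, e.g. /etc/sysctl.conf
persist_ipfrag_settings() {
    local conf="$1" key value pattern
    while read -r key value; do
        pattern=${key//./\\.}    # escape dots so grep matches the key literally
        if ! grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$conf" 2>/dev/null; then
            echo "${key} = ${value}" >> "$conf"
        fi
    done <<'EOF'
net.ipv4.ipfrag_high_thresh 16777216
net.ipv4.ipfrag_low_thresh 15728640
net.ipv4.ipfrag_time 60
EOF
}
```

As root, `persist_ipfrag_settings /etc/sysctl.conf` followed by `sysctl -p` would store and reload the settings. Existing entries are left untouched, so any values already present in the file should be reviewed by hand.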

Everything returned to normal.
