Troubleshooting Oracle RAC: a Node Fails to Join the Cluster with “no network HB”
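Recently, in a customer's Oracle 12cR2 six-node RAC environment, several nodes could not start and rejoin the cluster after a split-brain. The logs showed the classic “has a disk HB, but no network HB”. Security-hardening requests have been frequent lately, so beware of over-restrictive lockdowns that break the RAC interconnect communication. This post briefly records how to analyze the symptoms of such a case.
Analysis steps:
1, Check CRS status and resources on all nodes
crsctl check crs
crsctl stat res -t
crsctl stat res -t -init
2, Check the software environment on the problem nodes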
cluvfy stage -post crsinst -n hract21,hract22
3, Check the logs
GI alert log
ocssd.log
crs.log
ASM alert.log
DB alert.log
…
4, If CSSD fails to start, enable ocssd debug logging
# $GRID_HOME/bin/crsctl set log css CSSD:3
Set CSSD Module: CSSD Log Level: 3
# $GRID_HOME/bin/crsctl get log css CSSD
Get CSSD Module: CSSD Log Level: 3
5, Check the CSS timeout values
# $GRID_HOME/bin/crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
6, ocssd log
$ grep -Ei 'Removal|evict|30000|network HB|splitbrain|aborting' $GRID_HOME/log/grac2/cssd/ocssd.log
$ grep -Ei 'fail|error|exception|fatal' $GRID_HOME/log/grac2/cssd/ocssd.log
alert.log:
2015-02-17 09:42:27.823 [OCSSD(15855)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
2015-02-17 09:42:27.824 [OCSSD(15855)]CRS-1603: CSSD on node hract21 shutdown by user.
2015-02-17 09:42:27.823 [CSSDAGENT(15844)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd_cssdagent_root.trc.
Tue Feb 17 09:42:32 2015
Errors in file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc (incident=2977):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc
2015-02-17 09:42:33.019 [OCSSD(15855)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID
15855 experienced fatal signal or exception code 6
Sweep [inc][2977]: completed
2015-02-17 09:42:38.005 [OHASD(11954)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00163:) {0:0:2} in /u01/app/grid/diag/crs/hract21/crs/trace/ohasd.trc.
ocssd.trc:
2015-02-17 09:42:32.451021 : CSSD:2417551104:
clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963949, LATS 92477974, lastSeqNo 963946, uniqueness 1424074596, timestamp 1424162551/21220694
2015-02-17 09:42:32.451113 : CSSD:2422281984:
clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963950, LATS 92477974, lastSeqNo 963947, uniqueness 1424074596, timestamp 1424162552/21220904
Trace file /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
DDE: Flood control is not active
CLSB:2467473152: Oracle Clusterware infrastructure error in OCSSD (OS PID 15855): Fatal signal 6 has occurred in program ocssd thread 2467473152; nested signal count is 1
Incident 2977 created, dump file: /u01/app/grid/diag/crs/hract21/crs/incident/incdir_2977/ocssd_i2977.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2015-02-17 09:42:33.108629 : CSSD:2450904832: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2015-02-17 09:42:33.451785 : CSSD:2417551104: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963952, LATS 92478974, lastSeqNo 963949, uniqueness 1424074596, timestamp 1424162552/21221694
2015-02-17 09:42:33.451933 : CSSD:2422281984: clssnmvDHBValidateNCopy: node 2, hract22, has a disk HB, but no network HB, DHB has rcfg 319544228, wrtcnt, 963953, LATS 92478974, lastSeqNo 963950, uniqueness 1424074596, timestamp 1424162553/21221904
--> Here we know that we have a networking problem
Note:
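The keywords “has a disk HB, but no network HB” and “OCSSD (OS PID 15855): Fatal signal 6” usually mean that the network is unreachable. (Only screenshots could be taken at the customer site, so the log output here comes from an identical case.)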
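When the trace is long, it can help to count how often each peer node is reported this way. The following is only a minimal sketch, not an Oracle-provided tool: the function name summarize_no_network_hb is hypothetical, and it assumes the message layout seen in the ocssd.trc excerpt.

```shell
# Hypothetical helper (not part of Grid Infrastructure): reads CSSD trace
# text on stdin and prints "<peer-node-name> <count>" for every node that
# is reported with a disk heartbeat but no network heartbeat.
summarize_no_network_hb() {
  grep 'has a disk HB, but no network HB' |
    sed -E 's/.*node [0-9]+, ([^,]+), has a disk HB.*/\1/' |
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example: summarize_no_network_hb < /u01/app/grid/diag/crs/hract21/crs/trace/ocssd.trc. A peer with a steadily growing count is a candidate for the interconnect checks in the following steps.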
7, Check the gpnp profile.xml
$GRID_HOME/bin/gpnptool get 2>/dev/null | xmllint --format - | egrep 'CSS-Profile|ASM-Profile|Network id'
8, Check the network
ping
traceroute
ifconfig xxx
ip addr
9, Multicast test
Oracle provides a multicast test script, mcasttest.pl; use it to confirm that multicast is enabled on both the OS and the network devices.
$ ./mcasttest.pl -n db01,db02 -i ib0,ib3
10, Check for network packet reassembly problems
Check whether the OS is failing to reassemble network packets. This has come up many times in earlier cases; see Troubleshooting 11gR2 Grid Infrastructure Node not Join the Cluster After Evicted error show disk and network HB failed
-- On all nodes
LINUX 6
netstat -s | grep reass
sleep 5
netstat -s | grep reass
LINUX 7
nstat -az | grep IpReasmFails
sleep 5
nstat -az | grep IpReasmFails
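Comparing the two samples by eye is error-prone, so the Linux 7 check can be wrapped in a small script that prints the increase directly. This is a rough sketch under stated assumptions: the function names reasm_fails and reasm_fail_delta are hypothetical, and it assumes nstat prints the counter name in the first column and its absolute value in the second.

```shell
# Hypothetical helper: extract the IpReasmFails counter value from
# `nstat -az` output supplied on stdin (name in column 1, value in column 2).
reasm_fails() {
  awk '$1 == "IpReasmFails" { print $2 }'
}

# Hypothetical helper: sample the counter twice, <interval> seconds apart
# (default 5), and print how much it grew in between.
reasm_fail_delta() {
  local interval="${1:-5}" before after
  before=$(nstat -az | reasm_fails)
  sleep "$interval"
  after=$(nstat -az | reasm_fails)
  # Treat a missing counter as 0 so the arithmetic never sees an empty value.
  echo $(( ${after:-0} - ${before:-0} ))
}
```

Run reasm_fail_delta 5 on every node. A result that keeps coming back non-zero suggests reassembly failures are still occurring on that node.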
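11, Check firewalls
Check whether a software or hardware firewall has changed the network policy, for example iptables or firewall services that start with the OS.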
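On the OS side, one quick way to spot rules that could block the interconnect is to filter the rule listing for drop or reject actions. The sketch below is an illustration only: blocking_rules is a hypothetical name, and it assumes the host uses iptables and that the rules are listed with iptables -S.

```shell
# Hypothetical filter: reads `iptables -S` output on stdin and keeps only
# chain policies set to DROP and rules whose target is DROP or REJECT.
blocking_rules() {
  grep -E -- '^-P [^ ]+ DROP$|-j (DROP|REJECT)( |$)'
}
```

As root, run iptables -S | blocking_rules on each node and check whether any remaining line applies to the private interconnect interface or subnet. An empty result only rules out the local iptables rules, not firewalls on the network path.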
12, Check the ASM disks
$GRID_HOME/bin/kfod disks=asm st=true ds=true cluster=true
13, Check the OS messages
Look for hardware failures, such as link down/up events.
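A simple filter over the system log can surface link state changes. This is a sketch with a guessed pattern: link_events is a hypothetical name, the wording of link messages varies by NIC driver, and /var/log/messages is assumed to be where the OS writes kernel messages.

```shell
# Hypothetical filter: reads syslog text on stdin and keeps lines that look
# like NIC link state changes. The pattern is an assumption and may need
# adjusting for the wording your NIC driver uses.
link_events() {
  grep -Ei 'link (is )?(down|up)'
}
```

For example: link_events < /var/log/messages. Repeated down/up pairs on the private interface around the eviction time point to a cabling, NIC, or switch port problem.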
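14, Network socket errors
For example, network socket files under /tmp are missing or have wrong permissions.
In this case, once the checks showed that the ASM disks were healthy, that sqlnet.ora had no whitelist configured (which could affect Flex ASM listener communication), and that the OS message log contained no hardware errors, the next thing to try was restarting the CRS stack. When that still failed, a traceroute to the private IP of a surviving node showed that it was unreachable. Combined with the keywords “no network HB” and “OCSSD (OS PID 15855): Fatal signal 6”, the preliminary diagnosis pointed to the following possibilities:
OS layer: iptables firewall
Network layer: access policies such as network firewalls
When asked, the network engineer confirmed that the network policies had just been adjusted. After all network restrictions on the private interconnect were disabled, the firewall in particular, the cluster returned to normal.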