首页 » ORACLE 9i-23ai » Crsd start fail and crsd.log show “Policy Engine is not initialized yet”& evmd.log show “[gipcretConnectionRefused] [29]”

Crsd start fail and crsd.log show “Policy Engine is not initialized yet”& evmd.log show “[gipcretConnectionRefused] [29]”

最近一套数据库的2节点半夜突然crash,被1节点驱逐, AGENT 启动DB失败,手动重启CRS启动失败,后来发现日志中的现象与MOS中多篇bug很像但又不是,节点2CRS启动失败,AIX环境,crs日志显示”Policy Engine is not initialized yet”,evmd.log 显示”[gipcretConnectionRefused] [29]”  , 这里记录一下,希望以后遇到相同问题时可以提供参考。

版本信息

root@dbnode2[/]#oslevel -s
6100-09-07-1614

root@dbnode2[/]#crsctl query crs releaseversion
Oracle High Availability Services release version on the local node is [11.2.0.3.0]

Node2 db alert log

2016-11-07 02:52:55.499000 +08:00
Thread 2 advanced to log sequence 5029 (LGWR switch)
  Current log# 7 seq# 5029 mem# 0: /dev/rzwc_redo07
2016-11-07 02:53:06.801000 +08:00
Archived Log entry 8513 added for thread 2 sequence 5028 ID 0xffffffffde6802cb dest 1:
2016-11-07 03:05:35.906000 +08:00
Restarting dead background process DIA0
DIA0 started with pid=984, OS id=22675762
2016-11-07 03:08:54.383000 +08:00
IPC Send timeout detected. Sender: ospid 2229640 [oracle@dbnode2 (PING)]
Receiver: inst 1 binc 185672221 ospid 3867476
2016-11-07 03:10:35.528000 +08:00
IPC Send timeout detected. Sender: ospid 3344084 [oracle@dbnode2 (LCK0)]
Receiver: inst 1 binc 2 ospid 7078268
2016-11-07 03:10:44.431000 +08:00
IPC Send timeout detected. Sender: ospid 16451218 [oracle@dbnode2]
Receiver: inst 1 binc 185672259 ospid 7733904
2016-11-07 03:10:46.432000 +08:00
IPC Send timeout to 1.2 inc 4 for msg type 65521 from opid 1778
IPC Send timeout: Terminating pid 1778 osid 16451218
2016-11-07 03:10:51.767000 +08:00
IPC Send timeout detected. Sender: ospid 2163290 [oracle@dbnode2 (LMS3)]
Receiver: inst 1 binc 185672267 ospid 8651308
2016-11-07 03:10:53.799000 +08:00
IPC Send timeout to 1.4 inc 4 for msg type 65518 from opid 16
Communications reconfiguration: instance_number 1
2016-11-07 03:11:42.460000 +08:00
Detected an inconsistent instance membership by instance 2
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
LMS0 (ospid: 2098886): terminating the instance due to error 481
System state dump requested by (instance=2, osid=2098886 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_3474386.trc

Note:
大致在3:06分左右出现了异常,随后几个GES/GCS进程异常IPC通信超时实例crash.

crs alert log

2016-11-07 01:21:42.171
[ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-11-07 02:03:52.583
[ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-11-07 02:39:54.942
[ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-11-07 03:08:45.295
[/oracle/app/11.2.0.3/grid/bin/orarootagent.bin(1836630)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:0:2} in /oracle/app/11.2.0.3/grid/log/dbnode2/agent/ohasd/orarootagent_root/orarootagent_
root.log.
2016-11-07 03:11:43.225
[/oracle/app/11.2.0.3/grid/bin/oraagent.bin(1442382)]CRS-5011:Check of resource "anbob" failed: details at "(:CLSN00007:)" in "/oracle/app/11.2.0.3/grid/log/dbnode2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2016-11-07 03:17:07.826
[cssd(1574190)]CRS-1662:Member kill requested by node dbnode1 for member number 1, group DBanbob
2016-11-07 03:18:57.354
[ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-11-07 03:24:45.377
[/oracle/app/11.2.0.3/grid/bin/orarootagent.bin(1836630)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:0:2} in /oracle/app/11.2.0.3/grid/log/dbnode2/agent/ohasd/orarootagent_root/orarootagent_
root.log.
2016-11-07 03:39:45.442
[ohasd(2622286)]CRS-2765:Resource 'ora.crsd' has failed on server 'dbnode2'.

CRS log 在重启前一直循环打印

2016-11-07 03:39:52.606: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 1625604158 ms, node 111749010 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 2e76d732-59d936ff, dstLuid 00000000-00000000
 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1625604158, sentRegister 0, localMonitor 0, flags 0x4 }
2016-11-07 03:39:52.606: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 11174d390 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 0, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.2
55.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x2 }
2016-11-07 03:39:53.606: [GIPCHTHR][2314] gipchaWorkerCreateInterface: created remote interface for node 'dbnode1', haName '74a2-a872-a861-8b63', inf 'udp://192.168.60.15:51892'
2016-11-07 03:39:53.606: [GIPCHGEN][2314] gipchaWorkerAttachInterface: Interface attached inf 11174d390 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 1117633f0, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.25
5.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }
2016-11-07 03:40:05.609: [GIPCHALO][2314] gipchaLowerProcessAcks: ESTABLISH finished for node 111749010 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 2e76d732-59d936ff, dstLuid 1ef7c1b6-3573a1f5 numInf 1, contigSeq 7, lastAc
k 0, lastValidAck 0, sendSeq [13 : 13], createTime 1625604158, sentRegister 1, localMonitor 0, flags 0x20c }
2016-11-07 03:40:05.609: [GIPCHALO][2314] gipchaLowerProcessWaitQ: triggering deffered startup of msg 110ef55f8 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLu
id 00000000-00000000, msgId 0 }, node 111749010 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 2e76d732-59d936ff, dstLuid 1ef7c1b6-3573a1f5 numInf 1, contigSeq 7, lastAck 0, lastValidAck 0, sendSeq [14 : 14], createTime 16256
04158, sentRegister 1, localMonitor 0, flags 0x20c }
2016-11-07 03:40:07.611: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '6742-d1a0-946e-b24b', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '87fc-
6da2-72ca-d7c0', luid '2e76d732-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2016-11-07 03:40:07.611: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '
87fc-6da2-72ca-d7c0', luid '2e76d732-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '6742-d1a0-946e-b24b', flags 0x0
2016-11-07 03:40:10.613: [  OCRMAS][3342]proath_master: SUCCESSFULLY CONNECTED TO THE MASTER
2016-11-07 03:40:10.613: [  OCRMAS][3342]th_master: NEW OCR MASTER IS 1
2016-11-07 03:40:10.613: [  OCRSRV][1]th_snap_local_spawn: Inside snap local spawn. host is [dbnode2]
2016-11-07 03:40:10.615: [ CRSMAIN][1] Running mode check...
2016-11-07 03:40:10.615: [ CRSMAIN][1] Running as user: root
2016-11-07 03:40:10.615: [ CRSMAIN][1] CRSD running as the Privileged user

... 
2016-11-07 08:26:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:26:48.903: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:27:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:27:48.906: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:28:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:28:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:29:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:29:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:30:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:30:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:31:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:31:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:32:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!

开始手动尝试启动数据库

尝试手动启CRS
Mon Nov 07 08:10:32 2016
Starting ORACLE instance (normal)
Specified value of sga_max_size is too small, bumping to 43486543872
...
  deferred_segment_creation= FALSE
  parallel_force_local     = TRUE
  _optimizer_join_factorization= FALSE
  _optimizer_use_cbqt_star_transformation= FALSE
  _optimizer_use_feedback  = FALSE
  diagnostic_dest          = "/oracle/app/oracle"
  max_dump_file_size       = "200M"
  trace_enabled            = FALSE
Cluster communication is configured to use the following interface(s) for this instance
  169.254.23.27
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Mon Nov 07 08:10:33 2016
PMON started with pid=2, OS id=23463648
Mon Nov 07 08:10:33 2016
PSP0 started with pid=3, OS id=16582414
Mon Nov 07 08:10:34 2016
VKTM started with pid=4, OS id=23790424 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
Mon Nov 07 08:10:34 2016
GEN0 started with pid=5, OS id=23331732
Mon Nov 07 08:10:34 2016
DIAG started with pid=6, OS id=23724950
...
...
...
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Mon Nov 07 08:10:50 2016
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Mon Nov 07 08:10:50 2016
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Mon Nov 07 08:10:50 2016
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted       <<<< 启动到这里时节点1 实例会出现hang
Mon Nov 07 08:14:59 2016
LMD0 (ospid: 23988070) received an instance eviction notification from instance 1 [2]
Mon Nov 07 08:14:59 2016
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x2
The instance eviction map is 2
Mon Nov 07 08:15:01 2016
PMON (ospid: 23463648): terminating the instance due to error 481
Mon Nov 07 08:15:02 2016
System state dump requested by (instance=2, osid=23463648 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_23724950.trc
Dumping diagnostic data in directory=[cdmp_20161107081502], requested by (instance=2, osid=23463648 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 23463648

Note:
An instance eviction message often appears before ora-481 or “error 481” in the alert.log, and the reason for eviction is often 2.

尝试手动重启CRS

2016-11-07 08:35:52.159: [ CRSMAIN][1] Connecting to the CSS Daemon
2016-11-07 08:35:52.159: [ CRSMAIN][1] Getting local node number
2016-11-07 08:35:52.161: [ CRSMAIN][1] Initializing OCR
[   CLWAL][1]clsw_Initialize: OLR initlevel [70000]
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: for disk 0 (/dev/rlv_crs2), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (7257), lsn (7257)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: my id set: (559948020, 1953421370, 0, 0, 0)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: 1st set: (559948020, 1028247821, 0, 0, 0)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: 2nd set: (559948020, 1953421370, 0, 0, 0)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: for disk 1 (/dev/rlv_crs1), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (7257), lsn (7257)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: my id set: (559948020, 1953421370, 0, 0, 0)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: 1st set: (559948020, 1028247821, 0, 0, 0)
2016-11-07 08:35:56.703: [  OCRRAW][1]proprioo: 2nd set: (559948020, 1953421370, 0, 0, 0)
2016-11-07 08:35:58.724: [  OCRSRV][1]th_init: Successfully retrieved CSS misscount [31].
2016-11-07 08:35:58.724: [  OCRSRV][1]th_init: Successfully query CLSS mode [3].
2016-11-07 08:36:00.710: [  OCRSRV][1]th_init:1: FROM PUBDATA Node num [1]Remote Listening Port [0] Cache invalidation port [0]
2016-11-07 08:36:00.710: [  OCRSRV][1]th_init:1.1: FROM PUBDATA Node num [1]CLSC Private IP or GIPC connect string [gipcha]
[  OCRMAS][1]th_calc_av:5': Rturn persisted AV [186647296] [11.2.0.3.0]
2016-11-07 08:36:00.720: [  OCRSRV][1]th_init: Decided to use GIPC
2016-11-07 08:36:00.720: [GIPCHGEN][2057] gipchaInternalRegister: Initializing HA GIPC
2016-11-07 08:36:00.720: [GIPCHGEN][2057] gipchaNodeCreate: adding new node 11168d470 { host '', haName '0eea-3e6e-b116-fc24', srcLuid 439aba68-00000000, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sen
dSeq [0 : 0], createTime 1643372268, sentRegister 0, localMonitor 0, flags 0x1 }
2016-11-07 08:36:00.720: [GIPCHTHR][2314] gipchaWorkerThread: starting worker thread hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 0, usrFla
gs 0x0, flags 0x0 }
2016-11-07 08:36:00.720: [GIPCHDEM][2571] gipchaDaemonThread: starting daemon thread hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 0, usrFla
gs 0x0, flags 0x0 }
2016-11-07 08:36:00.791: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 111763410 { host '', haName '0eea-3e6e-b116-fc24', local 0, ip '192.168.60.16', subnet '192.168.60.0', mask '255.255.255.0', mac '
7c-fe-90-cd-11-d4', ifname 'en35', numRef 0, numFail 0, idxBoot 0, flags 0x1 }
2016-11-07 08:36:00.993: [GIPCXCPT][2057] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '4708-4db2-78cb-53a4', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-
3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2016-11-07 08:36:00.993: [GIPCHGEN][2057] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '
0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '4708-4db2-78cb-53a4', flags 0x0
2016-11-07 08:36:00.993: [GIPCHTHR][2314] gipchaWorkerCreateInterface: created local interface for node 'dbnode2', haName '0eea-3e6e-b116-fc24', inf 'udp://192.168.60.16:60258'
2016-11-07 08:36:00.993: [  OCRMSG][2057]prom_listen: Port str [4708-4db2-78cb-53a4]
2016-11-07 08:36:00.993: [  OCRSRV][2057]proath_listen: listening to remote requests at portstr [4708-4db2-78cb-53a4]
2016-11-07 08:36:01.049: [GIPCXCPT][3085] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '4315-1ba3-1fd6-f2fe', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-
3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2016-11-07 08:36:01.049: [GIPCHGEN][3085] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '
0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '4315-1ba3-1fd6-f2fe', flags 0x0
2016-11-07 08:36:01.049: [  OCRMSG][3085]prom_listen: Port str [4315-1ba3-1fd6-f2fe]
2016-11-07 08:36:01.049: [  OCRSRV][3085]th_invalidate_cache: listening to cache_invalidation requests at portstr [4315-1ba3-1fd6-f2fe]
[  OCRMAS][3342]th_calc_av:5': Rturn persisted AV [186647296] [11.2.0.3.0]
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1307813993
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647296
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647296
2016-11-07 08:36:01.051: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha
2016-11-07 08:36:03.711: [  OCRMAS][3342]rcfg_con:1: Member [2] joined. Inc [6].
2016-11-07 08:36:03.711: [  OCRMAS][3342]proath_master: Master changing. cssctx->master [-1] new master [1] is_new_mastership [1] num_rcfg [1].
2016-11-07 08:36:03.711: [  OCRSRV][3342]th_not_master_change: Master change callback not registered
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1307813993
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647296
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647296
2016-11-07 08:36:03.711: [  OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha
2016-11-07 08:36:03.711: [  OCRMAS][3342]th_connect_master: Using GIPC type to connect
2016-11-07 08:36:03.711: [  OCRMAS][3342]th_connect_master:10: Master host name [dbnode1]
2016-11-07 08:36:03.711: [  OCRMAS][3342]proath_connect_master: Attempting to connect to master at address [dbnode1:0348-389c-de71-f8b0]
2016-11-07 08:36:03.712: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '7c73-6f83-0cf7-d835', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-
3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2016-11-07 08:36:03.712: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '
0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '7c73-6f83-0cf7-d835', flags 0x0
2016-11-07 08:36:22.151: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:36:52.158: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:37:22.158: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:37:32.748: [GIPCHGEN][2314] gipchaNodeCreate: adding new node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck
0, sendSeq [0 : 0], createTime 1643464296, sentRegister 0, localMonitor 0, flags 0x0 }
2016-11-07 08:37:32.748: [GIPCHALO][2314] gipchaLowerSend: deffering startup of hdr 111741038 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-000000
00, msgId 0 }, node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1643464296, sentRegister
0, localMonitor 0, flags 0x0 }
2016-11-07 08:37:32.748: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 1643464296 ms, node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000
 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1643464296, sentRegister 0, localMonitor 0, flags 0x4 }
2016-11-07 08:37:32.749: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 111755bf0 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 0, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.2
55.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x2 }
2016-11-07 08:37:33.748: [GIPCHTHR][2314] gipchaWorkerCreateInterface: created remote interface for node 'dbnode1', haName '74a2-a872-a861-8b63', inf 'udp://192.168.60.15:51892'
2016-11-07 08:37:33.748: [GIPCHGEN][2314] gipchaWorkerAttachInterface: Interface attached inf 111755bf0 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 111763410, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.25
5.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }
2016-11-07 08:37:52.168: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:38:02.748: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '7f22-cf7b-cf3a-c3a0', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-
3e6e-b116-fc24', luid '439aba68-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2016-11-07 08:38:02.748: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ]  failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '
0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '7f22-cf7b-cf3a-c3a0', flags 0x0
2016-11-07 08:38:03.761: [GIPCHALO][2314] gipchaLowerSend: deffering startup of hdr 111748578 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-000000
00, msgId 0 }, node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [30 : 30], createTime 1643464296, sentRegiste
r 1, localMonitor 0, flags 0x4 }
2016-11-07 08:38:22.168: [ CRSMAIN][515] Policy Engine is not initialized yet!
2016-11-07 08:38:33.761: [  OCRMAS][3342]proath_connect_master:1: could not yet connect to master  retval1 = 203, retval2 = 203
2016-11-07 08:38:33.761: [  OCRMAS][3342]th_master:110: Could not yet connect to new master [1]
2016-11-07 08:38:33.961: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2016-11-07 08:38:34.161: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2016-11-07 08:38:34.361: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2016-11-07 08:38:34.562: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2016-11-07 08:38:34.762: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2016-11-07 08:38:34.962: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2016-11-07 08:38:35.162: [  OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

Evmd log

2016-11-07 08:35:52.156: [  OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2016-11-07 08:35:52.156: [  OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
2016-11-07 08:35:52.156: [  OCRMSG][1]prom_connect: error while waiting for connection complete [24]
2016-11-07 08:35:52.156: [  CRSOCR][1] OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
2016-11-07 08:35:53.156: [    EVMD][1] Waiting for OCR. Seconds elapsed: 0
2016-11-07 08:35:53.157: [  OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2016-11-07 08:35:53.157: [  OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
2016-11-07 08:35:53.157: [  OCRMSG][1]prom_connect: error while waiting for connection complete [24]
2016-11-07 08:35:53.157: [  CRSOCR][1] OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
2016-11-07 08:35:54.158: [  OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2016-11-07 08:35:54.158: [  OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
2016-11-07 08:35:54.158: [  OCRMSG][1]prom_connect: error while waiting for connection complete [24]
2016-11-07 08:35:54.158: [  CRSOCR][1] OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
2016-11-07 08:35:55.158: [  OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2016-11-07 08:35:55.158: [  OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
2016-11-07 08:35:55.158: [  OCRMSG][1]prom_connect: error while waiting for connection complete [24]
2016-11-07 08:35:55.158: [  CRSOCR][1] OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

# /log//ohasd/ohasd.log 中不存在 ==>> partial header received messages

grid@dbnode2:/oracle/app/11.2.0.3/grid/log/dbnode2/crsd> crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager


grid@dbnode2:/oracle/app/11.2.0.3/grid/log/dbnode2/crsd> crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE                               Instance Shutdown   
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       dbnode2                                       
ora.crf
      1        OFFLINE OFFLINE                                                   
ora.crsd
      1        ONLINE  INTERMEDIATE dbnode2                                       
ora.cssd
      1        ONLINE  ONLINE       dbnode2                                       
ora.cssdmonitor
      1        ONLINE  ONLINE       dbnode2                                       
ora.ctssd
      1        ONLINE  ONLINE       dbnode2                   OBSERVER            
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  OFFLINE                                                   
ora.evmd
      1        ONLINE  INTERMEDIATE dbnode2                                       
ora.gipcd
      1        ONLINE  ONLINE       dbnode2                                       
ora.gpnpd
      1        ONLINE  ONLINE       dbnode2                                       
ora.mdnsd
      1        ONLINE  ONLINE       dbnode2   

root@dbnode2[/]#ps -ef|grep d.bin
    root  3866650        1   0 08:34:59      -  0:00 /oracle/app/11.2.0.3/grid/bin/cssdagent
    root 25886878        1   0 08:35:51      -  0:00 /oracle/app/11.2.0.3/grid/bin/crsd.bin reboot
    grid  2687378 14550800   0 08:34:59      -  0:00 /oracle/app/11.2.0.3/grid/bin/ocssd.bin
    grid 25100712        1   1 08:34:54      -  0:00 /oracle/app/11.2.0.3/grid/bin/oraagent.bin
    root 23331612        1   0 08:35:44      -  0:00 /oracle/app/11.2.0.3/grid/bin/orarootagent.bin
    root 23790588        1   0 08:34:58      -  0:00 /oracle/app/11.2.0.3/grid/bin/cssdmonitor
    grid  1312134        1   0   Sep 21      - 6291:10 /oracle/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit
    grid  2819350        1   0 08:34:54      -  0:00 /oracle/app/11.2.0.3/grid/bin/mdnsd.bin
    grid 13436568        1   0 08:34:55      -  0:00 /oracle/app/11.2.0.3/grid/bin/gpnpd.bin
    root 23463570 25297272   0 08:46:47 pts/10  0:00 grep d.bin
    root 23529150        1   0 08:35:45      -  0:00 /oracle/app/11.2.0.3/grid/bin/octssd.bin reboot
    grid  3016476        1   0 08:34:57      -  0:00 /oracle/app/11.2.0.3/grid/bin/gipcd.bin
    root 14550800        1   0 08:34:59      -  0:00 /bin/sh /oracle/app/11.2.0.3/grid/bin/ocssd
    grid 16582588        1   0 08:35:51      -  0:00 /oracle/app/11.2.0.3/grid/bin/evmd.bin
    root 24053592        1   0 08:34:47      -  0:01 /oracle/app/11.2.0.3/grid/bin/ohasd.bin reboot
     

检查顺序
1, CPU,MEMORY 主机资源正常
2, df 文件目录使用率正常
3, 两节点互Ping private IP 正常
4, IP,mark 地址正常,且未调防火墙及网络环境
5, 检查了数据库/GI sqlnet.ora文件未配置tcp.validnode_checking
6, traceroute private IP 异常

traceroute to 192.168.60.15 (192.168.60.15) from 192.168.60.16 (192.168.60.16), 30 hops max
outgoing MTU = 1500
 1  * * *
 2  * * *
 3  dbnode1-priv (192.168.60.15)  0 ms * *

通常排查网络问题常用ping和traceroute,但是对于这个案例是ping正常,traceroute 会出现超时, 推荐Understanding the Ping and Traceroute Commands , 解决方法切换了私网的网卡,traceroute结果正常数据,CRS启动恢复正常。

打赏

对不起,这篇文章暂时关闭评论。