Crsd start fail and crsd.log show “Policy Engine is not initialized yet”& evmd.log show “[gipcretConnectionRefused] [29]”
最近一套数据库的2节点半夜突然crash,被1节点驱逐, AGENT 启动DB失败,手动重启CRS启动失败,后来发现日志中的现象与MOS中多篇bug很像但又不是,节点2CRS启动失败,AIX环境,crs日志显示”Policy Engine is not initialized yet”,evmd.log 显示”[gipcretConnectionRefused] [29]” , 这里记录一下,希望以后遇到相同问题时可以提供参考。
版本信息
root@dbnode2[/]#oslevel -s 6100-09-07-1614 root@dbnode2[/]#crsctl query crs releaseversion Oracle High Availability Services release version on the local node is [11.2.0.3.0]
Node2 db alert log
2016-11-07 02:52:55.499000 +08:00 Thread 2 advanced to log sequence 5029 (LGWR switch) Current log# 7 seq# 5029 mem# 0: /dev/rzwc_redo07 2016-11-07 02:53:06.801000 +08:00 Archived Log entry 8513 added for thread 2 sequence 5028 ID 0xffffffffde6802cb dest 1: 2016-11-07 03:05:35.906000 +08:00 Restarting dead background process DIA0 DIA0 started with pid=984, OS id=22675762 2016-11-07 03:08:54.383000 +08:00 IPC Send timeout detected. Sender: ospid 2229640 [oracle@dbnode2 (PING)] Receiver: inst 1 binc 185672221 ospid 3867476 2016-11-07 03:10:35.528000 +08:00 IPC Send timeout detected. Sender: ospid 3344084 [oracle@dbnode2 (LCK0)] Receiver: inst 1 binc 2 ospid 7078268 2016-11-07 03:10:44.431000 +08:00 IPC Send timeout detected. Sender: ospid 16451218 [oracle@dbnode2] Receiver: inst 1 binc 185672259 ospid 7733904 2016-11-07 03:10:46.432000 +08:00 IPC Send timeout to 1.2 inc 4 for msg type 65521 from opid 1778 IPC Send timeout: Terminating pid 1778 osid 16451218 2016-11-07 03:10:51.767000 +08:00 IPC Send timeout detected. Sender: ospid 2163290 [oracle@dbnode2 (LMS3)] Receiver: inst 1 binc 185672267 ospid 8651308 2016-11-07 03:10:53.799000 +08:00 IPC Send timeout to 1.4 inc 4 for msg type 65518 from opid 16 Communications reconfiguration: instance_number 1 2016-11-07 03:11:42.460000 +08:00 Detected an inconsistent instance membership by instance 2 Received an instance abort message from instance 1 Please check instance 1 alert and LMON trace files for detail. LMS0 (ospid: 2098886): terminating the instance due to error 481 System state dump requested by (instance=2, osid=2098886 (LMS0)), summary=[abnormal instance termination]. System State dumped to trace file /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_3474386.trc
Note:
大致在3:06分左右出现了异常,随后几个GES/GCS进程异常IPC通信超时实例crash.
crs alert log
2016-11-07 01:21:42.171 [ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2016-11-07 02:03:52.583 [ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2016-11-07 02:39:54.942 [ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2016-11-07 03:08:45.295 [/oracle/app/11.2.0.3/grid/bin/orarootagent.bin(1836630)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:0:2} in /oracle/app/11.2.0.3/grid/log/dbnode2/agent/ohasd/orarootagent_root/orarootagent_ root.log. 2016-11-07 03:11:43.225 [/oracle/app/11.2.0.3/grid/bin/oraagent.bin(1442382)]CRS-5011:Check of resource "anbob" failed: details at "(:CLSN00007:)" in "/oracle/app/11.2.0.3/grid/log/dbnode2/agent/crsd/oraagent_oracle/oraagent_oracle.log" 2016-11-07 03:17:07.826 [cssd(1574190)]CRS-1662:Member kill requested by node dbnode1 for member number 1, group DBanbob 2016-11-07 03:18:57.354 [ctssd(3147414)]CRS-2409:The clock on host dbnode2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2016-11-07 03:24:45.377 [/oracle/app/11.2.0.3/grid/bin/orarootagent.bin(1836630)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:0:2} in /oracle/app/11.2.0.3/grid/log/dbnode2/agent/ohasd/orarootagent_root/orarootagent_ root.log. 2016-11-07 03:39:45.442 [ohasd(2622286)]CRS-2765:Resource 'ora.crsd' has failed on server 'dbnode2'.
CRS log 在重启前一直循环打印
2016-11-07 03:39:52.606: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 1625604158 ms, node 111749010 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 2e76d732-59d936ff, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1625604158, sentRegister 0, localMonitor 0, flags 0x4 } 2016-11-07 03:39:52.606: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 11174d390 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 0, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.2 55.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x2 } 2016-11-07 03:39:53.606: [GIPCHTHR][2314] gipchaWorkerCreateInterface: created remote interface for node 'dbnode1', haName '74a2-a872-a861-8b63', inf 'udp://192.168.60.15:51892' 2016-11-07 03:39:53.606: [GIPCHGEN][2314] gipchaWorkerAttachInterface: Interface attached inf 11174d390 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 1117633f0, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.25 5.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 } 2016-11-07 03:40:05.609: [GIPCHALO][2314] gipchaLowerProcessAcks: ESTABLISH finished for node 111749010 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 2e76d732-59d936ff, dstLuid 1ef7c1b6-3573a1f5 numInf 1, contigSeq 7, lastAc k 0, lastValidAck 0, sendSeq [13 : 13], createTime 1625604158, sentRegister 1, localMonitor 0, flags 0x20c } 2016-11-07 03:40:05.609: [GIPCHALO][2314] gipchaLowerProcessWaitQ: triggering deffered startup of msg 110ef55f8 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLu id 00000000-00000000, msgId 0 }, node 111749010 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 2e76d732-59d936ff, dstLuid 1ef7c1b6-3573a1f5 numInf 1, contigSeq 7, lastAck 0, lastValidAck 0, sendSeq [14 : 14], createTime 16256 04158, sentRegister 1, localMonitor 0, flags 0x20c } 2016-11-07 03:40:07.611: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '6742-d1a0-946e-b24b', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '87fc- 6da2-72ca-d7c0', luid '2e76d732-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2016-11-07 03:40:07.611: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name ' 87fc-6da2-72ca-d7c0', luid '2e76d732-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '6742-d1a0-946e-b24b', flags 0x0 2016-11-07 03:40:10.613: [ OCRMAS][3342]proath_master: SUCCESSFULLY CONNECTED TO THE MASTER 2016-11-07 03:40:10.613: [ OCRMAS][3342]th_master: NEW OCR MASTER IS 1 2016-11-07 03:40:10.613: [ OCRSRV][1]th_snap_local_spawn: Inside snap local spawn. host is [dbnode2] 2016-11-07 03:40:10.615: [ CRSMAIN][1] Running mode check... 2016-11-07 03:40:10.615: [ CRSMAIN][1] Running as user: root 2016-11-07 03:40:10.615: [ CRSMAIN][1] CRSD running as the Privileged user ... 2016-11-07 08:26:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:26:48.903: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:27:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:27:48.906: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:28:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:28:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:29:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:29:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:30:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:30:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:31:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:31:48.902: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:32:18.902: [ CRSMAIN][515] Policy Engine is not initialized yet!
开始手动尝试启动数据库
尝试手动启CRS Mon Nov 07 08:10:32 2016 Starting ORACLE instance (normal) Specified value of sga_max_size is too small, bumping to 43486543872 ... deferred_segment_creation= FALSE parallel_force_local = TRUE _optimizer_join_factorization= FALSE _optimizer_use_cbqt_star_transformation= FALSE _optimizer_use_feedback = FALSE diagnostic_dest = "/oracle/app/oracle" max_dump_file_size = "200M" trace_enabled = FALSE Cluster communication is configured to use the following interface(s) for this instance 169.254.23.27 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 Mon Nov 07 08:10:33 2016 PMON started with pid=2, OS id=23463648 Mon Nov 07 08:10:33 2016 PSP0 started with pid=3, OS id=16582414 Mon Nov 07 08:10:34 2016 VKTM started with pid=4, OS id=23790424 at elevated priority VKTM running at (10)millisec precision with DBRM quantum (100)ms Mon Nov 07 08:10:34 2016 GEN0 started with pid=5, OS id=23331732 Mon Nov 07 08:10:34 2016 DIAG started with pid=6, OS id=23724950 ... ... ... LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Mon Nov 07 08:10:50 2016 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Mon Nov 07 08:10:50 2016 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Mon Nov 07 08:10:50 2016 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted <<<< 启动到这里时节点1 实例会出现hang Mon Nov 07 08:14:59 2016 LMD0 (ospid: 23988070) received an instance eviction notification from instance 1 [2] Mon Nov 07 08:14:59 2016 LMON received an instance eviction notification from instance 1 The instance eviction reason is 0x2 The instance eviction map is 2 Mon Nov 07 08:15:01 2016 PMON (ospid: 23463648): terminating the instance due to error 481 Mon Nov 07 08:15:02 2016 System state dump requested by (instance=2, osid=23463648 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_23724950.trc Dumping diagnostic data in directory=[cdmp_20161107081502], requested by (instance=2, osid=23463648 (PMON)), summary=[abnormal instance termination]. Instance terminated by PMON, pid = 23463648
Note:
An instance eviction message often appears before ora-481 or “error 481” in the alert.log, and the reason for eviction is often 2.
尝试手动重启CRS
2016-11-07 08:35:52.159: [ CRSMAIN][1] Connecting to the CSS Daemon 2016-11-07 08:35:52.159: [ CRSMAIN][1] Getting local node number 2016-11-07 08:35:52.161: [ CRSMAIN][1] Initializing OCR [ CLWAL][1]clsw_Initialize: OLR initlevel [70000] 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: for disk 0 (/dev/rlv_crs2), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (7257), lsn (7257) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: my id set: (559948020, 1953421370, 0, 0, 0) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: 1st set: (559948020, 1028247821, 0, 0, 0) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: 2nd set: (559948020, 1953421370, 0, 0, 0) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: for disk 1 (/dev/rlv_crs1), id match (1), total id sets, (2) need recover (0), my votes (1), total votes (2), commit_lsn (7257), lsn (7257) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: my id set: (559948020, 1953421370, 0, 0, 0) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: 1st set: (559948020, 1028247821, 0, 0, 0) 2016-11-07 08:35:56.703: [ OCRRAW][1]proprioo: 2nd set: (559948020, 1953421370, 0, 0, 0) 2016-11-07 08:35:58.724: [ OCRSRV][1]th_init: Successfully retrieved CSS misscount [31]. 2016-11-07 08:35:58.724: [ OCRSRV][1]th_init: Successfully query CLSS mode [3]. 2016-11-07 08:36:00.710: [ OCRSRV][1]th_init:1: FROM PUBDATA Node num [1]Remote Listening Port [0] Cache invalidation port [0] 2016-11-07 08:36:00.710: [ OCRSRV][1]th_init:1.1: FROM PUBDATA Node num [1]CLSC Private IP or GIPC connect string [gipcha] [ OCRMAS][1]th_calc_av:5': Rturn persisted AV [186647296] [11.2.0.3.0] 2016-11-07 08:36:00.720: [ OCRSRV][1]th_init: Decided to use GIPC 2016-11-07 08:36:00.720: [GIPCHGEN][2057] gipchaInternalRegister: Initializing HA GIPC 2016-11-07 08:36:00.720: [GIPCHGEN][2057] gipchaNodeCreate: adding new node 11168d470 { host '', haName '0eea-3e6e-b116-fc24', srcLuid 439aba68-00000000, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sen dSeq [0 : 0], createTime 1643372268, sentRegister 0, localMonitor 0, flags 0x1 } 2016-11-07 08:36:00.720: [GIPCHTHR][2314] gipchaWorkerThread: starting worker thread hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 0, usrFla gs 0x0, flags 0x0 } 2016-11-07 08:36:00.720: [GIPCHDEM][2571] gipchaDaemonThread: starting daemon thread hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 0, usrFla gs 0x0, flags 0x0 } 2016-11-07 08:36:00.791: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 111763410 { host '', haName '0eea-3e6e-b116-fc24', local 0, ip '192.168.60.16', subnet '192.168.60.0', mask '255.255.255.0', mac ' 7c-fe-90-cd-11-d4', ifname 'en35', numRef 0, numFail 0, idxBoot 0, flags 0x1 } 2016-11-07 08:36:00.993: [GIPCXCPT][2057] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '4708-4db2-78cb-53a4', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea- 3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2016-11-07 08:36:00.993: [GIPCHGEN][2057] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name ' 0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '4708-4db2-78cb-53a4', flags 0x0 2016-11-07 08:36:00.993: [GIPCHTHR][2314] gipchaWorkerCreateInterface: created local interface for node 'dbnode2', haName '0eea-3e6e-b116-fc24', inf 'udp://192.168.60.16:60258' 2016-11-07 08:36:00.993: [ OCRMSG][2057]prom_listen: Port str [4708-4db2-78cb-53a4] 2016-11-07 08:36:00.993: [ OCRSRV][2057]proath_listen: listening to remote requests at portstr [4708-4db2-78cb-53a4] 2016-11-07 08:36:01.049: [GIPCXCPT][3085] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '4315-1ba3-1fd6-f2fe', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea- 3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2016-11-07 08:36:01.049: [GIPCHGEN][3085] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name ' 0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '4315-1ba3-1fd6-f2fe', flags 0x0 2016-11-07 08:36:01.049: [ OCRMSG][3085]prom_listen: Port str [4315-1ba3-1fd6-f2fe] 2016-11-07 08:36:01.049: [ OCRSRV][3085]th_invalidate_cache: listening to cache_invalidation requests at portstr [4315-1ba3-1fd6-f2fe] [ OCRMAS][3342]th_calc_av:5': Rturn persisted AV [186647296] [11.2.0.3.0] 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1307813993 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647296 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647296 2016-11-07 08:36:01.051: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha 2016-11-07 08:36:03.711: [ OCRMAS][3342]rcfg_con:1: Member [2] joined. Inc [6]. 2016-11-07 08:36:03.711: [ OCRMAS][3342]proath_master: Master changing. cssctx->master [-1] new master [1] is_new_mastership [1] num_rcfg [1]. 2016-11-07 08:36:03.711: [ OCRSRV][3342]th_not_master_change: Master change callback not registered 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_ocrid=1307813993 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_pubdata: begin dumping pubdatactx->prom_pubdata_prom_con 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_con: promcon->cache_invalidation_port=0 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_con: promcon->remote_listening_port=0 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_con: promcon->local_listening_port=0 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_pubdata: end dumping pubdatactx->prom_pubdata_prom_con 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_software_version=186647296 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_active_version=186647296 2016-11-07 08:36:03.711: [ OCRMAS][3342]prom_dump_pubdata: pubdatactx->prom_pubdata_priv_nodename=gipcha 2016-11-07 08:36:03.711: [ OCRMAS][3342]th_connect_master: Using GIPC type to connect 2016-11-07 08:36:03.711: [ OCRMAS][3342]th_connect_master:10: Master host name [dbnode1] 2016-11-07 08:36:03.711: [ OCRMAS][3342]proath_connect_master: Attempting to connect to master at address [dbnode1:0348-389c-de71-f8b0] 2016-11-07 08:36:03.712: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '7c73-6f83-0cf7-d835', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea- 3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2016-11-07 08:36:03.712: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name ' 0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 0, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '7c73-6f83-0cf7-d835', flags 0x0 2016-11-07 08:36:22.151: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:36:52.158: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:37:22.158: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:37:32.748: [GIPCHGEN][2314] gipchaNodeCreate: adding new node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1643464296, sentRegister 0, localMonitor 0, flags 0x0 } 2016-11-07 08:37:32.748: [GIPCHALO][2314] gipchaLowerSend: deffering startup of hdr 111741038 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-000000 00, msgId 0 }, node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1643464296, sentRegister 0, localMonitor 0, flags 0x0 } 2016-11-07 08:37:32.748: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 1643464296 ms, node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 1643464296, sentRegister 0, localMonitor 0, flags 0x4 } 2016-11-07 08:37:32.749: [GIPCHGEN][2571] gipchaNodeAddInterface: adding interface information for inf 111755bf0 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 0, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.2 55.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x2 } 2016-11-07 08:37:33.748: [GIPCHTHR][2314] gipchaWorkerCreateInterface: created remote interface for node 'dbnode1', haName '74a2-a872-a861-8b63', inf 'udp://192.168.60.15:51892' 2016-11-07 08:37:33.748: [GIPCHGEN][2314] gipchaWorkerAttachInterface: Interface attached inf 111755bf0 { host 'dbnode1', haName '74a2-a872-a861-8b63', local 111763410, ip '192.168.60.15:51892', subnet '192.168.60.0', mask '255.255.25 5.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 } 2016-11-07 08:37:52.168: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:38:02.748: [GIPCXCPT][3342] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'dbnode2', port '7f22-cf7b-cf3a-c3a0', hctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name '0eea- 3e6e-b116-fc24', luid '439aba68-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36) 2016-11-07 08:38:02.748: [GIPCHGEN][3342] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 804]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 110eceb10 [0000000000000010] { gipchaContext : host 'dbnode2', name ' 0eea-3e6e-b116-fc24', luid '439aba68-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, host 'dbnode2', port '7f22-cf7b-cf3a-c3a0', flags 0x0 2016-11-07 08:38:03.761: [GIPCHALO][2314] gipchaLowerSend: deffering startup of hdr 111748578 { len 232, seq 0, type gipchaHdrTypeSend (1), lastSeq 0, lastAck 0, minAck 0, flags 0x0, srcLuid 00000000-00000000, dstLuid 00000000-000000 00, msgId 0 }, node 111784eb0 { host 'dbnode1', haName '74a2-a872-a861-8b63', srcLuid 439aba68-737599a2, dstLuid 00000000-00000000 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [30 : 30], createTime 1643464296, sentRegiste r 1, localMonitor 0, flags 0x4 } 2016-11-07 08:38:22.168: [ CRSMAIN][515] Policy Engine is not initialized yet! 2016-11-07 08:38:33.761: [ OCRMAS][3342]proath_connect_master:1: could not yet connect to master retval1 = 203, retval2 = 203 2016-11-07 08:38:33.761: [ OCRMAS][3342]th_master:110: Could not yet connect to new master [1] 2016-11-07 08:38:33.961: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203 2016-11-07 08:38:34.161: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203 2016-11-07 08:38:34.361: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203 2016-11-07 08:38:34.562: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203 2016-11-07 08:38:34.762: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203 2016-11-07 08:38:34.962: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203 2016-11-07 08:38:35.162: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
Evmd log
2016-11-07 08:35:52.156: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2) 2016-11-07 08:35:52.156: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused] 2016-11-07 08:35:52.156: [ OCRMSG][1]prom_connect: error while waiting for connection complete [24] 2016-11-07 08:35:52.156: [ CRSOCR][1] OCR context init failure. Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29] 2016-11-07 08:35:53.156: [ EVMD][1] Waiting for OCR. Seconds elapsed: 0 2016-11-07 08:35:53.157: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2) 2016-11-07 08:35:53.157: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused] 2016-11-07 08:35:53.157: [ OCRMSG][1]prom_connect: error while waiting for connection complete [24] 2016-11-07 08:35:53.157: [ CRSOCR][1] OCR context init failure. Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29] 2016-11-07 08:35:54.158: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2) 2016-11-07 08:35:54.158: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused] 2016-11-07 08:35:54.158: [ OCRMSG][1]prom_connect: error while waiting for connection complete [24] 2016-11-07 08:35:54.158: [ CRSOCR][1] OCR context init failure. Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29] 2016-11-07 08:35:55.158: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2) 2016-11-07 08:35:55.158: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused] 2016-11-07 08:35:55.158: [ OCRMSG][1]prom_connect: error while waiting for connection complete [24] 2016-11-07 08:35:55.158: [ CRSOCR][1] OCR context init failure. Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
# /log//ohasd/ohasd.log 中不存在 ==>> partial header received messages
grid@dbnode2:/oracle/app/11.2.0.3/grid/log/dbnode2/crsd> crsctl check crs CRS-4638: Oracle High Availability Services is online CRS-4535: Cannot communicate with Cluster Ready Services CRS-4529: Cluster Synchronization Services is online CRS-4534: Cannot communicate with Event Manager grid@dbnode2:/oracle/app/11.2.0.3/grid/log/dbnode2/crsd> crsctl stat res -t -init -------------------------------------------------------------------------------- NAME TARGET STATE SERVER STATE_DETAILS -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.asm 1 OFFLINE OFFLINE Instance Shutdown ora.cluster_interconnect.haip 1 ONLINE ONLINE dbnode2 ora.crf 1 OFFLINE OFFLINE ora.crsd 1 ONLINE INTERMEDIATE dbnode2 ora.cssd 1 ONLINE ONLINE dbnode2 ora.cssdmonitor 1 ONLINE ONLINE dbnode2 ora.ctssd 1 ONLINE ONLINE dbnode2 OBSERVER ora.diskmon 1 OFFLINE OFFLINE ora.drivers.acfs 1 ONLINE OFFLINE ora.evmd 1 ONLINE INTERMEDIATE dbnode2 ora.gipcd 1 ONLINE ONLINE dbnode2 ora.gpnpd 1 ONLINE ONLINE dbnode2 ora.mdnsd 1 ONLINE ONLINE dbnode2 root@dbnode2[/]#ps -ef|grep d.bin root 3866650 1 0 08:34:59 - 0:00 /oracle/app/11.2.0.3/grid/bin/cssdagent root 25886878 1 0 08:35:51 - 0:00 /oracle/app/11.2.0.3/grid/bin/crsd.bin reboot grid 2687378 14550800 0 08:34:59 - 0:00 /oracle/app/11.2.0.3/grid/bin/ocssd.bin grid 25100712 1 1 08:34:54 - 0:00 /oracle/app/11.2.0.3/grid/bin/oraagent.bin root 23331612 1 0 08:35:44 - 0:00 /oracle/app/11.2.0.3/grid/bin/orarootagent.bin root 23790588 1 0 08:34:58 - 0:00 /oracle/app/11.2.0.3/grid/bin/cssdmonitor grid 1312134 1 0 Sep 21 - 6291:10 /oracle/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit grid 2819350 1 0 08:34:54 - 0:00 /oracle/app/11.2.0.3/grid/bin/mdnsd.bin grid 13436568 1 0 08:34:55 - 0:00 /oracle/app/11.2.0.3/grid/bin/gpnpd.bin root 23463570 25297272 0 08:46:47 pts/10 0:00 grep d.bin root 23529150 1 0 08:35:45 - 0:00 /oracle/app/11.2.0.3/grid/bin/octssd.bin reboot grid 3016476 1 0 08:34:57 - 0:00 /oracle/app/11.2.0.3/grid/bin/gipcd.bin root 14550800 1 0 08:34:59 - 0:00 /bin/sh /oracle/app/11.2.0.3/grid/bin/ocssd grid 16582588 1 0 08:35:51 - 0:00 /oracle/app/11.2.0.3/grid/bin/evmd.bin root 24053592 1 0 08:34:47 - 0:01 /oracle/app/11.2.0.3/grid/bin/ohasd.bin reboot
检查顺序
1, CPU,MEMORY 主机资源正常
2, df 文件目录使用率正常
3, 两节点互Ping private IP 正常
4, IP,mark 地址正常,且未调防火墙及网络环境
5, 检查了数据库/GI sqlnet.ora文件未配置tcp.validnode_checking
6, traceroute private IP 异常
traceroute to 192.168.60.15 (192.168.60.15) from 192.168.60.16 (192.168.60.16), 30 hops max outgoing MTU = 1500 1 * * * 2 * * * 3 dbnode1-priv (192.168.60.15) 0 ms * *
通常排查网络问题常用ping和traceroute,但是对于这个案例是ping正常,traceroute 会出现超时, 推荐Understanding the Ping and Traceroute Commands , 解决方法切换了私网的网卡,traceroute结果正常数据,CRS启动恢复正常。
对不起,这篇文章暂时关闭评论。