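19c (19.4) RAC Crash: CSSD ASSERT clsssc.c Error on LinuxONE
This is an Oracle 19.4 RAC environment running on IBM LinuxONE mainframe hardware (zLinux). The node restarted several times, and the cause has now been confirmed as an Oracle code-level bug: cssd wrongly recorded a reference count of 0, and the next ASSERT on that resource terminated the daemon abnormally. A while ago a friend told me that a well-known third-party company was going around claiming that 19.4 is currently the more stable release of Oracle 19c, on the grounds that the current 19.8, for example, had turned up 2 bugs in testing. I wonder which release does not have a few hundred bugs. As Oracle says, they fixed a large number of memory-level errors in 19.10, so for now I still recommend installing the latest RU (19.11). I am recording the problem here in the hope of saving some time for anyone who hits the same issue.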
# db alert log
2021-06-02T19:14:47.503983+08:00
Thread 1 advanced to log sequence 121034 (LGWR switch)
Current log# 19 seq# 121034 mem# 0: +DATADG/ANBOB/ONLINELOG/group_19.361.1024938473
Current log# 19 seq# 121034 mem# 1: +DATADG/ANBOB/ONLINELOG/group_19.362.1024938475
2021-06-02T19:14:48.105272+08:00
ARC0 (PID:11845): Archived Log entry 156893 added for T-1.S-121033 ID 0x1ef6d43e LAD:1
2021-06-02T19:41:41.676208+08:00
Thread 1 advanced to log sequence 121035 (LGWR switch)
Current log# 15 seq# 121035 mem# 0: +DATADG/ANBOB/ONLINELOG/group_15.353.1024938399
Current log# 15 seq# 121035 mem# 1: +DATADG/ANBOB/ONLINELOG/group_15.354.1024938401
2021-06-02T19:41:42.260992+08:00
ARC1 (PID:11855): Archived Log entry 156894 added for T-1.S-121034 ID 0x1ef6d43e LAD:1
2021-06-02T20:14:41.815924+08:00
Thread 1 advanced to log sequence 121036 (LGWR switch)
Current log# 16 seq# 121036 mem# 0: +DATADG/ANBOB/ONLINELOG/group_16.355.1024938469
Current log# 16 seq# 121036 mem# 1: +DATADG/ANBOB/ONLINELOG/group_16.356.1024938469
2021-06-02T20:14:42.395033+08:00
ARC2 (PID:11857): Archived Log entry 156896 added for T-1.S-121035 ID 0x1ef6d43e LAD:1
2021-06-02T20:30:28.723688+08:00
PMON (ospid: ): terminating the instance due to ORA error
2021-06-02T20:30:28.730845+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 47, ospid: 11500, FENC)'
2021-06-02T20:30:35.244667+08:00
ORA-1092 : opitsk aborting process
2021-06-02T20:30:35.408915+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2021-06-02T20:30:39.473503+08:00
Instance terminated by PMON, pid = 10617
2021-06-02T20:33:48.594207+08:00
Starting ORACLE instance (normal) (OS id: 10286)

# lmhb trace log
*** 2021-06-02T20:30:21.082828+08:00
Hang Manager: Health: System:90, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0
*** 2021-06-02T20:30:23.032771+08:00
Hang Manager: Health: System:91, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0
*** 2021-06-02T20:30:25.132866+08:00
Hang Manager: Health: System:90, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0
*** 2021-06-02T20:30:27.082521+08:00
Hang Manager: Health: System:90, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0

# crsd log
2021-06-02 20:30:13.484 :GIPCHTHR:2470443280: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30047 loopCount 60 sendCount 12 recvCount 36 postCount 12 sendCmplCount 12 recvCmplCount 12
2021-06-02 20:30:25.645 :UiServer:1292880144: [ INFO] {1:24054:17982} Container [ Name: FENCESERVER API_HDR_VER: TextMessage[3] CLIENT: TextMessage[] CLIENT_NAME: TextMessage[ocssd.bin] CLIENT_PID: TextMessage[7987] CLIENT_PRIMARY_GROUP: TextMessage[asmadmin] LOCALE: TextMessage[AMERICAN_AMERICA.AL32UTF8] ]
2021-06-02 20:30:25.645 :UiServer:1292880144: [ INFO] {1:24054:17982} Sending message to AGFW. ctx= 0x3fec8050240, Client PID: 7987
2021-06-02 20:30:25.645 : OCRAPI:1292880144: procr_beg_asmshut: OCR ctx set to donotterminate state. Return [0].
2021-06-02 20:30:25.645 :UiServer:1292880144: [ INFO] {1:24054:17982} Force-disconnecting [21] existing PE clients...
2021-06-02 20:30:25.645 :UiServer:1292880144: [ INFO] {1:24054:17982} Disconnecting client of command id :27
2021-06-02 20:30:25.647 :UiServer:1292880144: [ INFO] {1:24054:17982} Disconnecting client of command id :42
2021-06-02 20:30:25.647 :UiServer:1292880144: [ INFO] {1:24054:17982} Disconnecting client of command id :45

# ocssd trace log
2021-06-02 20:30:20.982 : CSSD:1599600912: [ INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:46:2
2021-06-02 20:30:22.286 : CSSD:397408528: [ INFO] clssnmSendingThread: sending status msg to all nodes
2021-06-02 20:30:22.286 : CSSD:397408528: [ INFO] clssnmSendingThread: sent 4 status msgs to all nodes
2021-06-02 20:30:22.603 : CSSD:1599600912: [ INFO] clssgmcpGroupDataResp: Completed request with sequence number(534314) for clientID 1:47:0
2021-06-02 20:30:22.603 : CSSD:1599600912: [ INFO] clssgmcpGroupDataResp: sending type 5, size 167, status 0 to clientID 1:47:0
2021-06-02 20:30:23.021 : CSSD:1602746640: [ INFO] : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:23.021 : CSSD:1602746640: [ INFO] : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:23.022 : CSSD:2621958416: [ INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4
2021-06-02 20:30:23.022 : CSSD:1599600912: [ INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:46:2
2021-06-02 20:30:23.146 : CSSD:1605892368: [ INFO] clssgmpcGMCReqWorkerThread: processing msg (0x3ff78043b60) type 2, msg size 76, payload (0x3ff78043b8c) size 32, sequence 3381384, for clientID 1:46:2
2021-06-02 20:30:25.071 : CSSD:1602746640: [ INFO] : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:25.071 : CSSD:1602746640: [ INFO] : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:25.072 : CSSD:2621958416: [ INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4
2021-06-02 20:30:25.072 : CSSD:1599600912: [ INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:46:2
2021-06-02 20:30:25.581 : CSSD:1602746640: [ INFO] : Processing member data change type 1, size 600 for group GR+DB_ANBOB, memberID 150:2:1
2021-06-02 20:30:25.581 : CSSD:1602746640: [ INFO] : Sending member data change to GMP for group GR+DB_ANBOB, memberID 150:2:1
2021-06-02 20:30:25.582 : CSSD:2621958416: [ INFO] clssgmpcMemberDataUpdt: grockName GR+DB_ANBOB memberID 150:2:1, datatype 1 datasize 600
2021-06-02 20:30:25.582 : CSSD:1599600912: [ INFO] clssgmSendEventsToMbrs: Group GR+DB_ANBOB, member count 1, event master 0, event type 6, event incarn 49279, event member count 1, pids 11463-14023,
2021-06-02 20:30:25.582 : CSSD:1599600912: [ INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 150:2:1 from clientID 1:84:2
2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssgmTermMember: Terminating memberID 150:2:1 (0x3ff38075040) in grock GR+DB_ANBOB
2021-06-02 20:30:25.582 : CSSD:1602746640: ASSERT clsssc.c 8607
2021-06-02 20:30:25.582 : CSSD:1602746640: clssscRefFree: object(0x3ff40440be0) has 0 reference prior to decrement, object may have been deallocated!
2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssnmCheckForNetworkFailure: Entered
2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssnmCheckForNetworkFailure: skipping 0 defined 0
2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssnmCheckForNetworkFailure: expiring 0 evicted 0 evicting node 0 this node 1
2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssnmCheckForNetworkFailure: expiring 0 evicted 0 evicting node 0 this node 2
2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssnmCheckForNetworkFailure: skipping 3 defined 0
... ...
2021-06-02 20:30:25.583 : CSSD:1602746640: [ INFO] clssnmCheckForNetworkFailure: skipping 31 defined 0
2021-06-02 20:30:25.583 : CSSD:1602746640: [ INFO] clssscExit: Call to clscal flush successful and clearing the CLSSSCCTX_INIT_CALOG flag so that no further CA logging happens
2021-06-02 20:30:25.583 : CSSD:1602746640: [ INFO] clssnmRemoveNodeInTerm: node 1, m1anbob1 terminated. Removing from its own member and connected bitmaps
2021-06-02 20:30:25.583 : CSSD:1602746640: [ ERROR] ###################################
2021-06-02 20:30:25.583 : CSSD:1602746640: [ ERROR] clssscExit: CSSD aborting from thread GMClientListener
2021-06-02 20:30:25.583 : CSSD:1602746640: [ ERROR] ###################################
2021-06-02 20:30:25.583 : CSSD:1602746640: [ INFO] (:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2021-06-02 20:30:25.584 : CSSD:1602746640: [ INFO] ####### Begin Diagnostic Dump #######
2021-06-02 20:30:25.584 : CSSD:1602746640: [ INFO] ### Begin diagnostic data for the Core layer ###
2021-06-02 20:30:25.584 : CSSD:1602746640: [ INFO] Initialization successfully completed OK
2021-06-02 20:30:25.584 : CSSD:1602746640: [ INFO] Initialization of EXADATA fencing successfully completed OK
2021-06-02 20:30:25.584 : CSSD:1602746640: [ INFO] #### End diagnostic data for the Core layer ####
2021-06-02 20:30:25.584 : CSSD:1602746640: [ INFO] ### Begin diagnostic data for the GM Client layer ###
2021-06-02 20:30:25.586 : CSSD:1602746640: Status for clientID 1:27137:1, pid(6840-300924879), GIPC endpt 0x113b7fef, flags 0x0002, refcount 2, aborted at 0, fence is not progress OK
2021-06-02 20:30:25.586 : CSSD:1602746640: Status for clientID 1:27137:2, pid(6840-300924879), GIPC endpt 0x113b8005, flags 0x0002, refcount 2, aborted at 0, fence is not progress OK
2021-06-02 20:30:25.586 : CSSD:1602746640: Status for clientID 1:28167:1, pid(36775-316594856), GIPC endpt 0x11f1a165, flags 0x0002, refcount 2, aborted at 0, fence is not progress OK
...
2021-06-02 20:30:25.591 : CSSD:1602746640: [ INFO] #### End diagnostic data for the NM layer ####
2021-06-02 20:30:25.591 : CSSD:1602746640: [ INFO] ######## End Diagnostic Dump ########
2021-06-02 20:30:25.591 : CSSD:1602746640: ----- Call Stack Trace -----
2021-06-02 20:30:25.591 : CSSD:1602746640: calling call entry argument values in hex
2021-06-02 20:30:25.591 : CSSD:1602746640: location type point (? means dubious value)
2021-06-02 20:30:25.591 : CSSD:1602746640: -------------------- -------- -------------------- ----------------------------
2021-06-02 20:30:25.592 : CSSD:1602746640: ssdgetcall: Failure to recover Stack Trace: starting frame address is (nil)
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscExit()+1860 call kgdsdst()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscAssert()+210 call clssscExit()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscRefFreeInt()+ call clssscAssert()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssgmTermMember()+ call clssscRefFreeInt()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssgmcClientDestroy()+266 call clssgmTermMember()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscHTDestroyObj()+334 call clssgmcClientDestroy()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscHTRefDestroyObj()+50 call clssscHTDestroyObj()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscRefFreeInt()+ call clssscHTRefDestroyObj()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssgmclienteventhndlr() call clssscRefFreeInt()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscSelect()+1568 call clssgmclienteventhndlr()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssgmProcClientReqs()+2204 call clssscSelect()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssgmclientlsnr()+ call clssgmProcClientReqs()
2021-06-02 20:30:25.593 : CSSD:1602746640: clssscthrdmain()+26 call clssgmclientlsnr()
2021-06-02 20:30:25.593 : CSSD:1602746640: start_thread()+234 call clssscthrdmain()
2021-06-02 20:30:25.593 : CSSD:1602746640:
2021-06-02 20:30:25.593 : CSSD:1602746640: --------------------- Binary Stack Dump ---------------------
2021-06-02 20:30:25.593 : CSSD:1602746640:
2021-06-02 20:30:25.593 : CSSD:1602746640: ========== FRAME [1] (clssscExit()+1860 -> kgdsdst()) ==========
2021-06-02 20:30:25.593 : CSSD:1602746640: defined by frame pointers 0x3ff5f859978 and 0x3ff5f8324f0
2021-06-02 20:30:25.593 : CSSD:1602746640: CALL TYPE: call ERROR SIGNALED: no CALLER: clssscExit
Note:
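While releasing the resource 0x3ff40440be0, the cssd process found that its reference count held the wrong value 0, and the resulting assert terminated the process abnormally. The call chain is clssscRefFreeInt > clssscHTRefDestroyObj > clssscHTDestroyObj > … clssscRefFreeInt > clssscAssert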
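To make the failure mode concrete, here is a minimal Python sketch of a reference-counted object whose release path refuses to decrement a count that is already 0. It is only an illustration of the general technique: the class `RefCounted` and its methods are hypothetical names, and nothing here is taken from Oracle's clsssc.c.

```python
class RefCounted:
    """Toy reference-counted object (hypothetical, not Oracle's implementation)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1       # the creator holds the first reference
        self.destroyed = False

    def acquire(self):
        """Take one more reference on the object."""
        self.refcount += 1

    def release(self):
        """Drop one reference; destroy the object when the last one is gone."""
        # A count of 0 *before* the decrement means a caller is releasing a
        # reference it does not hold, so the object may already be freed.
        if self.refcount == 0:
            raise AssertionError(
                f"{self.name} has 0 reference prior to decrement, "
                "object may have been deallocated!"
            )
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True


member = RefCounted("member")
member.release()            # count 1 -> 0, the object is destroyed
try:
    member.release()        # a second release on a count of 0 trips the check
except AssertionError as exc:
    print("ASSERT:", exc)
```

In this toy version the exception is caught, whereas in the trace above the assert leads to clssscExit and the daemon aborts.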
Solution
Bug 31992657. Apply the patch.
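Before requesting the patch, it can help to check whether an ocssd trace contains the same two lines seen in this case. The following Python sketch scans trace lines for them. The function name `find_refcount_asserts` and the regular expressions are my own assumptions built from the log excerpt in this post, and a match only shows a similar signature, not that the bug is the same.

```python
import re

# Patterns derived from the two trace lines quoted in this post (assumptions).
ASSERT_RE = re.compile(r"ASSERT clsssc\.c \d+")
REFFREE_RE = re.compile(
    r"clssscRefFree: object\(0x[0-9a-fA-F]+\) has 0 reference prior to decrement"
)


def find_refcount_asserts(lines):
    """Return (line_number, text) pairs for lines matching either pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if ASSERT_RE.search(line) or REFFREE_RE.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Sample lines copied from the ocssd trace in this post.
sample = [
    "2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] clssgmTermMember: "
    "Terminating memberID 150:2:1 (0x3ff38075040) in grock GR+DB_ANBOB",
    "2021-06-02 20:30:25.582 : CSSD:1602746640: ASSERT clsssc.c 8607",
    "2021-06-02 20:30:25.582 : CSSD:1602746640: clssscRefFree: "
    "object(0x3ff40440be0) has 0 reference prior to decrement, "
    "object may have been deallocated!",
    "2021-06-02 20:30:25.582 : CSSD:1602746640: [ INFO] "
    "clssnmCheckForNetworkFailure: Entered",
]

for lineno, text in find_refcount_asserts(sample):
    print(lineno, text)
```

For a real trace, pass an open file object instead of `sample`; the trace file location depends on the installation.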