19c (19.4) RAC: CSSD crash with ASSERT clsssc.c error on LinuxONE

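This is an Oracle 19.4 RAC environment running on an IBM LinuxONE mainframe (zLinux). The nodes restarted several times, and the cause has now been confirmed as a code-level Oracle bug: cssd wrongly recorded a reference count of 0 for an object, so the next release of that object tripped an ASSERT and the daemon terminated abnormally. A while ago a friend mentioned that a well-known third-party company has been telling everyone that 19.4 is one of the more stable Oracle 19c releases at the moment, citing for example the 2 bugs found when testing the current 19.8. My reaction was: which release does not have a few hundred bugs? Oracle itself says it fixed a large number of memory-level errors in 19.10, so for now I still recommend installing the latest RU (19.11). I am recording the problem here in the hope that it saves some time for anyone who hits the same issue.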

# db alert log

2021-06-02T19:14:47.503983+08:00
Thread 1 advanced to log sequence 121034 (LGWR switch)
  Current log# 19 seq# 121034 mem# 0: +DATADG/ANBOB/ONLINELOG/group_19.361.1024938473
  Current log# 19 seq# 121034 mem# 1: +DATADG/ANBOB/ONLINELOG/group_19.362.1024938475
2021-06-02T19:14:48.105272+08:00
ARC0 (PID:11845): Archived Log entry 156893 added for T-1.S-121033 ID 0x1ef6d43e LAD:1
2021-06-02T19:41:41.676208+08:00
Thread 1 advanced to log sequence 121035 (LGWR switch)
  Current log# 15 seq# 121035 mem# 0: +DATADG/ANBOB/ONLINELOG/group_15.353.1024938399
  Current log# 15 seq# 121035 mem# 1: +DATADG/ANBOB/ONLINELOG/group_15.354.1024938401
2021-06-02T19:41:42.260992+08:00
ARC1 (PID:11855): Archived Log entry 156894 added for T-1.S-121034 ID 0x1ef6d43e LAD:1
2021-06-02T20:14:41.815924+08:00
Thread 1 advanced to log sequence 121036 (LGWR switch)
  Current log# 16 seq# 121036 mem# 0: +DATADG/ANBOB/ONLINELOG/group_16.355.1024938469
  Current log# 16 seq# 121036 mem# 1: +DATADG/ANBOB/ONLINELOG/group_16.356.1024938469
2021-06-02T20:14:42.395033+08:00
ARC2 (PID:11857): Archived Log entry 156896 added for T-1.S-121035 ID 0x1ef6d43e LAD:1
2021-06-02T20:30:28.723688+08:00
PMON (ospid: ): terminating the instance due to ORA error 
2021-06-02T20:30:28.730845+08:00
Cause - 'Instance is being terminated due to fatal process death (pid: 47, ospid: 11500, FENC)'
2021-06-02T20:30:35.244667+08:00
ORA-1092 : opitsk aborting process
2021-06-02T20:30:35.408915+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2021-06-02T20:30:39.473503+08:00
Instance terminated by PMON, pid = 10617
2021-06-02T20:33:48.594207+08:00
Starting ORACLE instance (normal) (OS id: 10286)



# lmhb trace log
*** 2021-06-02T20:30:21.082828+08:00
Hang Manager: Health: System:90, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0

*** 2021-06-02T20:30:23.032771+08:00
Hang Manager: Health: System:91, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0

*** 2021-06-02T20:30:25.132866+08:00
Hang Manager: Health: System:90, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0

*** 2021-06-02T20:30:27.082521+08:00
Hang Manager: Health: System:90, Cluster:0. Hangs: Local:0, Global:0
Hang Manager: Current no of hangs 1, no of impacted sessions - hangs:0 deadlocks:0

# crsd log
2021-06-02 20:30:13.484 :GIPCHTHR:2470443280:  gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30047 loopCount 60 sendCount 12 recvCount 36 postCount 12 sendCmplCount 12 recvCmplCount 12
2021-06-02 20:30:25.645 :UiServer:1292880144: [     INFO] {1:24054:17982} Container [ Name: FENCESERVER
	API_HDR_VER: 
	TextMessage[3]
	CLIENT: 
	TextMessage[]
	CLIENT_NAME: 
	TextMessage[ocssd.bin]
	CLIENT_PID: 
	TextMessage[7987]
	CLIENT_PRIMARY_GROUP: 
	TextMessage[asmadmin]
	LOCALE: 
	TextMessage[AMERICAN_AMERICA.AL32UTF8]
]
2021-06-02 20:30:25.645 :UiServer:1292880144: [     INFO] {1:24054:17982} Sending message to AGFW. ctx= 0x3fec8050240, Client PID: 7987
2021-06-02 20:30:25.645 :  OCRAPI:1292880144: procr_beg_asmshut: OCR ctx set to donotterminate state. Return [0].
2021-06-02 20:30:25.645 :UiServer:1292880144: [     INFO] {1:24054:17982} Force-disconnecting [21]  existing PE clients...
2021-06-02 20:30:25.645 :UiServer:1292880144: [     INFO] {1:24054:17982} Disconnecting client of command id :27
2021-06-02 20:30:25.647 :UiServer:1292880144: [     INFO] {1:24054:17982} Disconnecting client of command id :42
2021-06-02 20:30:25.647 :UiServer:1292880144: [     INFO] {1:24054:17982} Disconnecting client of command id :45


# ocssd trace log
2021-06-02 20:30:20.982 :    CSSD:1599600912: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:46:2
2021-06-02 20:30:22.286 :    CSSD:397408528: [     INFO] clssnmSendingThread: sending status msg to all nodes
2021-06-02 20:30:22.286 :    CSSD:397408528: [     INFO] clssnmSendingThread: sent 4 status msgs to all nodes
2021-06-02 20:30:22.603 :    CSSD:1599600912: [     INFO] clssgmcpGroupDataResp: Completed request with sequence number(534314) for clientID 1:47:0
2021-06-02 20:30:22.603 :    CSSD:1599600912: [     INFO] clssgmcpGroupDataResp: sending type 5, size 167, status 0 to clientID 1:47:0
2021-06-02 20:30:23.021 :    CSSD:1602746640: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:23.021 :    CSSD:1602746640: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:23.022 :    CSSD:2621958416: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4
2021-06-02 20:30:23.022 :    CSSD:1599600912: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:46:2
2021-06-02 20:30:23.146 :    CSSD:1605892368: [     INFO] clssgmpcGMCReqWorkerThread: processing msg (0x3ff78043b60) type 2, msg size 76, payload (0x3ff78043b8c) size 32, sequence 3381384, for clientID 1:46:2
2021-06-02 20:30:25.071 :    CSSD:1602746640: [     INFO]   : Processing member data change type 1, size 4 for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:25.071 :    CSSD:1602746640: [     INFO]   : Sending member data change to GMP for group HB+ASM, memberID 17:2:1
2021-06-02 20:30:25.072 :    CSSD:2621958416: [     INFO] clssgmpcMemberDataUpdt: grockName HB+ASM memberID 17:2:1, datatype 1 datasize 4
2021-06-02 20:30:25.072 :    CSSD:1599600912: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 17:2:1 from clientID 1:46:2
2021-06-02 20:30:25.581 :    CSSD:1602746640: [     INFO]   : Processing member data change type 1, size 600 for group GR+DB_ANBOB, memberID 150:2:1
2021-06-02 20:30:25.581 :    CSSD:1602746640: [     INFO]   : Sending member data change to GMP for group GR+DB_ANBOB, memberID 150:2:1
2021-06-02 20:30:25.582 :    CSSD:2621958416: [     INFO] clssgmpcMemberDataUpdt: grockName GR+DB_ANBOB memberID 150:2:1, datatype 1 datasize 600
2021-06-02 20:30:25.582 :    CSSD:1599600912: [     INFO] clssgmSendEventsToMbrs: Group GR+DB_ANBOB, member count 1, event master 0, event type 6, event incarn 49279, event member count 1, pids 11463-14023,  
2021-06-02 20:30:25.582 :    CSSD:1599600912: [     INFO] clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 150:2:1 from clientID 1:84:2
2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssgmTermMember: Terminating memberID 150:2:1 (0x3ff38075040) in grock GR+DB_ANBOB
2021-06-02 20:30:25.582 :    CSSD:1602746640: ASSERT clsssc.c 8607
2021-06-02 20:30:25.582 :    CSSD:1602746640: clssscRefFree: object(0x3ff40440be0) has 0 reference prior to decrement, object may have been deallocated!
2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssnmCheckForNetworkFailure: Entered 
2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssnmCheckForNetworkFailure: skipping 0 defined 0 
2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssnmCheckForNetworkFailure: expiring 0  evicted 0 evicting node 0 this node 1
2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssnmCheckForNetworkFailure: expiring 0  evicted 0 evicting node 0 this node 2
2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssnmCheckForNetworkFailure: skipping 3 defined 0 
...
...
2021-06-02 20:30:25.583 :    CSSD:1602746640: [     INFO] clssnmCheckForNetworkFailure: skipping 31 defined 0 
2021-06-02 20:30:25.583 :    CSSD:1602746640: [     INFO] clssscExit: Call to clscal flush successful and clearing the CLSSSCCTX_INIT_CALOG flag so that no further CA logging happens
2021-06-02 20:30:25.583 :    CSSD:1602746640: [     INFO] clssnmRemoveNodeInTerm: node 1, m1anbob1 terminated. Removing from its own member and connected bitmaps
2021-06-02 20:30:25.583 :    CSSD:1602746640: [    ERROR] ###################################
2021-06-02 20:30:25.583 :    CSSD:1602746640: [    ERROR] clssscExit: CSSD aborting from thread GMClientListener
2021-06-02 20:30:25.583 :    CSSD:1602746640: [    ERROR] ###################################
2021-06-02 20:30:25.583 :    CSSD:1602746640: [     INFO] (:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2021-06-02 20:30:25.584 :    CSSD:1602746640: [     INFO] ####### Begin Diagnostic Dump #######
2021-06-02 20:30:25.584 :    CSSD:1602746640: [     INFO] ### Begin diagnostic data for the Core layer ###
2021-06-02 20:30:25.584 :    CSSD:1602746640: [     INFO] Initialization successfully completed OK
2021-06-02 20:30:25.584 :    CSSD:1602746640: [     INFO] Initialization of EXADATA fencing successfully completed OK
2021-06-02 20:30:25.584 :    CSSD:1602746640: [     INFO] #### End diagnostic data for the Core layer ####
2021-06-02 20:30:25.584 :    CSSD:1602746640: [     INFO] ### Begin diagnostic data for the GM Client layer ###
2021-06-02 20:30:25.586 :    CSSD:1602746640: Status for clientID 1:27137:1, pid(6840-300924879), GIPC endpt 0x113b7fef, flags 0x0002, refcount 2, aborted at 0, fence is not progress   OK
2021-06-02 20:30:25.586 :    CSSD:1602746640: Status for clientID 1:27137:2, pid(6840-300924879), GIPC endpt 0x113b8005, flags 0x0002, refcount 2, aborted at 0, fence is not progress   OK
2021-06-02 20:30:25.586 :    CSSD:1602746640: Status for clientID 1:28167:1, pid(36775-316594856), GIPC endpt 0x11f1a165, flags 0x0002, refcount 2, aborted at 0, fence is not progress   OK

...
2021-06-02 20:30:25.591 :    CSSD:1602746640: [     INFO] #### End diagnostic data for the NM layer ####
2021-06-02 20:30:25.591 :    CSSD:1602746640: [     INFO] ######## End Diagnostic Dump ########
2021-06-02 20:30:25.591 :    CSSD:1602746640: 

----- Call Stack Trace -----
2021-06-02 20:30:25.591 :    CSSD:1602746640: calling              call     entry                argument values in hex      
2021-06-02 20:30:25.591 :    CSSD:1602746640: location             type     point                (? means dubious value)     
2021-06-02 20:30:25.591 :    CSSD:1602746640: -------------------- -------- -------------------- ----------------------------
2021-06-02 20:30:25.592 :    CSSD:1602746640: ssdgetcall: Failure to recover Stack Trace: starting frame address is (nil)
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscExit()+1860    call     kgdsdst()            
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscAssert()+210   call     clssscExit()         
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscRefFreeInt()+  call     clssscAssert()       
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssgmTermMember()+  call     clssscRefFreeInt()   
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssgmcClientDestroy()+266  call     clssgmTermMember()   
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscHTDestroyObj()+334  call     clssgmcClientDestroy()  
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscHTRefDestroyObj()+50  call     clssscHTDestroyObj()  
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscRefFreeInt()+  call     clssscHTRefDestroyObj()  
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssgmclienteventhndlr()  call     clssscRefFreeInt()   
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscSelect()+1568  call     clssgmclienteventhndlr()  
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssgmProcClientReqs()+2204  call     clssscSelect()       
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssgmclientlsnr()+  call     clssgmProcClientReqs()  
2021-06-02 20:30:25.593 :    CSSD:1602746640: clssscthrdmain()+26  call     clssgmclientlsnr()   
2021-06-02 20:30:25.593 :    CSSD:1602746640: start_thread()+234   call     clssscthrdmain()     
2021-06-02 20:30:25.593 :    CSSD:1602746640:  
2021-06-02 20:30:25.593 :    CSSD:1602746640: --------------------- Binary Stack Dump ---------------------
2021-06-02 20:30:25.593 :    CSSD:1602746640:  
2021-06-02 20:30:25.593 :    CSSD:1602746640: ========== FRAME [1] (clssscExit()+1860 -> kgdsdst()) ==========
2021-06-02 20:30:25.593 :    CSSD:1602746640: defined by frame pointers 0x3ff5f859978  and 0x3ff5f8324f0
2021-06-02 20:30:25.593 :    CSSD:1602746640: CALL TYPE: call   ERROR SIGNALED: no   CALLER: clssscExit

Note:
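While releasing the object 0x3ff40440be0, the cssd process found that its reference count already held the incorrect value 0. The assertion failed and the daemon terminated abnormally. The call path is clssscRefFreeInt > clssscHTRefDestroyObj > clssscHTDestroyObj > …. clssscRefFreeInt > clssscAssert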
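To make the failure mode concrete, here is a minimal toy model of a reference-counted object that checks its count before decrementing, which is what the `clssscRefFree` message in the trace describes. This is only an illustrative sketch: the class and method names (`RefCounted`, `acquire`, `release`, `RefCountError`) are hypothetical and say nothing about how Oracle's `clsssc.c` is actually implemented.

```python
class RefCountError(AssertionError):
    """Raised when release() is called on an object whose count is already 0."""


class RefCounted:
    """Toy reference-counted object (hypothetical, not Oracle code)."""

    def __init__(self, name):
        self.name = name
        self.refs = 1           # the creator holds the first reference
        self.destroyed = False

    def acquire(self):
        self.refs += 1

    def release(self):
        # Check the count BEFORE decrementing, as the trace message suggests.
        if self.refs == 0:
            raise RefCountError(
                f"{self.name} has 0 reference prior to decrement, "
                "object may have been deallocated!"
            )
        self.refs -= 1
        if self.refs == 0:
            self.destroyed = True   # last reference gone, object is torn down


member = RefCounted("member")
member.acquire()        # refs: 2
member.release()        # refs: 1
member.release()        # refs: 0, object destroyed
try:
    member.release()    # one release too many: the check fires
except RefCountError as exc:
    failure = str(exc)
```

In this sketch the extra `release()` raises an exception that the caller can catch. In the trace above, the equivalent check calls `clssscAssert`, which goes on to `clssscExit`, so the whole CSS daemon aborts.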

Solution

Bug 31992657. Apply the patch.
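Before applying the patch, it can be useful to confirm that an ocssd trace shows the same signature. The following is a rough sketch that scans trace lines for an `ASSERT <file> <line>` entry immediately followed by the `clssscRefFree` zero-reference message. The function name `find_refcount_asserts` is made up for this example, and the regular expressions are derived only from the two lines shown in the trace above, so other versions or platforms may format these lines differently.

```python
import re

# Patterns inferred from the trace excerpt in this post (an assumption,
# not a documented log format).
ASSERT_RE = re.compile(r"CSSD:(\d+): ASSERT (\S+) (\d+)")
REFFREE_RE = re.compile(
    r"clssscRefFree: object\((0x[0-9a-fA-F]+)\) has 0 reference"
)


def find_refcount_asserts(lines):
    """Return one dict per ASSERT line, with the object address if the
    very next line is the clssscRefFree zero-reference message."""
    hits = []
    pending = None
    for line in lines:
        m = ASSERT_RE.search(line)
        if m:
            pending = {
                "timestamp": line[:23],      # e.g. 2021-06-02 20:30:25.582
                "thread": m.group(1),
                "source": m.group(2),
                "line": int(m.group(3)),
                "object": None,
            }
            hits.append(pending)
            continue
        m = REFFREE_RE.search(line)
        if m and pending is not None:
            pending["object"] = m.group(1)
        pending = None                       # only pair with the next line
    return hits


# Three lines copied from the ocssd trace above.
SAMPLE = [
    "2021-06-02 20:30:25.582 :    CSSD:1602746640: [     INFO] clssgmTermMember: Terminating memberID 150:2:1 (0x3ff38075040) in grock GR+DB_ANBOB",
    "2021-06-02 20:30:25.582 :    CSSD:1602746640: ASSERT clsssc.c 8607",
    "2021-06-02 20:30:25.582 :    CSSD:1602746640: clssscRefFree: object(0x3ff40440be0) has 0 reference prior to decrement, object may have been deallocated!",
]

hits = find_refcount_asserts(SAMPLE)
for hit in hits:
    print(hit)
```

To run it against a real trace, pass an open file object instead of `SAMPLE`, since the function accepts any iterable of lines. A match only indicates the same symptom. Whether Bug 31992657 actually applies should still be confirmed with Oracle Support.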
