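Troubleshooting Oracle 19c RAC: opening a PDB fails with "terminating the instance due to ORA error 481"
A fairly fresh case. The environment is a 2-node Oracle RAC using the multitenant architecture with 3 PDBs. After nothing more than a change to the PDB-level PGA size parameters of one PDB on node 2, instance 2 crashed. After the node 2 database instance was restarted and the PDBs were opened one by one, instance 2 crashed again only when this particular PDB was opened, reporting the following error: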
2025-04-15T12:33:53.433625+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_50145.trc:
ORA-29740: evicted by instance number 2, group incarnation 121
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_50145.trc (incident=1398624) (PDBNAME=CDB$ROOT):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_1398624/anbob1_lmon_50145_i1398624.trc
2025-04-15T12:33:53.578776+08:00
USER (ospid: 80938): terminating the instance due to ORA error 481
Operations performed
# grep -i "alter system " alert*
alert_anbob1.log:QTP_ZZ(3):ALTER SYSTEM SET pga_aggregate_limit=150G SCOPE=SPFILE PDB='QTP_ZZ';
alert_anbob1.log:QTP_ZZ(3):ALTER SYSTEM SET pga_aggregate_target=50G SCOPE=SPFILE PDB='QTP_ZZ';
DB ALERT log
Completed: alter pluggable database QTP_ZZ close immediate
2025-04-15T00:58:16.171882+08:00
alter pluggable database QTP_ZZ open
2025-04-15T00:58:16.174395+08:00
QTP_ZZ(3):Pluggable database QTP_ZZ opening in read write
QTP_ZZ(3):SUPLOG: Initialize PDB SUPLOG SGA, old value 0x0, new value 0x18
QTP_ZZ(3):Autotune of undo retention is turned on.
QTP_ZZ(3):queued attach DA request 0xa849e3f58 for pdb 3, ospid 41223
2025-04-15T00:58:16.425005+08:00
Increasing priority of 32 RS
Domain Action Reconfiguration started (domid 3, new da inc 4, cluster inc 4)
Instance 1 is attaching to domain 3
Global Resource Directory partially frozen for domain action
Non-local Process blocks cleaned out
Set master node info
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
2025-04-15T00:59:01.857616+08:00
CLMN: clean deferred state objects - failed
2025-04-15T01:03:51.414380+08:00
minact-scn: got error during useg scan e:12751 usn:11
minact-scn: useg scan erroring out with error e:12751
2025-04-15T01:04:24.314328+08:00
Decreasing priority of 32 RS
2025-04-15T01:05:12.397123+08:00
Detected an inconsistent instance membership by instance 2
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_24376.trc (incident=782577) (PDBNAME=CDB$ROOT):
ORA-29740: evicted by instance number 2, group incarnation 6
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_782577/anbob1_lmon_24376_i782577.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2025-04-15T01:05:13.189357+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_24376.trc:
ORA-29740: evicted by instance number 2, group incarnation 6
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lmon_24376.trc (incident=782578) (PDBNAME=CDB$ROOT):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_782578/anbob1_lmon_24376_i782578.trc
2025-04-15T01:05:13.269464+08:00
USER (ospid: 57215): terminating the instance due to ORA error 481
2025-04-15T01:05:13.269822+08:00
Cause - 'Instance is being terminated due to dlm error '
2025-04-15T01:05:19.951483+08:00
ORA-1092 : opitsk aborting process
2025-04-15T01:05:21.288270+08:00
License high water mark = 1222
2025-04-15T01:05:28.316991+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2025-04-15T01:05:29.317677+08:00
Instance terminated by USER, pid = 57215
2025-04-15T01:05:29.759456+08:00
Warning: 2 processes are still attacheded to shmid 25:
 (size: 118784 bytes, creator pid: 23958, last attach/detach pid: 24483)
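With an alert log this long, it can help to pull out only the eviction and termination messages together with the timestamp that precedes each one. The following is a minimal sketch, assuming the alert log layout shown above, where every entry starts with a timestamp on its own line. The function name alert_key_events is made up for illustration and is not an Oracle tool.

```shell
# Print eviction/termination messages from an alert log, each prefixed with
# the most recent timestamp line seen before it.
# alert_key_events is a hypothetical helper name, not an Oracle utility.
alert_key_events() {
  awk '
    # Remember the latest timestamp line, e.g. 2025-04-15T01:05:13.269464+08:00
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9:.]+[+-][0-9][0-9]:[0-9][0-9]$/ { ts = $0; next }
    # Messages of interest in this case
    /ORA-29740:|terminating the instance|inconsistent instance membership|Instance terminated by/ {
      print ts " | " $0
    }
  ' "$@"
}

# Usage (file name taken from the grep output above):
#   alert_key_events alert_anbob1.log
```

The message patterns are only the ones that appear in this case; other incidents would need other patterns.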
PDB PARAMETER
sys@anbob2(762)> select * from v$pdbs;
CON_ID DBID CON_UID GUID NAME OPEN_MODE RES OPEN_TIME CREATE_SCN TOTAL_SIZE BLOCK_SIZE RECOVERY SNAPSHOT_PARENT_CON_ID APP APP APP APPLICATION_ROOT_CON_ID APP PRO LOCAL_UNDO UNDO_SCN UNDO_TIMESTAMP CREATION_TIME DIAGNOSTICS_SIZE PDB_COUNT AUDIT_FILES_SIZE MAX_SIZE MAX_DIAGNOSTICS_SIZE MAX_AUDIT_SIZE LAST_CHANGE TEM TENANT_ID UPGRADE_LEVEL GUID_BASE64
---------- ---------- ---------- -------------------------------- -------------------------------------------------- ---------- --- --------------------------------------------------------------------------- ---------- ---------- ---------- -------- ---------------------- --- --- --- ----------------------- --- --- ---------- ---------- ------------------- ------------------- ---------------- ---------- ---------------- ---------- -------------------- -------------- ----------- --- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------- ------------------------------
2 2285472469 2285472469 DA2EC12AE5F8B1D5E0532303D80A6F97 PDB$SEED READ ONLY NO 18-JUN-22 11.13.48.006 AM +08:00 317 8.6282E+10 8192 ENABLED NO NO NO NO NO 1 0 2022-03-14 21:36:30 0 0 0 0 0 0 COMMON USER NO 1 2i7BKuX4sdXgUyMD2ApvlwA=
3 3808343839 3808343839 DAF174EB35C75284E0532403D80AC894 QTP_ZZ READ WRITE NO 18-JUN-22 11.32.19.766 AM +08:00 3221042 1.0267E+13 8192 ENABLED NO NO NO NO NO 1 317 2022-03-24 13:52:52 0 0 0 0 0 0 COMMON USER NO 1 2vF06zXHUoTgUyQD2ArIlAA=
4 2224247423 2224247423 DE3F4BF37670A8D8E0532303D80A599D QTP_XX READ WRITE NO 18-JUN-22 11.32.19.767 AM +08:00 12418374 2.1310E+11 8192 ENABLED NO NO NO NO NO 1 317 2022-05-05 15:00:27 0 0 0 0 0 0 COMMON USER NO 1 3j9L83ZwqNjgUyMD2ApZnQA=
5 582974699 582974699 DF2DD911E6D74D1FE0532303D80A8284 QTP_YY READ WRITE NO 18-JUN-22 11.32.19.767 AM +08:00 22662829 3.2740E+11 8192 ENABLED NO NO NO NO NO 1 317 2022-05-17 11:36:36 0 0 0 0 0 0 COMMON USER NO 1 3y3ZEebXTR/gUyMD2AqChAA=

sys@anbob2(258)>select * from pdb_spfile$
DB_UNIQ_NAME PDB_UID SID NAME VALUE$ COMMENT$ SPARE1 SPARE2 SPARE3
------------------------------ --------------- -------------------- ---------------------------------------- -------------------------------------------------- ---------------------------------------- --------------- --------------- --------------------------------------------------------------------------------------------------------------------------------
* 2285472469 * db_securefile 'PREFERRED' 0 0
* 3808343839 * db_securefile 'PREFERRED' 0 0
* 3808343839 * sga_target 322122547200 322122547200 0
* 3808343839 * sga_min_size 64424509440 64424509440 0
* 3808343839 * pga_aggregate_limit 107374182400 161061273600 0
* 3808343839 * pga_aggregate_target 53687091200 69793218560 0
* 3808343839 * open_cursors 20000 20000 0
* 2224247423 * db_securefile 'PREFERRED' 0 0
* 582974699 * db_securefile 'PREFERRED' 0 0
* 2224247423 * sga_target 214748364800 214748364800 0
* 2224247423 * sga_min_size 64424509440 64424509440 0
* 2224247423 * pga_aggregate_limit 107374182400 107374182400 0
* 2224247423 * pga_aggregate_target 53687091200 53687091200 0
* 2224247423 * open_cursors 200 200 0
14 rows selected.
Note:
PDB-level spfile parameters are not stored in the database spfile. They are stored in the pdb_spfile$ table inside the database.
Incident log
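The memory values in the pdb_spfile$ output are byte counts, while the ALTER SYSTEM statements use the G suffix, so a quick conversion makes the two easier to compare. This is a small sketch only: the helper name bytes_to_gib is hypothetical, and it says nothing about which pdb_spfile$ column holds which value.

```shell
# Convert a byte count to GiB (1 GiB = 1024 * 1024 * 1024 bytes).
# bytes_to_gib is a hypothetical helper used only for this illustration.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%g\n", b / (1024 * 1024 * 1024) }'
}

# Values copied from the QTP_ZZ rows of the pdb_spfile$ output
for v in 107374182400 161061273600 53687091200 69793218560; do
  echo "$v bytes = $(bytes_to_gib "$v") GiB"
done
```

For example, 161061273600 bytes is 150 GiB and 53687091200 bytes is 50 GiB, the sizes used in the two ALTER SYSTEM statements.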
adrci> show trace /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_782577/anbob1_lmon_24376_i782577.trc Output the results to file: /tmp/utsout_185122_1404_4.ado 000000000 ? 000000082 ? ksedmp()+577 call dbkedDefDump() 000000003 000000002 7FFF11575E50 ? 7FFF11575F68 ? 000000000 ? 000000082 ? dbgexPhaseII()+2092 call ksedmp() 0000003EB 000000002 ? 7FFF11575E50 ? 7FFF11575F68 ? 000000000 ? 000000082 ? dbgexProcessError() call dbgexPhaseII() 7FED77D156D8 7FED77CD92A0 +1871 7FFF1157D500 7FFF11575F68 ? 000000000 ? 000000082 ? dbgePostErrorKGE()+ call dbgexProcessError() 7FED77D156D8 7FED77CD92A0 1853 000000001 000000000 000000000 ? 000000082 ? dbkePostKGE_kgsf()+ call dbgePostErrorKGE() 7FED77D559C0 7FED71774380 71 00000742C 000000000 ? 000000000 ? 000000082 ? kgeade()+392 call dbkePostKGE_kgsf() 7FED77D559C0 7FED71774380 00000742C 000000000 ? 000000000 ? 000000082 ? kgeselv()+89 call kgeade() 7FED77D559C0 ? 7FED77D55C08 ? 7FED71774380 ? 00000742C ? 000000000 000000000 ksesec2()+205 call kgeselv() 7FED77D559C0 ? 7FED71774380 ? 00000742C ? 012FC20E0 012FC20E8 000000002 kjxgrdtrt()+1317 call ksesec2() 7FED77D559C0 ? 000000000 000000002 000000000 000000006 0FFFFFFFF kjxgrDiskVote_Valid call kjxgrdtrt() 7FED723BC6F8 000000001 ateMembership()+187 000000006 000000000 ? 4 000000006 ? 0FFFFFFFF ? kjxgrDiskVote_Execu call kjxgrDiskVote_Valid 000000008 000000007 te()+86 ateMembership() 7FED723BC6F8 000000000 ? 000000006 ? 0FFFFFFFF ? kjxgrrcfgchk()+7788 call kjxgrDiskVote_Execu 7FED723BC6F8 000000007 ? te() 7FED723BC6F8 ? 000000000 ? 000000006 ? 0FFFFFFFF ? kjxggpoll()+171 call kjxgrrcfgchk() 7FED723BC6F8 000000000 7FED723BC6F8 ? 000000000 ? 000000006 ? 0FFFFFFFF ? kjfmact()+104 call kjxggpoll() 0722BC200 000000000 7FED723BC6F8 ? 000000000 ? 000000006 ? 0FFFFFFFF ? kjfcln()+4310 call kjfmact() 7FED722BC200 A65F575A0 000000000 000000000 ? 000000006 ? 0FFFFFFFF ? ksbrdp()+1167 call kjfcln() 0600124D0 A65F575A0 ? 000000000 ? 000000000 ? 000000006 ? 0FFFFFFFF ? 
opirip()+541 call ksbrdp() 0600124D0 ? A65F575A0 ? 000000000 ? 000000000 ? 000000006 ? 0FFFFFFFF ? opidrv()+581 call opirip() 000000032 000000004 7FFF11614E18 000000000 ? 000000006 ? 0FFFFFFFF ? sou2o()+165 call opidrv() 000000032 000000004
lmon trace
LMD0 group 0 GES resources 111744 pool 38 LMD1 group 0 GES resources 111744 pool 38 LMD2 group 0 GES resources 111744 pool 38 LMD3 group 0 GES resources 111744 pool 38 LMD4 group 0 GES resources 111744 pool 38 GES enqueues 169250 GCS latches 4096 GES IPC: Receivers 37 Senders 37 GES IPC: Buffers Receive 1000 Send (i:0 b:0) Reserve 0 GES IPC: Msg Size Regular 512 Batch 8192 Batching factor: enqueue replay 201, ack 223 Batching factor: cache replay 91 size per lock 88 Read-write Instance? 1, Designated Master? 1, BOC? 1, Broadcast SCN mode: 1 CSS cluster type is UNKNOWN (1) *** 2025-04-15T12:14:13.128153+08:00 (CDB$ROOT(1)) kjxggin: CGS tickets = 1000 kjxgmin: set instance reconnect max time to 40 secs kjxgmin: local IPv4 169.254.8.240 (UDP) kjxgrdmpcpu: CPU Total (raw:192 eff:192) Core 96 Socket 4 OCPU 192 kjxgrdmpcpu: High load threshold 245760 CGS/IMR TIMEOUTS: CSS recovery timeout = 31 sec (Total CSS waittime = 65) IMR Reconfig timeout = 75 sec CGS rcfg timeout = 85 sec kjxgmjoin: rimlost event instmap: *** 2025-04-15T12:14:13.222430+08:00 (CDB$ROOT(1)) kjxgmrcfg: Reconfiguration started, type 1 CGS/IMR TIMEOUTS: CSS recovery timeout = 31 sec (Total CSS waittime = 65) IMR Reconfig timeout = 75 sec CGS rcfg timeout = 85 sec kjxgmcs: Setting state to 0 0. 2025-04-15 12:14:13.742 : * Begin lmon rcfg step KJGA_RCFG_BEGIN (kjidomena 0, rcfginfo x0) * local undo 0 (0.0.1.0), kjitxtsn 1, kju_tx_tsn_affinity 1 * kjga: st x0, flg x2000, stp 1.0(0).0, rmno 0 * kjdrmst: domid 0, requester 32767, pt 0, hv 0, rm 0, rcfg int 0, undo 0, datype x0, sizefltr stg 0, intda chkinc 0, intda setinc 0 * adg_enabled? 1 domain 0 valid? 0 * RORA mode = FALSE * ----- RORA state at the beginning of rcfg ------ * adg_enabled 1, roram 32767, last roram 32767, rora_requester 32767 rora_invalid 0, rora_expand 0 adg_roram 32767, adg last roram 32767, adg_rora_requester 32767 rcvinst 32767, domain valid? 
0 * ------------------------------------------------- * * kjfcrfg: Dump rbuddy info at the beginning of rcfg: * kji_rbuddy_dmpi2t: dump i2t array: * array is empty * kji_rbuddy_dmpall: dump rbuddy array (rcvinst 32767, dom0 valid 0): * kji_rbuddy_graph: (cinc 0, valid 0, rinst 32767) [ ] * End of rbuddy info dump * Begin rcfg: free use mem = 710104928 (freemem 924863904, count 8) * kjfcqiora: query MULTIPLE LMD ENQUEUE INFO of inst 1 = 5@*@*, vallen 5, strlen 5 * kjfcqiora: parsing of string complete kjfcqiora: skipping namespace 14, not in use anymore * kjfcqiora: query MULTIPLE LMD ENQUEUE INFO of inst 2 = 5@*@*, vallen 5, strlen 5 * kjfcqiora: parsing of string complete kjfcqiora: skipping namespace 14, not in use anymore * kjfcrfg: kjfcqiora returned success 2025-04-15 12:14:13.746 : Reconfiguration started (old inc 0, new inc 119) * kjfcrfg: drm size limit is -1 buffers Dynamic remastering is disabled List of instances (total 2) : 1 2 My inst 1 (I'm a new instance) ' * kjfcrfg: Dump rbuddy info before kji_rbuddy_rcfg: * kji_rbuddy_dmpi2t: dump i2t array: * array is empty * kji_rbuddy_dmpall: dump rbuddy array (rcvinst 32767, dom0 valid 0): * kji_rbuddy_graph: (cinc 0, valid 0, rinst 32767) [ ] * End of rbuddy info dump * kjfcrfg: Dump rbuddy info after kji_rbuddy_rcfg: * kji_rbuddy_dmpi2t: dump i2t array: * array is empty * kji_rbuddy_dmpall: dump rbuddy array (rcvinst 32767, dom0 valid 0): * kji_rbuddy_graph: (cinc 0, valid 0, rinst 32767) [ ] * End of rbuddy info dump * kjfcrfg: sync timeout = 326 secs (2x default) TIMEOUTS: Local health check timeout: 70 sec Rcfg process freeze timeout: 70 sec Remote health check timeout: 140 sec *** 2025-04-15T12:30:43.091118+08:00 (CDB$ROOT(1)) kjfmReceiverHealthCB_Check: Reciever [26] is healthy. kjfmReceiverHealthCB_Check: Reciever [9] is healthy. kjfmReceiverHealthCB_Check: Reciever [5] is healthy. *** 2025-04-15T12:33:04.022623+08:00 (CDB$ROOT(1)) kjfmReceiverHealthCB_Check: Reciever [9] is healthy. 
kjfmReceiverHealthCB_Check: Reciever [2] is healthy. kjfmReceiverHealthCB_Check: Reciever [11] is healthy. kjfmReceiverHealthCB_Check: Reciever [28] is healthy. *** 2025-04-15T12:33:04.664247+08:00 (CDB$ROOT(1)) 2025-04-15 12:33:04.664 : kjxgrDD_rr_read: Detect reconfig from inst 2, seq 120, reason 3 ================================ == System Network Information == ================================ ==[ Network Interfaces : 5 (5 max) ]============ lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING == [ Network Transport Usage (ksipc: avail[xa8] sel[UDP]) (IPv4) ] == ===[ IPv4 Route Table : 4 entries ]============ Destination Gateway Iface 0.0.0.0/0 10.216.3.1 bond0 10.216.3.0/24 0.0.0.0 bond0 169.254.0.0/19 0.0.0.0 bond1 192.168.103.0/24 0.0.0.0 bond1 ===[ ARP Table ]============ IP address HW type Flags HW address Mask Device 10.216.3.36 0x1 0x2 5c:6f:69:55:cb:70 * bond0 10.216.3.39 0x1 0x2 5c:6f:69:55:cb:70 * bond0 169.254.27.60 0x1 0x2 74:50:4e:d8:df:c7 * bond1 10.216.3.38 0x1 0x2 5c:6f:69:55:cb:70 * bond0 192.168.103.2 0x1 0x2 74:50:4e:d8:df:c7 * bond1 10.216.3.37 0x1 0x0 00:00:00:00:00:00 * bond0 10.216.3.1 0x1 0x2 a0:69:d9:91:1e:36 * bond0 ===[ Network Config : 15 devices ]============ bond0 .rp_filter = 1 bond1 .rp_filter = 1 ens14f0 .rp_filter = 1 ens14f1d1 .rp_filter = 1 ens15f0 .rp_filter = 1 ens15f1d1 .rp_filter = 1 ens16f0 .rp_filter = 1 ens16f1 .rp_filter = 1 ens16f2 .rp_filter = 1 ens16f3 .rp_filter = 1 ens31f0 .rp_filter = 1 ens31f1 .rp_filter = 1 ens31f2 .rp_filter = 1 ens31f3 .rp_filter = 1 lo .rp_filter = 0 ==[ Network Interface States: num IF 5 Snapshots 5 ]== ***** info from 292s ago lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING bond1 | 192.168.103.1 | 255.255.255.0 
| UP|RUNNING bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING ***** info from 52s ago lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING ***** info from 112s ago lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING ***** info from 172s ago lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING ***** info from 232s ago lo | 127.0.0.1 | 255.0.0.0 | UP|RUNNING bond0 | 10.216.3.35 | 255.255.255.0 | UP|RUNNING bond0:1 | 10.216.3.37 | 255.255.255.0 | UP|RUNNING bond1 | 192.168.103.1 | 255.255.255.0 | UP|RUNNING bond1:1 | 169.254.8.240 | 255.255.224.0 | UP|RUNNING kjxgrrcfgchk: Initiating reconfig, reason=3 kjxgrrcfgchk: COMM rcfg - Disk Vote Required kjfmReceiverHealthCB_CheckAll: Recievers are healthy. 2025-04-15 12:33:04.688 : kjxgrnetchk: start 0x2105f75, end 0x21115bf 2025-04-15 12:33:04.688 : kjxgrnetchk: Network Validation wait: 46 sec 2025-04-15 12:33:04.688 : kjxgrnetchk: Sending comm check req to inst 2 kjxgrrcfgchk: prev pstate 6 mapsz 1024 kjxgrrcfgchk: new bmp: 1 2 kjxgrrcfgchk: cnct bmp: 1 2 kjxgrrcfgchk: disc bmp: kjxgrrcfgchk: work bmp: 1 2 kjxgrrcfgchk: rr bmp: 1 2 *** 2025-04-15T12:33:04.689167+08:00 (CDB$ROOT(1)) kjxgmrcfg: Reconfiguration started, type 3 CGS/IMR TIMEOUTS: CSS recovery timeout = 31 sec (Total CSS waittime = 65) IMR Reconfig timeout = 75 sec CGS rcfg timeout = 85 sec kjxgmcs: Setting state to 119 0. 
kjxgrs0h: disable CGS timeout 2025-04-15 12:33:04.705 : kjxgrDD_rr_read: Detect reconfig from inst 2, seq 120, reason 3 kjxgrsyncnewmap: mem info history mem[1]:0x39 mem[2]:0x39 *** 2025-04-15T12:33:04.739156+08:00 (CDB$ROOT(1)) Name Service frozen kjxgmcs: Setting state to 119 1. kjxgrsyncnewmap: mem info history mem[1]:0x39 mem[2]:0x39 kjxggpoll: change db group poll time to 50 ms kjmsetrmvtots: reconfig ending, lowering RS priority * kjfcdarmrfg: real reconfiguration detected, break out of kjfcdarmrfg * kjfcln: domain action rcfg aborted due to CGS RCFG *** 2025-04-15T12:33:09.565067+08:00 (CDB$ROOT(1)) ===================================================== kjxgmpoll: CGS state (119 1) start 0x67fde180 cur 0x67fde185 rcfgtm 5 sec *** 2025-04-15T12:33:14.572779+08:00 (CDB$ROOT(1)) ===================================================== ===================================================== kjxgmpoll: CGS state (119 1) start 0x67fde180 cur 0x67fde18a rcfgtm 10 sec *** 2025-04-15T12:33:34.580575+08:00 (CDB$ROOT(1)) ===================================================== kjxgmpoll: CGS state (119 1) start 0x67fde180 cur 0x67fde19e rcfgtm 30 sec kjxgmpngin: started oraping facility ===================================================== Group name: anbob Member id: node 0 inst 1 Cached KGXGN event: 0 Group State: State: 119 1 Flags: 0xc4:70100001 SSFlags: 0x0 Reconfig started cur-tm 0x210d183 start-tm 0x2105f76 tmout 0x55 Reconfig state 0x2 chkcnt 0 Reconfig INPG type 3 inc 119 rsn 0 data 0x0 Reconfig COMP type 1 inc 119 rsn 0 data 0x0 Commited Map: 1 2 Commited DISC Map: Commited RECN Map: New Map: 1 2 KGXGN Map: 1 2 DISC Map: RECN Map: KGXGN Map (tmp): 1 2 Master inst: 1 ... 
Dumping the osd state Dumping the osd context (verbose) dumping IPCLW connections IPCLW:[0.26]{-}[LMOD]:UTIL: [1744691632832754]cnh 0x7f15842fb990 id 5377251930210828319 lport 169.254.8.240:28276 rport 169.254.27.60:47441 trans=UDP ts=1744690453278397 type=RECV ctx 22@2.4 IPCLW:[0.27]{-}[LMOD]:UTIL: [1744691632832754] PCNH 0x7f15842fb990 State: 1 SMSN: 521181074 PKT(521183049.1215248887) Last Rcv 0:0:48.581.581940 Last valid Rcv 0:0:48.581.581940 IPCLW:[0.28]{-}[LMOD]:UTIL: [1744691632832754] Peer: LMON.KSXP_ksipc.36096 AckSeq: 1215248887. # Coalesced: 0 ksxp:lwcnh: pt (nil) cookie 139730491191120 (unknown) (LMON) pd len 32 magic 0x2793aa31 inst 1 inc 4 pid 22 ser 1 unid 22 status 1 IPCLW:[0.29]{-}[LMOD]:UTIL: [1744691632832754]cnh 0x7f15842e7ab0 id 1263109037 lport 169.254.8.240:40281 rport 169.254.27.60:62772 trans=UDP ts=1744690453223269 type=SEND ctx 22@2.4 IPCLW:[0.30]{-}[LMOD]:UTIL: [1744691632832754] ACNH 0x7f15842e7ab0 State: 1 SMSN: 10974043 PKT(10976013.1760611214) # Pending: 0 IPCLW:[0.31]{-}[LMOD]:UTIL: [1744691632832754] Peer: LMON.KSXP_cgs.36096 AckSeq: 1760611214 ksxp:lwcnh: pt (nil) cookie 0 (unknown) (LMON) pd len 32 magic 0x2793aa31 inst 1 inc 4 pid 22 ser 1 unid 22 status 1 KSXPLW: oustanding connections 2, sysinc 119, nodes 2 dumping OSD IPCLW ctx Dumping ksxp state ksxppg=0x7f158a51d6f8 ksxpsg=0x4143a0b898 ksxpsg_a=0x4143a0b898ksxpssg=0x4143a0b5d0 rm=0x4083d340a0 proc state: (pid: 22) [flg: 3 sg: 1] curts 1744691632 wtctr 1172762 Dumping ksxp contexts Context[5] 0x7f158a4aaf50 CGS state 1 Dumping connection queue connection count: 1 port[0] state 1 flag 1 osd 0x40003e9d7438 [(invalid key)] has requests port count: 1 ports 2025-04-15T12:33:52.872276+08:00
The reconfiguration reason codes are as follows (the LMON trace above reports reason 3):
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
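The reason codes listed above can be looked up mechanically when scanning many LMON traces. The sketch below assumes the trace writes the code as "reason=3" or "reason 3", as in the excerpt in this post. Both function names are hypothetical.

```shell
# Map a reconfiguration reason code to the description listed above.
# rcfg_reason and lmon_reasons are hypothetical helper names.
rcfg_reason() {
  case "$1" in
    0) echo "No reconfiguration" ;;
    1) echo "The Node Monitor generated the reconfiguration" ;;
    2) echo "An instance death was detected" ;;
    3) echo "Communications failure" ;;
    4) echo "Reconfiguration after suspend" ;;
    *) echo "Unknown reason code: $1" ;;
  esac
}

# Extract the distinct reason codes from LMON trace text
# (reads the named files, or standard input when no file is given).
lmon_reasons() {
  grep -oE 'reason[= ][0-9]+' "$@" | grep -oE '[0-9]+$' | sort -u
}

# Usage (trace file name taken from the alert log above):
#   for r in $(lmon_reasons anbob1_lmon_50145.trc); do rcfg_reason "$r"; done
```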
Note:
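The parameter itself is not the problem: after it was reset, the PDB still could not be opened. The problem appears to be in the ksxp network layer.
Approaches that may help: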
Database instances failed to start with error-LCK0 (ospid: xxxxxx): terminating the instance due to ORA error 481 (Doc ID 3058827.1)
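I have run into communication problems caused by rp_filter before, and recommend setting it to 0 or 2. In the LMON trace above, rp_filter is 1 on every device except lo, including bond1, the interface that carries the 169.254.8.240 interconnect address.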
net.ipv4.conf..rp_filter = 2
net.ipv4.conf..rp_filter = 2
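Before changing anything, the current rp_filter value of each interface can be listed. On Linux, 0 disables source validation, 1 is strict mode and 2 is loose mode. The sketch below reads the values from procfs. The function names are hypothetical, and the optional directory argument exists only so the functions can be tried against a copy of the tree.

```shell
# List rp_filter per interface as "name=value".
# $1: directory to scan; defaults to the Linux procfs location.
# rp_filter_report and rp_filter_strict are hypothetical helper names.
rp_filter_report() {
  rpf_root="${1:-/proc/sys/net/ipv4/conf}"
  for rpf_file in "$rpf_root"/*/rp_filter; do
    [ -r "$rpf_file" ] || continue
    printf '%s=%s\n' "$(basename "$(dirname "$rpf_file")")" "$(cat "$rpf_file")"
  done
}

# Print only the interfaces that are still in strict mode (value 1).
rp_filter_strict() {
  rp_filter_report "$@" | awk -F= '$2 == 1 { print $1 }'
}

# Usage:
#   rp_filter_report      # all interfaces
#   rp_filter_strict      # interfaces with rp_filter = 1
```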
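Try killing gpnpd.bin and gipcd.bin on all nodes.
Bug 32544124 – Instance Restart Terminated Due to DLM Error (Doc ID 32544124.8), although the kgnfscrechan call stack described there did not match this case.
The customer did not try any of these. They simply rebooted all nodes, after which everything returned to normal.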