Troubleshooting Oracle instance startup failure: instance cannot join the cluster, waiting on control file enqueue
Recently, on a customer's Oracle Exadata X7, database instance 2 was evicted. When it was restarted, the startup of instance 2 hung and began to affect instance 1, so the startup was aborted; instance 1 then kept running normally. Analysis showed that instance 2 was timing out waiting on the control file enqueue during startup, and the OS log showed "RDS/IB: conn <192.168.*.3,192.168.*.6,4> racing for more than 1s, retry". Here is a brief record of the case.
DB alert log inst 2
WARNING: inbound connection timed out (ORA-3136)
2023-06-21 11:55:44.532000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594.trc  (incident=363606):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 105678'
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_363606/anbob2_diag_105594_i363606.trc
2023-06-21 11:56:34.057000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_223793.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01013: user requested cancel of current operation
ORA-06512: at line 2
2023-06-21 11:56:44.173000 +08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_105676.trc  (incident=363830):
ORA-00239: timeout waiting for control file enqueue: held by 'inst 2, osid 105678' for more than 900 seconds    <<<<<<<<<<<<<<<<<<<<<<<<<<<<
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_363830/anbob2_lgwr_105676_i363830.trc
opidrv aborting process LGWR ospid (105676) as a result of ORA-603
PMON (ospid: 105570): terminating the instance due to error 470
System state dump requested by (instance=2, osid=105570 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594_20230621115644.trc
ORA-1092 : opitsk aborting process
ORA-1092 : opitsk aborting process
2023-06-21 11:56:47.718000 +08:00
ORA-1092 : opitsk aborting process
2023-06-21 11:56:49.276000 +08:00
Instance terminated by PMON, pid = 105570
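A quick way to rebuild the timeline is to grep each node's alert log for the CF enqueue related errors. A minimal sketch, run on each node against its local alert log (the default alert_<SID>.log name under the diag trace directory seen above is an assumption):

# run on node 2 for instance anbob2; repeat on node 1 for anbob1
$ grep -nE 'ORA-00(239|494)|enq: CF|terminating the instance' \
      /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/alert_anbob2.log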
DB instance 1
SQL> select * from v$lock where type='CF';

ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
00000018B32E5C30 00000018B32E5C88       2766 CF          0          0          2          0   26391859          2

SQL> select program,status,event,last_call_et,blocking_session,state from v$session where sid=2766;

PROGRAM                                          STATUS  EVENT                LAST_CALL_ET BLOCKING_SESSION STATE
------------------------------------------------ ------- -------------------- ------------ ---------------- -------------------
oracle@dm05db01.anbob.com.cn (CKPT)              ACTIVE  rdbms ipc message        26391931                  WAITING
DB instance 1 ckpt trace
Control file enqueue hold time tracking dump at time: 791292
*** 2023-06-21 10:06:36.561
 1: 12230ms (ro) file: kcrr.c line: 3628  count: 934  total: 1444730ms time: 42978
 2:  8880ms (rw) file: kcrf.c line: 10076 count: 320  total: 509630ms  time: 43233
 3:  8050ms (rw) file: kcrf.c line: 10139 count: 926  total: 1893160ms time: 4251900
 4:  7010ms (rw) file: kcv.c  line: 11974 count: 1674 total: 1969640ms time: 607190
 5:  4410ms (rw) file: kct.c  line: 7821  count: 886  total: 881030ms  time: 1707745
 6:  3170ms (rw) file: krse.c line: 1733  count: 1    total: 3170ms    time: 3349300
 7:  2630ms (ro) file: krss.c line: 871   count: 1    total: 2630ms    time: 475070
 8:  2460ms (rw) file: kcv.c  line: 12507 count: 2    total: 3230ms    time: 4221819
 9:  2220ms (rw) file: kcv.c  line: 12509 count: 5    total: 8070ms    time: 3583615
10:  1970ms (rw) file: krbm.c line: 2595  count: 7    total: 7190ms    time: 1006926

Control file enqueue hold time tracking dump at time: 798533
*** 2023-06-21 12:07:18.106
 1: 12230ms (ro) file: kcrr.c line: 3628  count: 140720308487078 total: 1444730ms time: 107374225378
 2:  8880ms (rw) file: kcrf.c line: 10076 count: 140720308486464 total: 509630ms  time: 107374225633
 3:  8050ms (rw) file: kcrf.c line: 10139 count: 140720308487070 total: 1893160ms time: 107378434300
 4:  7010ms (rw) file: kcv.c  line: 11974 count: 140720308487819 total: 1970240ms time: 107374789590
 5:  4410ms (rw) file: kct.c  line: 7821  count: 140720308487030 total: 881030ms  time: 107375890145
 6:  3170ms (rw) file: krse.c line: 1733  count: 140720308486145 total: 3170ms    time: 107377531700
 7:  2630ms (ro) file: krss.c line: 871   count: 140720308486145 total: 2630ms    time: 107374657470
 8:  2460ms (rw) file: kcv.c  line: 12507 count: 140720308486146 total: 3230ms    time: 107378404219
 9:  2220ms (rw) file: kcv.c  line: 12509 count: 140720308486149 total: 8070ms    time: 107377766015
10:  1970ms (rw) file: krbm.c line: 2595  count: 140720308486151 total: 7190ms    time: 107375189326
SQL> select program,status,event,last_call_et,blocking_session,state,paddr from v$session where sid=2766;

PROGRAM                                          STATUS  EVENT                LAST_CALL_ET BLOCKING_SESSION STATE               PADDR
------------------------------------------------ ------- -------------------- ------------ ---------------- ------------------- ----------------
oracle@dm05db01.anbob.com.cn (CKPT)              ACTIVE  rdbms ipc message        26392502                  WAITING             0000001951B65870

SQL> select pid, spid from v$process where addr='0000001951B65870';

       PID SPID
---------- ------------------------
        35 104889

SQL> oradebug setorapid 35
Oracle pid: 35, Unix process pid: 104889, image: oracle@dm05db01.anbob.com.cn (CKPT)
SQL> oradebug short_stack;
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kslwait()+141<-ksarcv()+207<-ksbabs()+330<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
SQL> oradebug short_stack;
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kslwait()+141<-ksarcv()+207<-ksbabs()+330<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
SQL> oradebug short_stack;
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kslwait()+141<-ksarcv()+207<-ksbabs()+330<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
SQL>

[oracle@dm05db01 ~]$ sysctl -a | grep sem
kernel.sem = 10000      10240000        10000   1024
kernel.sem_next_id = -1
DB instance 1
SQL> alter system checkpoint;
SQL> select checkpoint_change# from v$database;

[root@dm05db01 ~]# strace -f -c -p 104889
Process 104889 attached
^CProcess 104889 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.33    0.001742          20        89           semctl
 30.16    0.001044           0     10474       448 recvmsg
  9.53    0.000330           0      6384      3192 setsockopt
  5.20    0.000180           1       286           poll
  3.90    0.000135           0      3197           sendmsg
  0.87    0.000030           1        42        21 semtimedop
  0.00    0.000000           0        12           read
  0.00    0.000000           0        12           open
  0.00    0.000000           0        12           close
  0.00    0.000000           0        11           getrusage
  0.00    0.000000           0       161           times
------ ----------- ----------- --------- --------- ----------------
100.00    0.003461                 20680      3661 total
Note:
On instance 1, checkpointing completes normally.
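The -c summary above only gives call counts and totals; to see the latency of individual calls, strace can also be attached with timestamps. A minimal sketch, assuming a reasonably recent Linux strace and the CKPT OS pid (104889) found above; on Exadata the control file I/O travels over RDS sockets, so sendmsg/recvmsg and the semaphore waits are the calls to watch:

[root@dm05db01 ~]# strace -f -tt -T -p 104889 -o /tmp/ckpt_strace.out    # -tt wall-clock timestamps, -T per-call elapsed time
# issue 'alter system checkpoint' from another session, then Ctrl-C and review /tmp/ckpt_strace.out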
Disable auto-start of DB instance 2
Because CRS and ASM were both healthy, CRS kept pulling DB instance 2 back up automatically, and each startup attempt held the control file enqueue (CF) for a long time, which hung instance 1. So before instance 2 itself could be dealt with, its auto-start had to be disabled.
[oracle@dm05db02 ~]$ srvctl disable instance -d anbob -i anbob2
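Once disabled, instance 2 will no longer be restarted by CRS. The state can be verified, and the change reverted after the IB problem is fixed; a sketch with standard srvctl commands (output format varies by version):

[oracle@dm05db02 ~]$ srvctl config database -d anbob -a        # shows the enabled/disabled state
[oracle@dm05db02 ~]$ srvctl status instance -d anbob -i anbob2

# after the network issue is resolved, re-enable and start instance 2
[oracle@dm05db02 ~]$ srvctl enable instance -d anbob -i anbob2
[oracle@dm05db02 ~]$ srvctl start instance -d anbob -i anbob2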
Analyzing the instance 2 startup-failure trace
vi /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594.trc

Dump continued from file: /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_105594.trc
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 105678'

========= Dump for incident 363606 (ORA 494) ========
----- Beginning of Customized Incident Dump(s) -----
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG
enqueue holder: 'inst 2, osid 105678'
Process 'inst 2, osid 105678' is holding an enqueue for maximum allowed time.
The process will be terminated after collecting some diagnostics. It is possible
the holder may release the enqueue before the termination is attempted. If so,
the process will not be terminated and this incident will only be served as a warning.
Oracle Support Services triaging information: to find the root-cause, look at the
call stack of process 'inst 2, osid 105678' located below. Ask the developer that
owns the first NON-service layer in the stack to investigate. Common service layers
are enqueues (ksq), latches (ksl), library cache pins and locks (kgl), and row cache
locks (kqr).
Current Wait Stack:
 0: waiting for 'control file parallel write'
    files=0x2, block#=0x6e, requests=0x2
    wait_id=896400246 seq_num=62826 snap_id=1
    wait times: snap=18 min 31 sec, exc=18 min 31 sec, total=18 min 31 sec
    wait times: max=infinite, heur=18 min 31 sec
    wait counts: calls=0 os=0
    in_wait=1 iflags=0x5a0
There are 932 sessions blocked by this session.
Dumping one waiter:
  inst: 2, sid: 2687, ser: 1
  wait event: 'enq: CF - contention'
    p1: 'name|mode'=0x43460005
    p2: '0'=0x0
    p3: 'operation'=0x0
  row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
  min_blocked_time: 23 secs, waiter_cache_ver: 29350
Wait State:
  fixed_waits=0 flags=0x22 boundary=(nil)/-1
Session Wait History:
    elapsed time of 0.000125 sec since current wait
 0: waited for 'control file parallel write'
    files=0x2, block#=0x70, requests=0x2
    wait_id=896400245 seq_num=62825 snap_id=1
    wait times: snap=5 min 12 sec, exc=5 min 12 sec, total=5 min 12 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000329 sec of elapsed time
 1: waited for 'direct path write'
    file number=0x3db, first dba=0x1, block cnt=0x1
    wait_id=896400244 seq_num=62824 snap_id=1
    wait times: snap=10 min 25 sec, exc=10 min 25 sec, total=10 min 25 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.001413 sec of elapsed time
 2: waited for 'control file sequential read'
    file#=0x0, block#=0x6f, blocks=0x1
    wait_id=896400243 seq_num=62823 snap_id=1
    wait times: snap=0.001013 sec, exc=0.001013 sec, total=0.001013 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000021 sec of elapsed time
 3: waited for 'control file parallel write'
    files=0x2, block#=0x6c, requests=0x2
    wait_id=896400242 seq_num=62822 snap_id=1
    wait times: snap=5 min 11 sec, exc=5 min 11 sec, total=5 min 11 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.001367 sec of elapsed time
 4: waited for 'control file sequential read'
    file#=0x0, block#=0x6d, blocks=0x1
    wait_id=896400241 seq_num=62821 snap_id=1
    wait times: snap=0.000545 sec, exc=0.000545 sec, total=0.000545 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.001562 sec of elapsed time
 5: waited for 'control file sequential read'
    file#=0x0, block#=0x6b, blocks=0x1
    wait_id=896400240 seq_num=62820 snap_id=1
    wait times: snap=0.001161 sec, exc=0.001161 sec, total=0.001161 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000050 sec of elapsed time
 6: waited for 'control file sequential read'
    file#=0x0, block#=0x7c, blocks=0x1
    wait_id=896400239 seq_num=62819 snap_id=1
    wait times: snap=0.000470 sec, exc=0.000470 sec, total=0.000470 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000524 sec of elapsed time
 7: waited for 'control file sequential read'
    file#=0x0, block#=0x7a, blocks=0x1
    wait_id=896400238 seq_num=62818 snap_id=1
    wait times: snap=0.000357 sec, exc=0.000357 sec, total=0.000357 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000159 sec of elapsed time
 8: waited for 'control file sequential read'
    file#=0x0, block#=0x78, blocks=0x1
    wait_id=896400237 seq_num=62817 snap_id=1
    wait times: snap=0.000392 sec, exc=0.000392 sec, total=0.000392 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000179 sec of elapsed time
 9: waited for 'control file sequential read'
    file#=0x0, block#=0x76, blocks=0x1
    wait_id=896400236 seq_num=62816 snap_id=1
    wait times: snap=0.000373 sec, exc=0.000373 sec, total=0.000373 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000033 sec of elapsed time
Sampled Session History of session 2766 serial 1
---------------------------------------------------
The sampled session history is constructed by sampling the target session every 1 second.
The sampling process captures at each sample if the session is in a non-idle wait, an
idle wait, or not in a wait. If the session is in a samples in a single interval the
session may NOT be continuously idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 900 sec
---------------------------------------------------
  [1130 samples,                              11:36:55 - 11:55:44]
    idle wait at each sample
  [CF][current][num io 20][tot tim 1249367 ms]
Dumping process 35.105678 info:
*** 2023-06-21 11:55:44.539
Process diagnostic dump for oracle@dm05db02.anbob.com.cn (CKPT), OS id=105678, --blocker
session pid: 35, proc_ser: 1, sid: 2766, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 11:55:43 ]
  NOTE: scheduling delay has not been sampled for 0.694673 secs
  0.000000 secs from [ 11:55:39 - 11:55:44 ], 5 sec avg
  0.000000 secs from [ 11:54:44 - 11:55:44 ], 1 min avg
  0.000000 secs from [ 11:50:44 - 11:55:44 ], 5 min avg
loadavg : 3.66 3.31 5.50
Memory (Avail / Total) = 84835.05M / 772256.64M
Swap (Avail / Total) = 24576.00M / 24576.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ       WCHAN  STIME TTY        TIME       CMD
0 S oracle  105678     1  0  80   0 - 26284836    -      2022  ?          2-00:59:07 ora_ckpt_anbob2
Short stack dump:
ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-__poll()+24<-sskgxp_selectex()+392<-skgxpiwait()+4318<-skgxpwaiti()+1916<-skgxpwait()+162<-ossnet_wait_all()+248<-ossnet_wait()+212<-ossdisk_reopen()+293<-ossnet_reopen_dev_handles()+161<-ossnet_reconnect_box_handle_int()+949<-ossnet_vrpc_completion()+1285<-ossnet_process_netmsg_completion()+117<-ossnet_wait_all()+397<-ossnet_wait()+212<-oss_wait()+991<-kfk_wait_on_oss_io()+406<-kfk_block_oss()+349<-kfk_block()+901<-kfk_transit_waitIO()+188<-kfk_transitIO()+1770<-kfioWaitIO()+721<-kfioRequestPriv()+351<-kfioRequest()+691<-ksfd_kfioRequest()+649<-ksfd_osmbio()+413<-ksfdbio()+1579<-ksfdss_bio()+75<-ksfdbio()+2637<-kccwbp()+607<-kccrec_wdf()+364<-kccfhb()+1345<-kcccmt()+251<-kccecx()+61<-kcvcca()+2904<-ksbcti()+513<-ksbabs()+1742<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253

vi /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_105676.trc

*** 2023-06-21 11:56:44.175
*** SESSION ID:(2687.1) 2023-06-21 11:56:44.175
*** CLIENT ID:() 2023-06-21 11:56:44.175
*** SERVICE NAME:(SYS$BACKGROUND) 2023-06-21 11:56:44.175
*** MODULE NAME:() 2023-06-21 11:56:44.175
*** ACTION NAME:() 2023-06-21 11:56:44.175

Dump continued from file: /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_105676.trc
ORA-00239: timeout waiting for control file enqueue: held by 'inst 2, osid 105678' for more than 900 seconds

========= Dump for incident 363830 (ORA 239) ========
----- Beginning of Customized Incident Dump(s) -----
-------------------------------------------------------------------------------
TIMEOUT WAITING FOR CONTROL FILE ENQUEUE
requested mode    : X
enqueue holder    : 'inst 2, osid 105678'
enqueue held time : 900 seconds
The control file enqueue was held by process 'inst 2, osid 105678' for more than
900 seconds. Attempts to terminate the enqueue holder were unsuccesful, therefore
terminating the current process. To resolve this problem try restarting 'instance 2'
or rebooting the machine that 'instance 2' is on.
Right at the start, the instance 2 alert log during startup already showed the Hang Manager (DIA0) process detecting that the CKPT process had been holding the CF enqueue for more than 900 seconds, while LGWR's request for the CF enqueue in X mode kept failing. Because CKPT held the enqueue continuously, its write-related waits ('control file parallel write' and 'direct path write') lasted five and even ten minutes, while reads stayed normal. A pattern like this usually points to a problem somewhere in multipathing, synchronous replication, or the storage links.
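While the hang is in progress, the CF enqueue holder and its waiters can also be confirmed from the surviving instance with a cluster-wide query. A sketch (GV$ views need the hung instance to still answer cross-instance queries, so it may only return rows for instance 1):

SQL> select l.inst_id, l.sid, s.program, s.event, s.seconds_in_wait, l.lmode, l.request, l.block
       from gv$lock l, gv$session s
      where l.type = 'CF'
        and s.inst_id = l.inst_id
        and s.sid = l.sid;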
OS log
Jun 21 13:55:51 dm05db02 kernel: [152204048.913900] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:52 dm05db02 kernel: [152204049.277575] RDS/IB: conn <192.168.*.3,192.168.*.6,4> racing for more than 1s, retry
Jun 21 13:55:52 dm05db02 kernel: [152204049.560413] IPv4: martian source 192.168.*.4 from 192.168.*.2, on dev ib1
Jun 21 13:55:52 dm05db02 kernel: [152204049.560419] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:52 dm05db02 kernel: [152204049.560421] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:52 dm05db02 kernel: [152204049.913696] IPv4: martian source 192.168.*.4 from 192.168.*.1, on dev ib1
Jun 21 13:55:52 dm05db02 kernel: [152204049.913716] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:52 dm05db02 kernel: [152204049.913718] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:53 dm05db02 kernel: [152204050.672772] IPv4: martian source 192.168.*.4 from 192.168.*.2, on dev ib1
Jun 21 13:55:53 dm05db02 kernel: [152204050.672791] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:53 dm05db02 kernel: [152204050.672793] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:53 dm05db02 kernel: [152204050.913285] IPv4: martian source 192.168.*.4 from 192.168.*.1, on dev ib1
Jun 21 13:55:53 dm05db02 kernel: [152204050.913305] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun 21 13:55:53 dm05db02 kernel: [152204050.913307] ll header: 00000010: 00 00 00 00 08 06 00 00                          ........
Jun 21 13:55:54 dm05db02 kernel: [152204051.142976] RDS/IB: conn <192.168.*.3,192.168.*.6,4> racing for more than 1s, retry

# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.35.5538
        Hardware version: 1
        Node GUID: 0x506b4b030061e530
        System image GUID: 0x506b4b030061e533
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 3
                LMC: 0
                SM lid: 2
                Capability mask: 0x02514868
                Port GUID: 0x506b4b030061e531
                Link layer: InfiniBand
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 4
                LMC: 0
                SM lid: 2
                Capability mask: 0x02514868
                Port GUID: 0x506b4b030061e532
                Link layer: InfiniBand

# lspci | grep InfiniBand
3b:00.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3]
Note:
RDS/IB messages like these matter because Oracle Exadata uses the RDS protocol over InfiniBand for the private interconnect and for traffic between the database and storage servers; the fact that they appear at all means there is latency on these networks. However, the on-site team reported that ibping and rds-ping tests came back normal and no hardware fault LEDs were lit. ibstat showed all ports Active/LinkUp, with firmware version 2.35.5538.
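Even when rds-ping and ibping look clean, the kernel log itself gives a feel for how frequent the RDS reconnects and martian-source events are. A small sketch that just summarizes the messages shown above per minute, using standard grep/awk:

[root@dm05db02 ~]# grep -E 'RDS/IB: (conn|re-connect)|martian source' /var/log/messages \
                   | awk '{print $1, $2, substr($3,1,5)}' | sort | uniq -c | sort -rn | head
# each output line is: <count> <Mon> <day> <HH:MM>, i.e. how many such kernel messages per minute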
Control file test
SQL> SELECT * FROM V$CONTROLFILE;

STATUS  NAME                                                                             IS_ BLOCK_SIZE FILE_SIZE_BLKS
------- -------------------------------------------------------------------------------- --- ---------- --------------
        +DATA_DM01/anbob/controlfile/current01.ctl                                       NO       16384          29760
        +RECO_DM01/anbob/controlfile/current02.ctl                                       NO       16384          29760

SQL> SELECT 16*29760 FROM DUAL;

  16*29760
----------
    476160

# su - grid
[grid@dm05db02 ~]$ asmcmd
ASMCMD> cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl
^C

[grid@dm05db01 ~]$ asmcmd
ASMCMD> cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl
^C
ASMCMD> ls +DATA_DM01/anbob/controlfile/current01.ctl
current01.ctl
ASMCMD> ls -l +DATA_DM01/anbob/controlfile/current01.ctl
^[[[A
Note:
The control file is roughly 450 MB. Copying it directly with asmcmd hung on both nodes, yet a little later it would occasionally work again and come back within a second.
$ asmcmd
ASMCMD> ls -l +DATA_DM01/anbob/controlfile/current01.ctl
Type  Redund  Striped  Time  Sys  Name
                             N    current01.ctl => +DATA_DM01/anbob/CONTROLFILE/current.765.985169275
ASMCMD> ls -l +RECO_DM01/anbob/controlfile/current02.ctl
Type  Redund  Striped  Time  Sys  Name
                             N    current02.ctl => +RECO_DM01/anbob/CONTROLFILE/current.1522.985169281
ASMCMD> cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl_bak
copying +DATA_DM01/anbob/controlfile/current01.ctl -> /tmp/ctl_bak
ASMCMD>
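To turn "sometimes it hangs, sometimes it returns in a second" into numbers, the same asmcmd copy can simply be timed in a loop on each node. A rough sketch (/tmp/ctl_test is a throwaway target, and GNU time is assumed):

[grid@dm05db02 ~]$ for i in 1 2 3 4 5; do
>   /usr/bin/time -f "run $i: %e sec" asmcmd cp +DATA_DM01/anbob/controlfile/current01.ctl /tmp/ctl_test >/dev/null
>   rm -f /tmp/ctl_test
> done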
This shows that the IB storage network was unstable at the time, whether because of a fault or a bug.
No exactly matching case was found on MOS. The closest is Oracle Linux: RDS/IB Re-connection And Stalling for More Than 1 Min (Doc ID 2611709.1), which records:
Jan 1 00:00:48 hostname kernel: [12751701.033437] RDS/IB: connection <::ffff:192.168.0.3,::ffff:192.168.1.142,0> dropped due to 'incoming REQ in CONN_UP state'
Jan 1 00:00:48 hostname kernel: [12751701.033771] RDS/IB: re-connect <::ffff:192.168.1.142,::ffff:192.168.0.3,0> stalling for more than 1 min...(drops=1 err=0)
Jan 1 00:00:48 hostname kernel: [12751701.035083] RDS/IB: re-connect <::ffff:192.168.1.142,::ffff:192.168.0.1,0> stalling for more than 1 min...(drops=1 err=0)
It is not a complete match, but that issue was caused by an InfiniBand card firmware bug, and the fix is to upgrade the InfiniBand ConnectX-3 firmware to version 2.35.6302 or newer.
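Before going down the firmware route, it is easy to check whether the installed ConnectX-3 firmware is older than the 2.35.6302 mentioned in that note; a sketch that only reads the ibstat output already shown above:

# print the HCA firmware version and compare it against the recommended minimum
[root@dm05db02 ~]# ibstat | grep -i 'firmware version'
        Firmware version: 2.35.5538
# 2.35.5538 is older than 2.35.6302, so an upgrade would be indicated if this bug were the cause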
In the end the customer simply re-seated the IB cables on the instance 2 host and, surprisingly, everything went back to normal.