Troubloshooting Oracle RAC node reboot and OS log show “kernel: qla2xxx[ ] Abort command issued”
近期1客户Oracle RAC 节点OS重启,协助分析原因,db层无日志错误输出,RAC层有vote disk I/O timeout, OS层 qla2xxx [0000:81:00.0]-801c:7: Abort command 和DEVICE RESET 操作。qla2xxx 是QLogic FC HBA的驱动,怀疑重启是HBA卡导致IO失败,导致disk timeout, CRS发起reboot. 简单记录该问题。
db alert log
Mon Jun 19 10:11:17 2023 *********************************************************************** Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 19-JUN-2023 10:11:17 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12606 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.51.124)(PORT=37840)) Mon Jun 19 10:20:17 2023 <<<<<<<<<<<<<<<<<<< log gap Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Initial number of CPU is 96 Number of processor cores in the system is 48 Number of processor sockets in the system is 4
Note
db alert 无error
LHMB trce
*** 2023-06-19 10:09:45.578 ============================== LMON (ospid: 7091) has not moved for 128 sec (1687140585.1687140457) kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2 : waiting for event 'control file sequential read' for 116 secs with wait_id 843992236. ===[ Wait Chain ]=== Wait chain is empty. kjgcr_Main: KJGCR_ACTION - id 5 *** 2023-06-19 10:10:06.058 ============================== LMON (ospid: 7091) has not moved for 148 sec (1687140605.1687140457) kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2 : waiting for event 'control file sequential read' for 137 secs with wait_id 843992236. ===[ Wait Chain ]=== Wait chain is empty. kjgcr_Main: KJGCR_ACTION - id 5 *** 2023-06-19 10:10:26.537 ============================== LMON (ospid: 7091) has not moved for 169 sec (1687140626.1687140457) kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2 : waiting for event 'control file sequential read' for 157 secs with wait_id 843992236. ===[ Wait Chain ]=== Wait chain is empty. kjgcr_Main: KJGCR_ACTION - id 5 *** 2023-06-19 10:10:47.017 ============================== LMON (ospid: 7091) has not moved for 189 sec (1687140646.1687140457) kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2 : waiting for event 'control file sequential read' for 178 secs with wait_id 843992236. ===[ Wait Chain ]=== Wait chain is empty. kjgcr_Main: KJGCR_ACTION - id 5 *** 2023-06-19 10:11:07.497 ============================== LMON (ospid: 7091) has not moved for 209 sec (1687140666.1687140457) kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2 : waiting for event 'control file sequential read' for 198 secs with wait_id 843992236. ===[ Wait Chain ]=== Wait chain is empty. kjgcr_Main: KJGCR_ACTION - id 5 *** 2023-06-19 10:11:27.977 ============================== LMON (ospid: 7091) has not moved for 230 sec (1687140687.1687140457) kjfmGCR_HBCheckAll: LMON (ospid: 7091) has status 2 : waiting for event 'control file sequential read' for 219 secs with wait_id 843992236. ===[ Wait Chain ]=== Wait chain is empty. kjgcr_Main: KJGCR_ACTION - id 5
Note:
出现LMON hang, 等待control file sequential read持续时间增长,约从2023-06-19 10:07开始。
GI Alert log
2023-06-19 10:07:51.826: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 8010 msecs. 2023-06-19 10:07:51.826: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 7870 msecs. 2023-06-19 10:07:53.826: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 9360 msecs. 。。。 2023-06-19 10:08:08.979: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 25020 msecs. 2023-06-19 10:08:08.979: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 25160 msecs. 2023-06-19 10:08:08.979: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 25020 msecs. 2023-06-19 10:08:10.829: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 26360 msecs. 2023-06-19 10:08:16.831: 。。。 。。。 2023-06-19 10:08:49.111: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 65290 msecs. 。。。 [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 82880 msecs. 2023-06-19 10:09:15.843: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 91370 msecs. 2023-06-19 10:09:15.843: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 91880 msecs. 2023-06-19 10:09:15.843: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 92020 msecs. 2023-06-19 10:09:15.843: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 91880 msecs. 2023-06-19 10:09:18.971: [crsd(5234)]CRS-2765:Resource 'ora.OCR.dg' has failed on server 'ANBOB1'. 2023-06-19 10:09:19.025: [crsd(5234)]CRS-2878:Failed to restart resource 'ora.OCR.dg' 2023-06-19 10:09:19.026: [crsd(5234)]CRS-2769:Unable to failover resource 'ora.OCR.dg'. 2023-06-19 10:09:23.284: [cssd(4644)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/mapper/ocr01 will be considered not functional in 99530 milliseconds 2023-06-19 10:09:23.844: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 99380 msecs. 2023-06-19 10:09:23.844: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 99890 msecs. 2023-06-19 10:09:23.844: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 100030 msecs. 2023-06-19 10:09:23.844: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 99890 msecs. 2023-06-19 10:09:24.285: [cssd(4644)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/mapper/ocr03 will be considered not functional in 99180 milliseconds 2023-06-19 10:09:31.846: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 107380 msecs. 2023-06-19 10:09:31.846: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 107890 msecs. 2023-06-19 10:09:31.846: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 108030 msecs. 2023-06-19 10:09:31.846: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 107890 msecs. 。。。 。。。 2023-06-19 10:10:06.852: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 142890 msecs. 2023-06-19 10:10:06.852: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 143030 msecs. 2023-06-19 10:10:06.852: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 142890 msecs. 2023-06-19 10:10:13.134: [cssd(4644)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/mapper/ocr01 will be considered not functional in 49680 milliseconds 2023-06-19 10:10:14.134: [cssd(4644)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/mapper/ocr03 will be considered not functional in 49330 milliseconds 2023-06-19 10:10:14.854: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 150390 msecs. 2023-06-19 10:10:14.854: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 150900 msecs. 2023-06-19 10:10:14.854: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 151040 msecs. 2023-06-19 10:10:14.854: 。。。 [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 166900 msecs. 2023-06-19 10:10:30.857: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 167040 msecs. 2023-06-19 10:10:30.857: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 166900 msecs. 2023-06-19 10:10:39.858: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 175390 msecs. 2023-06-19 10:10:39.858: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 175900 msecs. 2023-06-19 10:10:39.858: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 176040 msecs. 2023-06-19 10:10:39.858: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 175900 msecs. 2023-06-19 10:10:43.339: [cssd(4644)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/mapper/ocr01 will be considered not functional in 19480 milliseconds 2023-06-19 10:10:44.339: [cssd(4644)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/mapper/ocr03 will be considered not functional in 19130 milliseconds 2023-06-19 10:10:47.860: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 183390 msecs. 2023-06-19 10:10:47.860: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 183900 msecs. 2023-06-19 10:10:47.860: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 184040 msecs. 2023-06-19 10:10:47.860: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 183900 msecs. 2023-06-19 10:10:56.494: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 192030 msecs. 2023-06-19 10:10:56.494: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 192540 msecs. 2023-06-19 10:10:56.494: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 192680 msecs. 2023-06-19 10:10:56.494: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 192540 msecs. 2023-06-19 10:11:03.216: [cssd(4644)]CRS-1604:CSSD voting file is offline: /dev/mapper/ocr01; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log. 2023-06-19 10:11:04.217: [cssd(4644)]CRS-1604:CSSD voting file is offline: /dev/mapper/ocr03; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log. 2023-06-19 10:11:04.217: [cssd(4644)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log 2023-06-19 10:11:04.217: [cssd(4644)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/ANBOB1/cssd/ocssd.log 2023-06-19 10:11:04.262: [cssd(4644)]CRS-1652:Starting clean up of CRSD resources. 2023-06-19 10:11:05.422: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 200950 msecs. 2023-06-19 10:11:05.422: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 201460 msecs. 2023-06-19 10:11:05.422: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 201600 msecs. 2023-06-19 10:11:05.422: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 201460 msecs. 2023-06-19 10:11:05.590: [/u01/app/11.2.0/grid/bin/oraagent.bin(6404)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/crsd/oraagent_grid//oraagent_grid.log" 2023-06-19 10:11:06.092: [/u01/app/11.2.0/grid/bin/oraagent.bin(6404)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/crsd/oraagent_grid//oraagent_grid.log" 2023-06-19 10:11:06.095: [/u01/app/11.2.0/grid/bin/oraagent.bin(6404)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/crsd/oraagent_grid//oraagent_grid.log" 2023-06-19 10:11:11.209: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 7990 msecs. 2023-06-19 10:11:13.649: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 209180 msecs. 2023-06-19 10:11:13.649: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 209690 msecs. 2023-06-19 10:11:13.649: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 209830 msecs. 2023-06-19 10:11:13.649: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 209690 msecs. 2023-06-19 10:11:19.854: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 16640 msecs. 2023-06-19 10:11:21.905: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 217440 msecs. 2023-06-19 10:11:21.905: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 217950 msecs. 2023-06-19 10:11:21.905: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 218090 msecs. 2023-06-19 10:11:21.905: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 217950 msecs. 2023-06-19 10:11:28.056: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingMonitorThread not scheduled for 24840 msecs. 2023-06-19 10:11:30.109: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 225640 msecs. 2023-06-19 10:11:30.109: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 226150 msecs. 2023-06-19 10:11:30.109: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread not scheduled for 226290 msecs. 2023-06-19 10:11:30.109: [cssd(4644)]CRS-1719:Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread not scheduled for 226150 msecs. 2023-06-19 10:18:38.098: [ohasd(4848)]CRS-2112:The OLR service started on node ANBOB1. 2023-06-19 10:18:38.125: [ohasd(4848)]CRS-8011:reboot advisory message from host: ANBOB1, component: cssmonit, with time stamp: L-2023-06-19-10:11:32.220 <<<<<<<<<<<<<<< [ohasd(4848)]CRS-8013:reboot advisory message text: oracssdmonitor is rebooting this node due to vote disk I/O timeout (no disk activity for 28030 milliseconds). 2023-06-19 10:18:38.129: [ohasd(4848)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 1 were announced and 0 errors occurred 2023-06-19 10:18:38.125: [ohasd(4848)]CRS-1301:Oracle High Availability Service started on node ANBOB1. 2023-06-19 10:18:38.617: [/u01/app/11.2.0/grid/bin/oraagent.bin(5570)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/ANBOB1/agent/ohasd/oraagent_grid//oraagent_grid.log" 2023-06-19 10:18:41.778: [/u01/app/11.2.0/grid/bin/orarootagent.bin(5574)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 2023-06-19 10:18:44.024: [gpnpd(5734)]CRS-2328:GPNPD started on node ANBOB1. 2023-06-19 10:18:46.427: [cssd(5807)]CRS-1713:CSSD daemon is started in clustered mode
Note:
显示CSSD发vdiskping也就是disk hearbeat超时,也是从2023-06-19 10:07:43开始,11g RAC默认disktimeout 为200s, reboottime为3s. 在10:11:04 开始做rebootless 清理CRS资源, 但cssd多个线程hang, 10:11:32发起reboot, 10:18重启动。
OS message
Jun 19 10:05:01 ANBOB1 su: (to oracle) root on none Jun 19 10:05:02 ANBOB1 systemd: Removed slice User Slice of root. Jun 19 10:05:05 ANBOB1 su: (to oracle) root on none Jun 19 10:05:05 ANBOB1 systemd: Started Session c245246 of user oracle. Jun 19 10:06:05 ANBOB1 su: (to oracle) root on none Jun 19 10:06:05 ANBOB1 systemd: Started Session c245247 of user oracle. Jun 19 10:07:05 ANBOB1 su: (to oracle) root on none Jun 19 10:07:05 ANBOB1 systemd: Started Session c245248 of user oracle. Jun 19 10:07:43 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:4 -- 2003. Jun 19 10:08:28 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:0 -- 2003. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Jun 19 10:08:41 ANBOB1 su: (to oracle) root on none Jun 19 10:08:41 ANBOB1 systemd: Started Session c245249 of user oracle. Jun 19 10:09:13 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:2 -- 2003. Jun 19 10:09:33 ANBOB1 systemd-udevd: worker [26686] /devices/virtual/block/dm-5 is taking a long time <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Jun 19 10:09:33 ANBOB1 systemd-udevd: worker [26688] /devices/virtual/block/dm-12 is taking a long time <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Jun 19 10:10:01 ANBOB1 systemd: Created slice User Slice of root. Jun 19 10:10:01 ANBOB1 systemd: Started Session 236697 of user root. Jun 19 10:10:01 ANBOB1 systemd: Started Session 236698 of user root. Jun 19 10:10:01 ANBOB1 su: (to oracle) root on none Jun 19 10:10:32 ANBOB1 kernel: qla2xxx [0000:81:00.0]-8009:7: DEVICE RESET ISSUED nexus=7:1:0 cmd=00000000c2007889. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Jun 19 10:10:32 ANBOB1 kernel: qla2xxx [0000:81:00.0]-800e:7: DEVICE RESET SUCCEEDED nexus:7:1:0 cmd=00000000c2007889. Jun 19 10:10:41 ANBOB1 su: (to oracle) root on none Jun 19 10:10:41 ANBOB1 systemd: Started Session c245250 of user oracle. Jun 19 10:11:04 ANBOB1 avahi-daemon[2471]: Withdrawing address record for 10.10.53.153 on bond0. Jun 19 10:11:04 ANBOB1 avahi-daemon[2471]: Withdrawing address record for 10.10.53.155 on bond0. Jun 19 10:11:05 ANBOB1 ntpd[2629]: Deleting interface #12 bond0:3, 10.10.53.155#123, interface stats: received=0, sent=0, dropped=0, active_time=823276 secs Jun 19 10:11:05 ANBOB1 ntpd[2629]: Deleting interface #10 bond0:1, 10.10.53.153#123, interface stats: received=0, sent=0, dropped=0, active_time=36846710 secs Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:0 -- 2003. Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-8009:7: DEVICE RESET ISSUED nexus=7:1:0 cmd=00000000c2007889. Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-800e:7: DEVICE RESET SUCCEEDED nexus:7:1:0 cmd=00000000c2007889. Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-8009:7: DEVICE RESET ISSUED nexus=7:1:2 cmd=00000000fd95afce. Jun 19 10:11:29 ANBOB1 kernel: qla2xxx [0000:81:00.0]-800e:7: DEVICE RESET SUCCEEDED nexus:7:1:2 cmd=00000000fd95afce. Jun 19 10:18:03 ANBOB1 kernel: microcode: microcode updated early to revision 0x16, date = 2019-06-17 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< OS reboot Jun 19 10:18:03 ANBOB1 kernel: Linux version 5.4.17-2011.6.2.el7uek.x86_64 (mockbuild@jenkins-172-17-0-2-64d560bf-4b94-4bde-a5f8-f55888be27a0) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39.0.3) (GCC)) #2 SMP Thu Sep 3 14:09:14 PDT 2020 Jun 19 10:18:03 ANBOB1 kernel: Command line: BOOT_IMAGE=/vmlinuz-5.4.17-2011.6.2.el7uek.x86_64 root=/dev/mapper/vg_ggdb1-lv_root ro crashkernel=auto rd.lvm.lv=vg_ggdb1/lv_root rd.lvm.lv=vg_ggdb1/lv_swap rhgb quiet LANG=en_US.UTF-8 Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 Jun 19 10:18:03 ANBOB1 kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format. Jun 19 10:18:03 ANBOB1 kernel: BIOS-provided physical RAM map: Jun 19 10:18:03 ANBOB1 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009e7ff] usable
Note:
根据MESSAGE日志的报错涉及
7:1:0 — dm-12 — data01
7:1:2 — dm-13 — ocr01
7:1:4 — dm-4 — ocr03
7:1:1– dm-5 — asmdata02
日志含义
kernel: qla2xxx [0000:81:00.0]-801c:7: Abort command issued nexus=7:1:0 — 2003.
qla2xxx is the name of the driver or kernel module.
[0000:81:00.0] is the PCI bus information of the device.
801c is a hexadecimal id which uniquely identifies the part of the code from where the message originated.
7 is the host number of the scsi target.
Abort command issued nexus=7:1:0 The driver aborted the command that was in progress to the scsi target 7:1:0.
2002 means the reset succeeded, 2003 means the reset failed.
10:11:04 开始释放CRS资源,对应OS日志的删除VIP Interface,在此之前只有data01的链路reset成功。而ocr依赖的链路在REBOOT时才成功。这也许是有双链路而依旧重启的原因,但长时间hang为什么没有链路切换或reset是HBA 驱动层的问题。 需要联系硬件工程师分析HBA卡问题原因,如是否需要固件升级等。
dmesg
[ 3.768929] [TTM] Zone kernel: Available graphics memory: 528137490 KiB [ 3.768930] [TTM] Zone dma32: Available graphics memory: 2097152 KiB [ 3.768931] [TTM] Initializing pool allocator [ 3.768935] [TTM] Initializing DMA pool allocator [ 3.775376] fbcon: astdrmfb (fb0) is primary device [ 3.778660] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.01.00.25-k. [ 3.779073] qla2xxx [0000:81:00.0]-011c: : MSI-X vector count: 32. [ 3.779076] qla2xxx [0000:81:00.0]-001d: : Found an ISP2031 irq 35 iobase 0x00000000dc296659. [ 3.781010] qla2xxx 0000:81:00.0: DMAR: 32bit DMA uses non-identity mapping [ 3.790795] megaraid_sas 0000:41:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000 [ 3.790810] megaraid_sas 0000:41:00.0: INIT adapter done [ 3.790812] megaraid_sas 0000:41:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5675 [ 3.791859] megaraid_sas 0000:41:00.0: pci id : (0x1000)/(0x005b)/(0x1000)/(0x9271) [ 3.791862] megaraid_sas 0000:41:00.0: unevenspan support : no
Note:
有1个 qla2xxx [0000:81:00.0] QLogic Fibre Channel HBA
确认qla2xxx 设备
# lspci|grep 81 81:00.0 Fibre Channel: OLogic Corp.TSP8324-based16Gb Fibre Channel to pCi Express Adapter 81:00.1 Fibre Channel: OLogic Corp.TSP8324-based16Gb Fibre Channel to pCi Express Adapter
当时没有查看该信息,但反馈有2块HBA卡,而不是同1 HBA的2个口。
检查卡错误基于上面的PCI地址
grep 81:00.0 var/log/messages | grep qla2xxx
启用qla2xxx driver扩展日志
Enable extended logging for the qla2xxx driver to try to capture any additional error messages when the issue occurs Raw $ chmod u+w /sys/module/qla2xxx/parameters/ql2xextended_error_logging $ echo "1" > /sys/module/qla2xxx/parameters/ql2xextended_error_logging Increase scsi extended event logging to get more information from the SCSI layer. $ sysctl -w dev.scsi.logging_level=0x1003
— enjoy —
对不起,这篇文章暂时关闭评论。