Troubleshooting database crash and HPUX-ia64 Error: 11 caused by IO error( SFP 光衰 )
一个2节点RAC on HPUX平台的数据库实例1突然因为I/O 错误,数据文件读写失败,自动重启后不久再次因为Vating Disk不可用crs crash, 另一节点正常。后暂时停止问题节点,排查硬件的底层环境问题,确认数据库主机节点的SFP(Small Form-factor Pluggable)光衰减严重,导致数据传输丢帧,频繁重发数据帧导致数据库存储I/O延迟过高(58ms-100s)或间断,数据库IO错误。
Db alert log
# db alert log 2019-08-19 11:11:28.886000 +08:00 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-27063: number of bytes read/written is incorrect HPUX-ia64 Error: 11: Resource temporarily unavailable Additional information: -1 Additional information: 8192 WARNING: Write Failed. group:2 disk:41 AU:273629 offset:393216 size:8192 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-27063: number of bytes read/written is incorrect HPUX-ia64 Error: 11: Resource temporarily unavailable Additional information: -1 Additional information: 8192 WARNING: Write Failed. group:2 disk:39 AU:254967 offset:98304 size:8192 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-27063: number of bytes read/written is incorrect HPUX-ia64 Error: 11: Resource temporarily unavailable Additional information: -1 Additional information: 8192 WARNING: Write Failed. group:2 disk:29 AU:241881 offset:229376 size:8192 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-27063: number of bytes read/written is incorrect HPUX-ia64 Error: 11: Resource temporarily unavailable Additional information: -1 Additional information: 8192 WARNING: Write Failed. group:2 disk:5 AU:418940 offset:753664 size:8192 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-15080: synchronous I/O operation to a disk failed WARNING: failed to write mirror side 1 of virtual extent 4540 logical extent 0 of file 908 in group 2 on disk 5 allocation unit 418940 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-15080: synchronous I/O operation to a disk failed WARNING: failed to write mirror side 1 of virtual extent 29563 logical extent 0 of file 573 in group 2 on disk 29 allocation unit 241881 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-15080: synchronous I/O operation to a disk failed WARNING: failed to write mirror side 1 of virtual extent 29658 logical extent 0 of file 590 in group 2 on disk 39 allocation unit 254967 Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-15080: synchronous I/O operation to a disk failed WARNING: failed to write mirror side 1 of virtual extent 27522 logical extent 0 of file 650 in group 2 on disk 41 allocation unit 273629 KCF: read, write or open error, block=0x35c130 online=1 file=426 '+DATADG/prianbob/datafile/ryya_lv30_249' error=15081 txt: '' Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: Errors in file /oracle/app/oracle/diag/rdbms/prianbob/anbob1/trace/anbob1_dbw9_9393.trc: ORA-63999: data file suffered media failure ORA-01114: IO error writing block to file 426 (block # 3522864) ORA-01114: data file 426: '+DATADG/prianbob/datafile/ryya_lv30_249' ORA-15081: failed to submit an I/O operation to a disk ORA-15081: failed to submit an I/O operation to a disk System state dump requested by (instance=1, osid=9393 (DBW9)), summary=[abnormal instance termination]. Mon Aug 19 11:38:20 2019 ORA-1092 : opitsk aborting process Mon Aug 19 11:38:25 2019 Instance termination failed to kill one or more processes Instance terminated by ASMB, pid = 16652 Mon Aug 19 11:39:49 2019 Starting ORACLE instance (normal) Mon Aug 19 11:40:01 2019
# gi alert log
2019-08-19 11:34:51.282 [cssd(27396)]CRS-1649:An I/O error occured for voting file: /dev/rdisk/disk446; details at (:CSSNM00060:) in /oracle/app/11.2.0.3/grid/log/anbob1/cssd/ocssd.log. 2019-08-19 11:34:51.282 [cssd(27396)]CRS-1649:An I/O error occured for voting file: /dev/rdisk/disk444; details at (:CSSNM00060:) in /oracle/app/11.2.0.3/grid/log/anbob1/cssd/ocssd.log. 2019-08-19 11:34:51.282 [cssd(27396)]CRS-1649:An I/O error occured for voting file: /dev/rdisk/disk445; details at (:CSSNM00060:) in /oracle/app/11.2.0.3/grid/log/anbob1/cssd/ocssd.log. 2019-08-19 11:36:32.114 [cssd(27396)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/disk444 will be considered not functional in 99149 milliseconds 2019-08-19 11:36:32.114 [cssd(27396)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/disk445 will be considered not functional in 99052 milliseconds 2019-08-19 11:36:32.114 [cssd(27396)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/disk446 will be considered not functional in 99148 milliseconds 2019-08-19 11:37:21.604 [cssd(27396)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rdisk/disk444 will be considered not functional in 49659 milliseconds 2019-08-19 11:37:21.605 [cssd(27396)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rdisk/disk445 will be considered not functional in 49562 milliseconds 2019-08-19 11:37:21.605 [cssd(27396)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rdisk/disk446 will be considered not functional in 49657 milliseconds 2019-08-19 11:37:51.904 [cssd(27396)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rdisk/disk444 will be considered not functional in 19359 milliseconds 2019-08-19 11:37:51.905 [cssd(27396)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rdisk/disk445 will be considered not functional in 19262 milliseconds 2019-08-19 11:37:51.905 [cssd(27396)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rdisk/disk446 will be considered not functional in 19357 milliseconds
Note:
数据库报 ORA-27063 、 HPUX-ia64 Error: 11: Resource temporarily unavailable、 ORA-15080 、ORA-63999 、ORA-01114 、 ORA-15081 等IO错误, 确认是的OS Layer一定存在IO异常, 后来VD 访问也出现了200 秒的 timeout 访问超时。
在LINUX 平台,多路径软件可能存在BUG, 如OEL平台的multipath bug 1344381 ,这个案例是HPUX平台,使用的是HP 3PAR存储。主机上未做任何调整,因此需要排查存储链路, 因为共享存储的另一个RAC 实例正常,很有可能出在节点1主机到存储之间。
存储链路上的故障点有, 如果有级联交换机会更复杂。
HBA SFP — cable –Switch SFP — Switch SFP — cable — Storage SFP
如果从存储和操作系统日志无法定位故障点,可以试试supportshow的信息定位。
ANBOB_1:FID128:admin> porterrshow -i 65 frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err 65: 823.5m 1.8g 1.4k 1.4k 1.4k 0 0 16 521.0k 10.9k 0 0 0 0 0 10.9k 0 0 ANBOB_1:FID128:admin> porterrshow -i 79 frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err 79: 1.8g 823.6m 0 0 0 0 0 0 0 2.8k 0 0 0 0 0 0 2.8k 0 ANBOB_1:FID128:admin> supportshow portshow 65 portDisableReason: None portCFlags: 0x0 portFlags: 0x4021 PRESENT U_PORT DISABLED LED LocalSwcFlags: 0x0 portType: 17.0 portState: 2 Offline Protocol: FC portPhys: 5 No_Sync portScn: 2 Offline port generation number: 22400 state transition count: 35 portId: 014100 portIfId: 43220039 portWwn: 20:41:00:27:f8:33:95:00 portWwn of device(s) connected: None Distance: normal portSpeed: N8Gbps Credit Recovery: Inactive LE domain: 0 Peer beacon: Off FC Fastwrite: OFF Interrupts: 0 Link_failure: 0 Frjt: 0 Unknown: 0 Loss_of_sync: 0 Fbsy: 0 Lli: 80 Loss_of_sig: 0 Proc_rqrd: 267 Protocol_err: 0 Timed_out: 0 Invalid_word: 522566 Rx_flushed: 0 Invalid_crc: 1414 Tx_unavail: 0 Delim_err: 0 Free_buffer: 0 Address_err: 0 Overrun: 0 Lr_in: 0 Suspended: 0 Lr_out: 40 Parity_err: 0 Ols_in: 0 2_parity_err: 0 Ols_out: 0 CMI_bus_err: 0
下面从操作系统上检查光纤通信, 这里记录一下不同OS检查光纤设备的方法
OS 列出HP机上连接的光纤卡设备 查看光纤卡的WWN号 ---------- ------------------------ --------------------------- HPUX ioscan -kfnNCfc fcmsutil /dev/fcd0 Solars luxadm probe ** or **luxadm -e port luxadm -e dump_map /devices/pci@0,0/pci1022,7450@2/pci1077,101@1/fp@0,0:devctl AIX lsdev -Cc adapter -S a | grep fcs lscfg -vpl fcs0 Linux lspci -nn| grep "Fibre Channel"** or ** systool -c fc_host systool -c fc_host -v host3 ** or **more /sys/class/fc_host/host?/port_name
光纤收发器最重要的两个因素:Tx功率和Rx功率。光学Tx功率是离开该设备的信号电平,它应该在发射机功率范围内。Rx功率是从远端设备接收的输入信号电平,它应该落在接收功率范围内。
dBm(毫瓦分贝)与mW(毫瓦)都是光功率的单位。
两者之间的换算关系:dBm=10xlgP,其中P为功率,单位为mW。
# ioscan -kfnNCfc
Class I H/W Path Driver S/W State H/W Type Description
======================================================================
fc 0 43/0/0/2/0/0/0 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
/dev/fcd0
fc 1 43/0/0/2/0/0/1 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
/dev/fcd1
fc 2 43/0/2/0/0/0/0 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
/dev/fcd2
fc 3 43/0/2/0/0/0/1 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
/dev/fcd3
fc 4 43/0/2/2/0/0/0 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
/dev/fcd4
fc 5 43/0/2/2/0/0/1 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
/dev/fcd5
fc 6 45/0/0/2/0/0/0 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 1)
/dev/fcd6
fc 7 45/0/0/2/0/0/1 fcd CLAIMED INTERFACE HP AH401A 8Gb Dual Port PCIe Fibre Channel Adapter (FC Port 2)
# fcmsutil /dev/fcd7 sfp
ADDRESS 0xA0:
Identifier: SFP
Ext. Identifier: GBIC/SFP function is defined by serial ID only
Connector: LC
Transceiver: 0x0 0x0 0x0 0x0 0x40 0x40 0xc 0x54
Encoding: 8B10B
BR, Nominal: 8500 Mbits/sec.
Length (9um): 0 km
Length (9um): 0 m
Length (50um): 50 m
Length (62.5um): 20 m
Length (Copper): 19 m
Vendor Name: HP-F
Vendor OUI: 0x009065
Vendor PN: AJ718A
Vendor rev: A
Wavelength: 0x0352
CC_BASE: 0x92
Options: -RATE SELECT is implemented: 1
-TX_DISABLE implemented & disables serial output: 1
-TX_FAULT is implemented. (See SFP MSA): 1
-Loss of Signal, inverted from standard in SFP MSA:0
-Loss of Signal, signal as defined in SFP MSA: 1
BR, max: 0%
BR, min: 0%
Vendor SN: MY84250LUJ
Date Code (yymmdd): 140620
Diagnostic Monitoring Type: -Digital diagnostic monitoring implemented: 1
-Internally Calibrated: 1
-Externally Calibrated: 0
-Received power measurement type: 1 (Average Power)
-Address change required: 0
Enhanced Options: -Alarm/warning flags implemented: 1
-Soft TX_DISABLE control and monitoring: 1
-Soft TX_DISABLE monitoring: 1
-Soft RX_LOS monitoring: 1
-Soft RATE_SELECT control and monitoring: 1
-Application Select control per SFF-8079: 0
SFF-8472 Compliance: 0x3
CC_EXT: 0x60
Vendor Specific: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
ADDRESS 0xA2:
Temp High Alarm: 90.000 C
Temp Low Alarm: -10.000 C
Temp High Warning: 85.000 C
Temp Low Warning: -5.000 C
Voltage High Alarm: 3.70 Volts
Voltage Low Alarm: 2.90 Volts
Voltage High Warning: 3.60 Volts
Voltage Low Warning: 3.00 Volts
Bias High Alarm: 17.00 mA
Bias Low Alarm: 1.00 mA
Bias High Warning: 14.00 mA
Bias Low Warning: 2.00 mA
TX Power High Alarm: 0.63 mW
TX Power Low Alarm: 0.13 mW
TX Power High Warning: 0.56 mW
TX Power Low Warning: 0.16 mW
RX Power High Alarm: 1.26 mW
RX Power Low Alarm: 0.01 mW
RX Power High Warning: 1.00 mW
RX Power Low Warning: 0.02 mW
Checksum (bytes 0-94): Valid
Temperature: 31.254 C
Vcc: 3.34 Volts
TX Bias: 8.04 mA
TX Power: 0.48 mW
RX Power: 0.36 mW
Status/Control Bits: -TX Disable State: 0
-Soft TX Disable: 0
-Rate Select State: 0
-Soft Rate Select: 0
-TX Fault: 0
-LOS: 0
-Data_Ready_Bar: 0
-Temp High Alarm: 0
-Temp Low Alarm: 0
-Vcc High Alarm: 0
-Vcc Low Alarm: 0
-TX Bias High Alarm: 0
-TX Bias Low Alarm: 0
-TX Power High Alarm: 0
-TX Power Low Alarm: 0
-RX Power High Alarm: 0
-RX Power Low Alarm: 0
-Temp High Warning: 0
-Temp Low Warning: 0
-Vcc High Warning: 0
-Vcc Low Warning: 0
-TX Bias High Warning: 0
-TX Bias Low Warning: 0
-TX Power High Warning: 0
-TX Power Low Warning: 0
-RX Power High Warning: 0
-RX Power Low Warning: 0
Vendor Specific:: 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x01
Tip:
问题时间点主机测的SFP 光功率低到了0.2mW, 因为存储有多路径,可以暂时禁用该链路,更换SFP 后,恢复正常。
对不起,这篇文章暂时关闭评论。