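Three Exadata failure cases: ORA-27302: failure occurred at: skgxpcnclrpc, memory exhaustion, and cell server disk errors
Last week I ran into several failures on Oracle Exadata Machines. This post briefly records the symptoms: a DB instance that failed to restart with OS-resource-related errors at skgxpcnclrpc, process and system failures after memory exhaustion, I/O hangs and errors, and the log output produced by a failed disk on a cell storage node.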
1, ORA-27302: failure occurred at: skgxpcnclrpc
After the instance was restarted automatically, it failed to start.
DB alert log:
2023-02-14T05:36:01.577384+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_233839.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.xx.xx;192.168.xx.xx//box/predicate at offset 0 for data length 0
ORA-27626: Exadata error: 12 (Network error)
ORA-27300: OS system dependent operation:connection invalid failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcnclrpc
Sometimes the ASM alert log reports:
ORA-00600: internal error code, arguments: [kfmdPriJoin99], [], [], [], [], [], [], []
ORA-29702: error occurred in Cluster Group Service operation
ORA-15055: unable to connect to ASM instance
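In other cases the instance does not crash and already-connected sessions keep working, but new connections fail with ORA-29701: unable to connect to Cluster Manager.
The cause is the systemd service systemd-tmpfiles-clean.service. When it is active, it automatically cleans up files under /tmp, by default those older than 10 days. This removes socket files under the .oracle directories, which breaks instance startup and new connections. The recommendation is to exclude these Oracle files from systemd-tmpfiles-clean cleanup by adding the following lines to the configuration file /usr/lib/tmpfiles.d/tmp.conf: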
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*
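Then restart the timer service. If you have already hit the instance startup problem above, CRS must be restarted to recreate the socket files under /tmp.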
# systemctl restart systemd-tmpfiles-clean.timer
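As a quick sanity check on the configuration change above, the following is a minimal sketch that reports which of the three recommended exclusions are missing from a tmpfiles.d configuration. The function name `missing_exclusions` and the sample file content are illustrative assumptions, not taken from the affected system; the sketch only looks for lines of the form `x <path>`.

```python
# Paths the article recommends excluding from systemd-tmpfiles cleanup.
REQUIRED_EXCLUSIONS = ("/tmp/.oracle*", "/var/tmp/.oracle*", "/usr/tmp/.oracle*")


def missing_exclusions(conf_text, required=REQUIRED_EXCLUSIONS):
    """Return the required paths that have no 'x <path>' line in conf_text."""
    excluded = set()
    for line in conf_text.splitlines():
        fields = line.split()
        # Blank lines and comments never have "x" as their first field.
        if len(fields) >= 2 and fields[0] == "x":
            excluded.add(fields[1])
    return [path for path in required if path not in excluded]


# Hypothetical file content: two aged directories and only one exclusion.
sample = """\
# sample tmp.conf content (illustrative only)
d /tmp 1777 root root 10d
d /var/tmp 1777 root root 30d
x /tmp/.oracle*
"""

print(missing_exclusions(sample))
# prints ['/var/tmp/.oracle*', '/usr/tmp/.oracle*']
```

On a real host, the text of /usr/lib/tmpfiles.d/tmp.conf could be read from disk and passed to the same function; an empty list would mean all three exclusions are present.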
2, Memory exhaustion
Output from /var/log/messages or dmesg:
Feb 15 10:20:56 anbob_com01 kernel: [104421625.834779] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 108963581369, last ping 108963586373, now 108963591392
Feb 15 10:20:56 anbob_com01 kernel: [104421625.849030] connection1:0: detected conn error (1022)
Feb 15 10:22:37 anbob_com01 kernel: [104421801.861301] INFO: task gipcd.bin:91190 blocked for more than 120 seconds.
Feb 15 10:22:37 anbob_com01 kernel: [104421801.869314] Tainted: P O 4.1.12-94.7.8.el6uek.x86_64 #2
Feb 15 10:22:37 anbob_com01 kernel: [104421801.877365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886475] gipcd.bin D ffff88575bf83a60 0 91190 1 0x00000080
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886479] ffff8855d660f980 0000000000000082 ffff8856a52bf000 ffff885e8c79c600
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886481] ffff885e83c3e180 ffff8855d660c008 ffff885e83c3e1e8 ffffffffffffffff
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886483] 0000000000000404 ffff8856a52bf000 ffff8855d660f9a0 ffffffff816b54fe
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886485] Call Trace:
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886492] [] schedule+0x3e/0x90
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886494] [] rwsem_down_read_failed+0xa5/0x130
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886498] [] ? __handle_mm_fault+0x1cb/0x370
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886502] [] call_rwsem_down_read_failed+0x14/0x30
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886503] [] ? down_read+0x24/0x30
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886507] [] __do_page_fault+0x39f/0x490
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886508] [] do_page_fault+0x37/0x90
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886510] [] page_fault+0x28/0x40
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886512] [] ? copy_user_enhanced_fast_string+0x5/0x10
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886514] [] ? copy_to_iter+0x81/0x2d0
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886518] [] skb_copy_datagram_iter+0x74/0x290
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886522] [] unix_stream_recvmsg+0x413/0x780
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886525] [] ? __pollwait+0x120/0x120
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886527] [] sock_recvmsg+0x4b/0x60
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886529] [] SYSC_recvfrom+0xf1/0x180
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886531] [] ? __audit_syscall_entry+0xac/0x110
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886534] [] ? do_audit_syscall_entry+0x6c/0x70
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886536] [] ? syscall_trace_enter_phase1+0x153/0x180
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886538] [] SyS_recvfrom+0xe/0x10
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886540] [] system_call_fastpath+0x12/0xce
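Free memory, as reported by free, was below 1 GB, and out of memory errors appeared in the DB and ASM alert logs.
The next step is to analyze memory usage in more detail, for example with top, to check for a process that is leaking memory or for an oversized file cache.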
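The free-memory check described above can be sketched as a small parser for /proc/meminfo style text. This is only an illustration under assumptions: the function names, the sample values, and the choice of adding Buffers to Cached as the "file cache" figure are mine, and the 1 GB threshold simply mirrors the number mentioned in this case.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo' style lines into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(meminfo, threshold_kb=1024 * 1024):
    """Summarize free memory and file cache, flagging free memory below 1 GB."""
    free_kb = meminfo.get("MemFree", 0)
    cache_kb = meminfo.get("Cached", 0) + meminfo.get("Buffers", 0)
    return {"free_kb": free_kb, "cache_kb": cache_kb, "low_free": free_kb < threshold_kb}


# Hypothetical values for illustration; not taken from the affected host.
sample = """\
MemTotal:       395838464 kB
MemFree:           812344 kB
Buffers:           102400 kB
Cached:         120586240 kB
"""

print(memory_summary(parse_meminfo(sample)))
# prints {'free_kb': 812344, 'cache_kb': 120688640, 'low_free': True}
```

A large cache_kb next to a low free_kb would point toward the file cache, while a small cache would suggest looking at per-process usage in top instead.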
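3, Cell server device errors
When the DB server reports cell disks as offline, background processes may hang at the same time and crash the instance. (Normally, losing disks on only one cell does not affect a disk group with normal redundancy.)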
NOTE: disk 9 (DGRECO_xxxxxxxADM01) in group 2 (DGRECO) is offline for writes
NOTE: disk 10 (DGRECO_xxxxxxCELADM01) in group 2 (DGRECO) is offline for writes
NOTE: disk 11 (DGRECO_CD_xxxxxCELADM01) in group 2 (DGRECO) is offline for writes
2023-02-14T05:29:39.826030+08:00
Dumping diagnostic data in directory=[cdmp_20230214052939], requested by (instance=1, osid=384187 (RMV5)), summary=[incident=809159].
2023-02-14T05:29:41.969429+08:00
ossnet_fail_defcon: Giving up on Cell 192.168.xx.xx as retry limit (9) reached.
2023-02-14T05:29:50.695936+08:00
ossnet_fail_defcon: Giving up on Cell 192.168.xx.xx as retry limit (9) reached.
2023-02-14T05:29:55.999727+08:00
TT04: Standby redo logfile selected for thread 2 sequence 1018972 for destination LOG_ARCHIVE_DEST_2
2023-02-14T05:30:04.439190+08:00
LG02 (ospid: 104783) waits for event 'log file parallel write' for 63 secs
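To see at a glance which disk groups are affected, the "offline for writes" notes can be grouped by disk group. The sketch below is a hedged illustration: the regular expression is written against the message wording shown in this alert log only, and the function name `offline_disks_by_group` is hypothetical. Because it scans with `finditer`, it works whether the notes sit on one line or on separate lines.

```python
import re
from collections import defaultdict

# Pattern matches the wording of the NOTE messages shown in the alert log above.
OFFLINE_RE = re.compile(
    r"NOTE: disk (\d+) \(([^)]+)\) in group (\d+) \(([^)]+)\) is offline for writes"
)


def offline_disks_by_group(log_text):
    """Map each disk group name to the (disk number, disk name) pairs reported offline."""
    groups = defaultdict(list)
    for match in OFFLINE_RE.finditer(log_text):
        disk_no, disk_name, _group_no, group_name = match.groups()
        groups[group_name].append((int(disk_no), disk_name))
    return dict(groups)


# The three NOTE messages from the alert log in this case.
sample = (
    "NOTE: disk 9 (DGRECO_xxxxxxxADM01) in group 2 (DGRECO) is offline for writes\n"
    "NOTE: disk 10 (DGRECO_xxxxxxCELADM01) in group 2 (DGRECO) is offline for writes\n"
    "NOTE: disk 11 (DGRECO_CD_xxxxxCELADM01) in group 2 (DGRECO) is offline for writes\n"
)

for group, disks in offline_disks_by_group(sample).items():
    print(group, [number for number, _name in disks])
# prints DGRECO [9, 10, 11]
```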
OS LOG
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#6 CDB: Read(16) 88 00 00 00 00 03 4a f1 f0 00 00 00 08 00 00 00
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: task abort: FAILED scmd(ffff882fbf6d4b40)
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: target reset called for scmd(ffff882fbf6d0540)
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#25 megasas: target reset FAILED!!
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#11 Controller reset is requested due to IO timeout#012SCSI command pointer: (ffff8817be583640)#011 SCSI host state: 5#011 SCSI
Feb 14 05:28:40 xdata01celadm01 kernel: IO request frame:#012#011f10f0000 00000000 00000000 00040420 00600002 00000020 00000000 00100000 #012#01100000000 00000010 00000000 00000000 00000000 00000000 00000000 02000000 #012#01100000088 d8d70500 00000188 00000008 00000000 00000000 00000000 00000000 #012#01100140012 00000009 d7d88801 00000005 00000800 00000000 00021000 00000000 #012#01174600000 0000002f 00010000 00000000 74610000 0000002f 00010000 00000000 #012#01174620000 0000002f 00010000 00000000 74630000 0000002f 00010000 00000000 #012#01174640000 0000002f 00010000 00000000 74650000 0000002f 00010000 00000000 #012#01174660000 0000002f 00010000 00000000 6c095000 00000000 00000090 80000000
Feb 14 05:28:40 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Feb 14 05:28:42 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Number of host crash buffers allocated: 512
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Crash Dump is available,number of copied buffers: 99
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Found FW in FAULT state, will reset adapter scsi0.
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: resetting fusion adapter scsi0.
Feb 14 05:28:47 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:52 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:57 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:58 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Waiting for FW to come to ready state
Feb 14 05:29:02 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:45:17 anbob_com01 kernel: Initializing cgroup subsys cpuset ---reboot
Check whether sda still reports errors after the OS reboot:
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#7 Add. Sense: Unrecovered read error
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 05 f0 54 98 01 00 00 07 ff 00 00
Feb 15 12:36:57 anbob_com01 kernel: blk_update_request: critical medium error, dev sda, sector 25506912257
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 Sense Key : Medium Error [current]
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 Sense Key : Medium Error [current]
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 Add. Sense: Unrecovered read error
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 Add. Sense: Unrecovered read error
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 05 f0 54 90 01 00 00 08 00 00 00
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 CDB: Read(16) 88 00 00 00 00 05 f0 54 88 01 00 00 08 00 00 00
Feb 15 12:36:57 anbob_com01 kernel: blk_update_request: critical medium error, dev sda, sector 25506910209
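After the reboot, sda was still reporting errors. Ask the hardware vendor to check whether the hardware is damaged. If a core dump was generated, it can also be analyzed with crash to inspect the call stack.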
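When handing the case to the hardware vendor, it can help to show that the failing reads and the reported bad sectors line up. The sketch below decodes the Read(16) CDB bytes printed in the kernel log, assuming the standard SCSI READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2 to 9, and a 4-byte transfer length in bytes 10 to 13). The function name `decode_read16` is illustrative.

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as hex bytes; return (start LBA, block count)."""
    data = bytes.fromhex(cdb_hex)
    if len(data) != 16 or data[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(data[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(data[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


# CDBs for tag#7 and tag#5 from the kernel log after the reboot.
print(decode_read16("88 00 00 00 00 05 f0 54 98 01 00 00 07 ff 00 00"))
# prints (25506912257, 2047)
print(decode_read16("88 00 00 00 00 05 f0 54 90 01 00 00 08 00 00 00"))
# prints (25506910209, 2048)
```

The decoded starting LBAs match the two sectors in the "critical medium error, dev sda" messages (25506912257 and 25506910209), so the medium errors correspond to those read requests.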