Three Exadata failure cases: ORA-27302: failure occurred at: skgxpcnclrpc, memory exhaustion, and cell server disk errors


2023/02/20

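Last week I ran into several failures on Oracle Exadata Machines. This post briefly records the symptoms: a DB instance that failed to restart with the OS-resource-related error skgxpcnclrpc, processes and the system failing after memory was exhausted (with I/O hangs and errors), and the log output produced by a bad disk on a cell storage node.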

1. ORA-27302: failure occurred at: skgxpcnclrpc
The instance restarted automatically, but the startup failed.
DB alert log:

2023-02-14T05:36:01.577384+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_233839.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.xx.xx;192.168.xx.xx//box/predicate at offset 0 for data length 0
ORA-27626: Exadata error: 12 (Network error)
ORA-27300: OS system dependent operation:connection invalid failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcnclrpc

Sometimes the ASM alert log also reports:

ORA-00600: internal error code, arguments: [kfmdPriJoin99], [], [], [], [], [], [], []
ORA-29702: error occurred in Cluster Group Service operation
ORA-15055: unable to connect to ASM instance

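Alternatively, the instance does not crash and already-connected sessions keep working, but new connections fail with ORA-29701: unable to connect to Cluster Manager.
The cause is the systemd unit systemd-tmpfiles-clean.service. When it is active, it automatically cleans up files under /tmp, by default those older than 10 days. This can remove socket files under the .oracle directories, which then breaks instance startup and new connections. The recommendation is to exclude these Oracle files from the systemd-tmpfiles cleanup by adding the following lines to the configuration file /usr/lib/tmpfiles.d/tmp.conf: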
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*
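Restart the timer unit with the command below. If you have already hit the instance startup problem described above, you also need to restart CRS so that the socket files under /tmp are regenerated.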
# systemctl restart systemd-tmpfiles-clean.timer
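To confirm that the exclusions are really in place, the configuration file can be checked for each of the three lines. The following is only a minimal sketch: the function name check_oracle_tmp_excludes is hypothetical, and it assumes the lines are written exactly as shown above, with no trailing whitespace.

```shell
#!/bin/sh
# Hypothetical helper: report which .oracle exclusion lines are present in,
# or missing from, a tmpfiles.d configuration file.
# Returns 0 when all three lines are present, 1 otherwise.
check_oracle_tmp_excludes() {
    conf="${1:-/usr/lib/tmpfiles.d/tmp.conf}"
    missing=0
    for dir in /tmp /var/tmp /usr/tmp; do
        # -x matches the whole line, -F treats the pattern as a fixed string
        # so the trailing "*" is not interpreted as a regex operator.
        if grep -qxF "x ${dir}/.oracle*" "$conf"; then
            echo "ok: x ${dir}/.oracle*"
        else
            echo "missing: x ${dir}/.oracle*"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_oracle_tmp_excludes /usr/lib/tmpfiles.d/tmp.conf
```

A non-zero return code means at least one exclusion is absent, so the function can be dropped into a periodic health check.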

2. Memory exhaustion
Output from /var/log/messages or dmesg:

Feb 15 10:20:56 anbob_com01 kernel: [104421625.834779]  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 108963581369, last ping 108963586373, now 108963591392
Feb 15 10:20:56 anbob_com01 kernel: [104421625.849030]  connection1:0: detected conn error (1022)
Feb 15 10:22:37 anbob_com01 kernel: [104421801.861301] INFO: task gipcd.bin:91190 blocked for more than 120 seconds.
Feb 15 10:22:37 anbob_com01 kernel: [104421801.869314]       Tainted: P           O    4.1.12-94.7.8.el6uek.x86_64 #2
Feb 15 10:22:37 anbob_com01 kernel: [104421801.877365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886475] gipcd.bin       D ffff88575bf83a60     0 91190      1 0x00000080
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886479]  ffff8855d660f980 0000000000000082 ffff8856a52bf000 ffff885e8c79c600
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886481]  ffff885e83c3e180 ffff8855d660c008 ffff885e83c3e1e8 ffffffffffffffff
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886483]  0000000000000404 ffff8856a52bf000 ffff8855d660f9a0 ffffffff816b54fe
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886485] Call Trace:
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886492]  [] schedule+0x3e/0x90
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886494]  [] rwsem_down_read_failed+0xa5/0x130
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886498]  [] ? __handle_mm_fault+0x1cb/0x370
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886502]  [] call_rwsem_down_read_failed+0x14/0x30
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886503]  [] ? down_read+0x24/0x30
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886507]  [] __do_page_fault+0x39f/0x490
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886508]  [] do_page_fault+0x37/0x90
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886510]  [] page_fault+0x28/0x40
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886512]  [] ?  copy_user_enhanced_fast_string+0x5/0x10
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886514]  [] ? copy_to_iter+0x81/0x2d0
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886518]  [] skb_copy_datagram_iter+0x74/0x290
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886522]  [] unix_stream_recvmsg+0x413/0x780
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886525]  [] ? __pollwait+0x120/0x120
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886527]  [] sock_recvmsg+0x4b/0x60
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886529]  [] SYSC_recvfrom+0xf1/0x180
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886531]  [] ? __audit_syscall_entry+0xac/0x110
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886534]  [] ? do_audit_syscall_entry+0x6c/0x70
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886536]  [] ? syscall_trace_enter_phase1+0x153/0x180
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886538]  [] SyS_recvfrom+0xe/0x10
Feb 15 10:22:37 anbob_com01 kernel: [104421801.886540]  [] system_call_fastpath+0x12/0xce
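When the messages file is large, it helps to list which tasks the kernel reported as blocked and how often. The sketch below assumes the message format shown in the excerpt above ("INFO: task NAME:PID blocked for more than N seconds"); the function name hung_tasks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize kernel hung-task warnings in a log file.
# Output columns: task:pid, timeout in seconds, number of occurrences.
hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than \([0-9]*\) seconds.*/\1 \2/p' "$1" \
        | sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Example: hung_tasks /var/log/messages
```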

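Free memory was below 1 GB, and out-of-memory errors appeared in both the DB and ASM alert logs.
The next step is to analyze where the memory went, for example with top, to see whether a process is leaking memory or the file cache has grown too large.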
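A first look at free memory and file cache can be taken from /proc/meminfo. The sketch below is a rough example rather than a full diagnosis: the function name mem_summary is hypothetical, and the 1024 MB default threshold simply mirrors the 1 GB figure mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print MemFree and Cached in MB from a meminfo-format
# file and warn when free memory is below a threshold (default 1024 MB).
mem_summary() {
    meminfo="${1:-/proc/meminfo}"
    threshold_mb="${2:-1024}"
    awk -v limit="$threshold_mb" '
        $1 == "MemFree:" { free_mb   = int($2 / 1024) }   # values are in kB
        $1 == "Cached:"  { cached_mb = int($2 / 1024) }
        END {
            printf "MemFree=%dMB Cached=%dMB\n", free_mb, cached_mb
            if (free_mb < limit) print "WARNING: free memory below " limit "MB"
        }
    ' "$meminfo"
}

# Largest processes by resident memory (RSS, in kB), top 10 plus the header:
# ps -eo pid,rss,comm --sort=-rss | head -n 11
```

A large Cached value next to a small MemFree value points toward file cache pressure, while a single process with a very large RSS points toward a leak in that process.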

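3. Cell server device errors
When the DB server reports cell disks going offline, background processes may hang at the same time and cause the instance to crash. (Normally, a failure confined to a single cell does not affect a disk group with normal redundancy.)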

NOTE: disk 9 (DGRECO_xxxxxxxADM01) in group 2 (DGRECO) is offline for writes
NOTE: disk 10 (DGRECO_xxxxxxCELADM01) in group 2 (DGRECO) is offline for writes
NOTE: disk 11 (DGRECO_CD_xxxxxCELADM01) in group 2 (DGRECO) is offline for writes
2023-02-14T05:29:39.826030+08:00
Dumping diagnostic data in directory=[cdmp_20230214052939], requested by (instance=1, osid=384187 (RMV5)), summary=[incident=809159].
2023-02-14T05:29:41.969429+08:00
ossnet_fail_defcon: Giving up on Cell 192.168.xx.xx as retry limit (9) reached.
2023-02-14T05:29:50.695936+08:00
ossnet_fail_defcon: Giving up on Cell 192.168.xx.xx as retry limit (9) reached.
2023-02-14T05:29:55.999727+08:00
TT04: Standby redo logfile selected for thread 2 sequence 1018972 for destination LOG_ARCHIVE_DEST_2
2023-02-14T05:30:04.439190+08:00
LG02 (ospid: 104783) waits for event 'log file parallel write' for 63 secs
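To see quickly how many disks went offline in each disk group, the "offline for writes" notes can be counted from the alert log. This is a small sketch that assumes the message format shown in the excerpt above; the function name offline_disks_by_group is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count "offline for writes" notes per disk group
# in an alert log. Output columns: disk group name, number of disks.
offline_disks_by_group() {
    awk '
        /^NOTE: disk [0-9]+ \(.*\) in group [0-9]+ \(.*\) is offline for writes/ {
            # Field 8 is the parenthesised disk group name, e.g. "(DGRECO)".
            group = $8
            gsub(/[()]/, "", group)
            count[group]++
        }
        END { for (g in count) print g, count[g] }
    ' "$1" | sort
}

# Example: offline_disks_by_group alert_+ASM1.log   (file name is illustrative)
```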

OS LOG

Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#6 CDB: Read(16) 88 00 00 00 00 03 4a f1 f0 00 00 00 08 00 00 00
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: task abort: FAILED scmd(ffff882fbf6d4b40)
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: target reset called for scmd(ffff882fbf6d0540)
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#25 megasas: target reset FAILED!!
Feb 14 05:28:40 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#11 Controller reset is requested due to IO timeout#012SCSI command pointer: (ffff8817be583640)#011 SCSI host state: 5#011 SCSI
Feb 14 05:28:40 xdata01celadm01 kernel: IO request frame:#012#011f10f0000 00000000 00000000 00040420 00600002 00000020 00000000 00100000 #012#01100000000 00000010 00000000 00000000 00000000 00000000 00000000 02000000 #012#01100000088 d8d70500 00000188 00000008 00000000 00000000 00000000 00000000 #012#01100140012 00000009 d7d88801 00000005 00000800 00000000 00021000 00000000 #012#01174600000 0000002f 00010000 00000000 74610000 0000002f 00010000 00000000 #012#01174620000 0000002f 00010000 00000000 74630000 0000002f 00010000 00000000 #012#01174640000 0000002f 00010000 00000000 74650000 0000002f 00010000 00000000 #012#01174660000 0000002f 00010000 00000000 6c095000 00000000 00000090 80000000 
Feb 14 05:28:40 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Feb 14 05:28:42 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Number of host crash buffers allocated: 512
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Crash Dump is available,number of copied buffers: 99
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Found FW in FAULT state, will reset adapter scsi0.
Feb 14 05:28:44 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: resetting fusion adapter scsi0.
Feb 14 05:28:47 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:52 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:57 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:28:58 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: Waiting for FW to come to ready state
Feb 14 05:29:02 anbob_com01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Feb 14 05:45:17 anbob_com01 kernel: Initializing cgroup subsys cpuset  ---reboot

Check whether sda is still reporting errors after the OS reboot:

Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#7 Add. Sense: Unrecovered read error
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 05 f0 54 98 01 00 00 07 ff 00 00
Feb 15 12:36:57 anbob_com01 kernel: blk_update_request: critical medium error, dev sda, sector 25506912257
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 Sense Key : Medium Error [current] 
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 Sense Key : Medium Error [current] 
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 Add. Sense: Unrecovered read error
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 Add. Sense: Unrecovered read error
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 05 f0 54 90 01 00 00 08 00 00 00
Feb 15 12:36:57 anbob_com01 kernel: sd 0:2:0:0: [sda] tag#4 CDB: Read(16) 88 00 00 00 00 05 f0 54 88 01 00 00 08 00 00 00
Feb 15 12:36:57 anbob_com01 kernel: blk_update_request: critical medium error, dev sda, sector 25506910209

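After the reboot sda was still reporting errors, so the hardware vendor should be asked to check whether the hardware is damaged. If a core dump was generated, it can also be analyzed with crash to inspect the call stack.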
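Before handing the logs to the vendor, it can be useful to count how many medium errors each device has reported. The sketch below assumes the "critical medium error, dev X, sector N" message format seen in the OS log; the function name medium_errors_by_device is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count "critical medium error" messages per block device.
# Output columns: device name, number of errors.
medium_errors_by_device() {
    # Keep only the device name that follows "dev ", then count occurrences.
    sed -n 's/.*critical medium error, dev \([^,]*\), sector [0-9]*.*/\1/p' "$1" \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: medium_errors_by_device /var/log/messages
```

A count that keeps growing after a reboot, as in this case, is a strong hint that the problem is in the disk or controller rather than a transient driver state.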


ANBOB™