Troubleshooting ORA-07445 [__lwp_kill()+48] [SIGIOT] error and instance crash
Recently, node 1 of an 11.2.0.3 RAC on HP-UX ia31 restarted. Before the restart, the database reported an ORA-7445 [__lwp_kill()+48] error.
# db alert
Sun Jul 26 03:05:26 2015
Thread 1 advanced to log sequence 44945 (LGWR switch)
  Current log# 5 seq# 44945 mem# 0: /dev/yyb_oravg02/ryyb_redo05
Sun Jul 26 03:05:29 2015
Archived Log entry 38703 added for thread 1 sequence 44944 ID 0x1474c95c dest 1:
Sun Jul 26 03:05:45 2015
Exception [type: SIGIOT, unknown code] [ADDR:0x6CA9] [PC:0xC0000000003125F0, __lwp_kill()+48] [exception issued by pid: 27817, uid: 1024] [flags: 0x0, count: 1]
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms3_27817.trc (incident=704134):
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [ADDR:0x6CA9] [PC:0xC0000000003125F0] [unknown code] []
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_704134/anbob1_lms3_27817_i704134.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Jul 26 03:05:47 2015
Dumping diagnostic data in directory=[cdmp_20150726030547], requested by (instance=1, osid=27817 (LMS3)), summary=[incident=704134].
Sun Jul 26 03:05:49 2015
PMON (ospid: 27789): terminating the instance due to error 484

# anbob1_lms3_27817.trc trace
*** 2015-07-26 03:05:45.259
SKGXP:[9ffffffffd5834d8.5]{0}: (27817 25925) ach 9ffffffffd599770 : RDMA: No Active: No State: PASSIVE OPEN (40)
SKGXP:[9ffffffffd5834d8.41]{0}: sconno: 0x434e3dda aconno: 0x5a0ee48b sadmno: 0x6459330c aadmno: 0x630e0d86 creqtime: 212728
SKGXP:[9ffffffffd5834d8.42]{0}: fragsz: 32768 cdt_bits: 3 tot_cdts: 8 cdts: max_sends: 8
SKGXP:[9ffffffffd5834d8.43]{0}: seqnxt: 58072 last_ack: 58072 lmseqn: 0 transform_side: No inactive_time: 5048206018
SKGXP:[9ffffffffd5834d8.44]{0}:
SKGXP:[9ffffffffd5834d8.45]{0}: Dumping Sliding Window
SKGXP:[9ffffffffd5834d8.46]{0}: Slot: 0 State: FREE
SKGXP:[9ffffffffd5834d8.47]{0}: ftype: 2 seqn: 58064 first_seqn: 58064 flen: 1520 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.48]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.49]{0}: Slot: 1 State: FREE
SKGXP:[9ffffffffd5834d8.50]{0}: ftype: 2 seqn: 58065 first_seqn: 58065 flen: 1520 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.51]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.52]{0}: Slot: 2 State: FREE
SKGXP:[9ffffffffd5834d8.53]{0}: ftype: 2 seqn: 58066 first_seqn: 58066 flen: 2672 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.54]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.55]{0}: Slot: 3 State: FREE
SKGXP:[9ffffffffd5834d8.56]{0}: ftype: 2 seqn: 58067 first_seqn: 58067 flen: 944 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.57]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.58]{0}: Slot: 4 State: FREE
SKGXP:[9ffffffffd5834d8.59]{0}: ftype: 2 seqn: 58068 first_seqn: 58068 flen: 1376 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.60]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.61]{0}: Slot: 5 State: FREE
SKGXP:[9ffffffffd5834d8.62]{0}: ftype: 2 seqn: 58069 first_seqn: 58069 flen: 368 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.63]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.64]{0}: Slot: 6 State: FREE
SKGXP:[9ffffffffd5834d8.65]{0}: ftype: 2 seqn: 58070 first_seqn: 58070 flen: 3248 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.66]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.67]{0}: Slot: 7 State: FREE
SKGXP:[9ffffffffd5834d8.68]{0}: ftype: 2 seqn: 58071 first_seqn: 58071 flen: 224 fragno: 0 tot_frags: 1
SKGXP:[9ffffffffd5834d8.69]{0}: rqh: 0000000000000000 ttrt: 0 tc: 0 xmit_ts: 0
SKGXP:[9ffffffffd5834d8.70]{0}: SKGXP Assertion FALSE Failed at location skgxp_slide_recv_window:gen-oow line_num: 19499 <<<<<<<<<<<<<<<

*** 2015-07-26 03:05:45.291
Exception [type: SIGIOT, unknown code] [ADDR:0x6CA9] [PC:0xC0000000003125F0, __lwp_kill()+48] [exception issued by pid: 27817, uid: 1024] [flags: 0x0, count: 1]
Incident 704134 created, dump file: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_704134/anbob1_lms3_27817_i704134.trc
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [ADDR:0x6CA9] [PC:0xC0000000003125F0] [unknown code] []
ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/oracle/app/oracle/diag/rdbms/anbob/anbob1/cdump'
# /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lms3_27817.trc
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [ADDR:0x6CA9] [PC:0xC0000000003125F0] [unknown code] []

*** 2015-07-26 03:05:45.416
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
skdstdst <-__lwp_kill()+48<-__pthread_kill()+2512<-_raise()+224<-abort()+544<-_assert()+608<-skgxp_assert()+784
<-skgxp_assert_recv()+1488<-skgxp_window_land_recv()+624
<-skgxpprcrcv()+160<-skgxp_recv_next_fragment()+3504<-skgxprusr()+1008<-skgxpiwait()+9728
----- End of Abridged Call Stack Trace -----

The functions of interest in this call stack:
skgxpiwait() part of skgxp, the OSD (Operating System Dependent) API used by the TCP/IP version of IPC
skgxp_recv_next_fragment() receive the next fragment on the wire
skgxp_window_land_recv() do sliding-window processing for the received fragment
_assert() / abort() C runtime assertion failure path. abort() raises SIGABRT, for which SIGIOT is a historical alias, so the SIGIOT in the ORA-07445 message is the result of the failed SKGXP assertion rather than an independent fault
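Note:
I found no documentation for skgxp_slide_recv_window, but the name suggests it implements a network sliding-window protocol: packet-based data arrives in fragments, is held in a buffer, and is reassembled in order before being handed to the upper (application) layer. A failure in this function therefore most likely occurred either during final reassembly or because some packets in the middle of the transfer were not delivered.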
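To make the guess above concrete, here is a minimal sketch of a generic receive-side sliding window. It is not Oracle's SKGXP code, whose internals are not documented here. The class name `ReceiveWindow` and its methods are hypothetical. The window size of 8 and the starting sequence number 58072 are borrowed from the trace (`seqnxt: 58072`, eight slots) purely as illustrative values.

```python
class ReceiveWindow:
    """Toy fixed-size receive window keyed by sequence number (hypothetical)."""

    def __init__(self, next_seq, size=8):
        self.next_seq = next_seq   # lowest sequence number not yet delivered
        self.size = size           # number of slots in the window
        self.pending = {}          # seq -> payload, received but not yet deliverable

    def in_window(self, seq):
        return self.next_seq <= seq < self.next_seq + self.size

    def receive(self, seq, payload):
        """Accept one fragment and return the payloads now deliverable in order."""
        if seq < self.next_seq:
            return []              # duplicate of an already delivered fragment: drop it
        if not self.in_window(seq):
            # A real implementation might drop, NAK, or assert here.
            raise ValueError(
                f"out-of-window fragment: seq={seq}, window starts at {self.next_seq}"
            )
        self.pending[seq] = payload
        delivered = []
        # Slide the window forward over every contiguous fragment we now hold.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


if __name__ == "__main__":
    window = ReceiveWindow(next_seq=58072)
    print(window.receive(58073, "b"))   # held back: 58072 has not arrived yet
    print(window.receive(58072, "a"))   # both fragments become deliverable
```

In a design like this, lost or corrupted fragments leave gaps that stall the window, and a fragment whose sequence number falls outside the window has to be treated as an error. That would be consistent with an assertion firing in window-handling code, though the actual condition checked at skgxp_slide_recv_window:gen-oow is not known.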
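The main suspects were the network, the host, or the application. A stable interconnect is a prerequisite for an efficient RAC, so I checked the checksum counters with netstat -s: node 2 showed 0 bad checksums, while node 1 (the node that restarted) showed 21473.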
#node 1
udp:
        211615 incomplete headers
        21473 bad checksums                              #<=============
        190142 socket overflows
ip:
        242745393114 total packets received
        227417 bad IP headers
        10347787171 fragments received
        64410 fragments dropped (dup or out of space)    #<=============
        28719 fragments dropped after timeout
        0 packets forwarded
        17 packets not forwardable

#node2
udp:
        648577 incomplete headers
        0 bad checksums                                  #<<<<<<<<<<<<<<
        648577 socket overflows
ip:
        230910024388 total packets received
        323248 bad IP headers
        9016546368 fragments received
        0 fragments dropped (dup or out of space)
        0 fragments dropped after timeout
        0 packets forwarded
        0 packets not forwardable
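Comparing these counters by eye is error-prone when several nodes are involved. The following is a rough sketch of how the comparison could be scripted. The helper names `parse_netstat_counters` and `differing_counters` are hypothetical. The parser assumes the simple `<count> <description>` layout shown above, grouped under `udp:` / `ip:` headers. The sample text is an excerpt of the figures from this case.

```python
import re

NODE1 = """\
udp:
    211615 incomplete headers
    21473 bad checksums   #<=============
    190142 socket overflows
ip:
    64410 fragments dropped (dup or out of space)
    28719 fragments dropped after timeout
"""

NODE2 = """\
udp:
    648577 incomplete headers
    0 bad checksums
    648577 socket overflows
ip:
    0 fragments dropped (dup or out of space)
    0 fragments dropped after timeout
"""


def parse_netstat_counters(text):
    """Return {(protocol, description): count} from 'netstat -s' style text."""
    counters = {}
    proto = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing '#...' annotations
        if not line:
            continue
        header = re.fullmatch(r"(\w+):", line)
        if header:
            proto = header.group(1)
            continue
        entry = re.fullmatch(r"(\d+)\s+(.+)", line)
        if entry and proto:
            counters[(proto, entry.group(2))] = int(entry.group(1))
    return counters


def differing_counters(a, b):
    """Return {key: (a_value, b_value)} for counters present in both that differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


if __name__ == "__main__":
    n1 = parse_netstat_counters(NODE1)
    n2 = parse_netstat_counters(NODE2)
    for (proto, name), (v1, v2) in sorted(differing_counters(n1, n2).items()):
        print(f"{proto:4} {name:45} node1={v1:<10} node2={v2}")
```

Since these counters are cumulative since boot, a single snapshot mainly shows which node is affected. Taking two snapshots some minutes apart and diffing them would show whether the errors are still accumulating.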
Several similar bugs turned up on MOS:
Bug 19520489 : ORA-7445 [__LWP_KILL()+8] FOLLOWED BY INSTANCE CRASH
Bug 18518529 : ORA-7445 [__LWP_KILL()+8] [SIGIOT] AND INSTANCE CRASH
Bug 18011512 : LMON: TERMINATING THE INSTANCE, NOT ABLE TO START IT AGAIN
Bug 12753779 - LMS PROCESSES DIE WITH ORA-07445: EXCEPTION ENCOUNTERED: CORE DUMP [_KILL()+48]
Bug 14119119 : ORA-7445 [__LWP_KILL+48]
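The bug related to hostname length did not match our case, and the number of LMS processes on our instances was consistent. After several rounds of discussion on the SR, the following changes were agreed: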
1. Cut ip_fragment_timeout to 100 (1 second). (Default is 60 seconds.)
2. Increase ip_reass_mem_limit to 10000000 (10 MB). (Default is 2 MB.)
3. Increase socket_udp_rcvbuf_default and socket_udp_sndbuf_default:
   udp_sendspace >= max([(DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4096], 65536);
   udp_recvspace >= udp_sendspace * 2 (on hpux)
4. Increase the _lm_tickets and gcs_server_processes parameter values (according to the actual situation).
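The sizing rule in item 3 is simple arithmetic, and the sketch below works it through. The function name `udp_buffer_sizes` is hypothetical, and the lower bound is assumed to be 65536 bytes (64 KB). The block size and multiblock read count in the example are sample inputs, not values taken from this system.

```python
def udp_buffer_sizes(db_block_size, multiblock_read_count, floor=65536):
    """Return (send, recv) minimum UDP buffer sizes in bytes.

    send = max(DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT + 4096, floor)
    recv = send * 2
    The floor of 65536 is an assumption stated in the surrounding text.
    """
    send = max(db_block_size * multiblock_read_count + 4096, floor)
    recv = send * 2
    return send, recv


if __name__ == "__main__":
    # Sample inputs only: an 8 KB block size and a multiblock read count of 16.
    send, recv = udp_buffer_sizes(8192, 16)
    print(f"send buffer >= {send} bytes, receive buffer >= {recv} bytes")
```

With an 8 KB block size and a multiblock read count of 16, this gives a send buffer of at least 135168 bytes and a receive buffer of at least 270336 bytes. With small read counts the 64 KB floor applies instead.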
— update 20170329
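A friend recently ran into the same problem, so here is an update. Our case was somewhat complicated at the time, but the problem has been resolved.
For this kind of problem I suggest adjusting the OS parameters above first and observing. In our case, ORA-7445 did not recur after the parameter changes, but several other environments with the same configuration also crashed. They showed bad checksums and socket overflows too, though without ORA-7445, and we could never establish whether the root cause was the network or the database. Because the database that hit ORA-7445 was a core system, the patch below was also installed on it while we were resolving the problem on the other databases. This happened after the parameter changes but before we could confirm whether those changes alone had fixed ORA-7445.
At the customer's insistence, Oracle development provided a merge patch specifically for this case, explaining that it hardens background processes such as LMS. In the more than one year since it was installed, the problem has not recurred.
Patch 21252795: MERGE REQUEST ON TOP OF DATABASE PSU 11.2.0.3.7 FOR BUGS 18719357 16088176
Prerequisite patch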
16619892 DATABASE PATCH SET UPDATE 11.2.0.3.7 (INCLUDES CPUJUL2013)
Bugs fixed by this patch
16088176 LNX64-12.1-RAC-CDB: LMD PROC HIT ORA-600 [KJMSCNDSCQ:TIMEOUT] AND INST CRASH
16819962 CDB_RAC : INSTANCE TERMINATED BY LMD0 – LMON RECEIVED AN INSTANCE EVICTION
17452841 LMS HIT ORA-600 [KJCTSRW:1]
17801017 INSTANCES ARE EVICTED FROM CLUSTER WHEN INTERNAL DLM MESSAGING STALLS
17847764 FA + INDEX COMP: ORA-481 LMON INST EVICT – ABNORMAL INSTANCE TERMINATION BY LMD0