12c R2 RAC Instance Crash Due to a Hung CKPT Process When Using Multipathing
Some time ago, a friend's 12c R2 three-node RAC on Linux kept crashing intermittently (roughly once every few days). The environment used the native Linux multipath software with its default configuration.
— alert log
2017-11-21T21:52:48.474246+08:00
Thread 3 advanced to log sequence 26159 (LGWR switch)
  Current log# 17 seq# 26159 mem# 0: +DATADG/anbob/ONLINELOG/group_17
2017-11-21T21:55:42.614168+08:00
IMR0 (ospid: 53353) waits for event 'control file sequential read' for 58 secs.
2017-11-21T21:55:42.614240+08:00
IMR0 (ospid: 53353) is hung in an acceptable location (cfio 0x11.00).
2017-11-21T21:55:55.374988+08:00
IMR0 (ospid: 53353) waits for event 'control file sequential read' for 81 secs.
2017-11-21T21:55:58.554970+08:00
IMR has experienced some problems and can't re-enable, retry count 1
2017-11-21T21:56:20.580385+08:00
CKPT (ospid: 53295) waits for event 'control file parallel write' for 95 secs.
2017-11-21T21:56:22.776564+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_lmhb_53275.trc (incident=824802) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process CKPT (OSID 53295) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob3/incident/incdir_824802/anbob3_lmhb_53275_i824802.trc
2017-11-21T21:56:25.265258+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_anbob by memno 0
ERROR: Some process(s) is not making progress.
LMHB (ospid: 53275) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 53275): terminating the instance due to error 29770
2017-11-21T21:56:25.763179+08:00
System state dump requested by (instance=3, osid=53275 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob3/trace/anbob3_diag_53235_20171121215625.trc
2017-11-21T21:56:27.521552+08:00
License high water mark = 172
2017-11-21T21:56:31.945109+08:00
Instance terminated by LMHB, pid = 53275
2017-11-21T21:56:31.945889+08:00
Warning: 2 processes are still attach to shmid 458758:
 (size: 24576 bytes, creator pid: 53153, last attach/detach pid: 53235)
2017-11-21T21:56:32.530103+08:00
USER (ospid: 34072): terminating the instance
2017-11-21T21:56:32.535249+08:00
Instance terminated by USER, pid = 34072
— lmhb trace anbob3_lmhb_53275_i824802.trc
----- Beginning of Customized Incident Dump(s) -----
==================================================
=== CKPT (ospid: 53295) Heartbeat Report
==================================================
CKPT (ospid: 53295) has no heartbeats for 103 sec. (threshold 70)
  : heartbeat state 0x11.00 (cfio) pso-flag 0x100
  : waiting for event 'control file parallel write' for 97 secs with wait_id 4255818.
===[ Wait Chain ]===
Wait chain is empty.
=[SNAPSHOT SUMMARY]=
Process moved according to collected snapshots
===[NO-HB PROCESS CALLSTACK]===
 ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-io_queue_run()+100<-skgfrwat()+196
 <-ksfdwtio()+306<-ksfdwat_internal()+234<-kfk_reap_ufs_async_io()+166<-kfk_reap_ios_from_subsys()+1000<-kfk_reap_ios()+409
 <-kfk_io1()+1449<-kfk_transitIO()+451<-kfioReapIO()+633<-kfioRequestPriv()+216<-kfioRequest()+337<-ksfdafRequest()+742
 <-ksfdafWait()+371<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+325<-ksfdbio()+2910<-ksfdbio()+3035<-kccwbp()+1035
 <-kccbmp_wdl()+257<-kccfhb()+144<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303<-kctrcp()+1942<-kcvcca()+189<-ksbcti()+247
 <-ksbabs()+11524<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262
 <-__libc_start_main()+253
==============================
==============================
Dumping PROCESS CKPT (ospid: 53295) States
==============================
===[ System Load State ]===
CPU Total 56 Raw 56 Core 28 Socket 2
Load normal: Cur 1144 Highmark 71680 (4.46 280.00)
===[ Latch State ]===
Not in Latch Get
===[ Session State Object ]===
----------------------------------------
SO: 0x23251ba70, type: 4, owner: 0x1f1d4c328, flag: INIT/-/-/-/0x00 if: 0x3 c: 0x3
 proc=0x1f1d4c328, name=session, file=ksu.h LINE:15737, pg=0, conuid=1
SGA version=(1,0)
(session) sid: 4720 ser: 27225 trans: (nil), creator: 0x1f1d4c328
          flags: (0x51) USR/- flags2: (0x409) -/-/INC flags_idl: (0x1)
          status: BSY/-/-/- kill: -/-/-/-
          DID: 0003-0027-0000000A0000-0000-00000000, short-term DID:
          txn branch: (nil)
          con_id/con_uid/con_name: 1/1/CDB$ROOT
          con_logonuid: 1 con_logonid: 1
          con_scuid: 1 con_scid: 1
          edition#: 0 user#/name: 0/SYS
          oct: 0, prv: 0, sql: (nil), psql: (nil)
          stats: 0x1ffa5a0b0, PX stats: 0x1101de44
  service name: SYS$BACKGROUND
  Current Wait Stack:
   0: waiting for 'control file parallel write'
      files=0x1, block#=0x27, requests=0x1
      wait_id=4255818 seq_num=61824 snap_id=1
      wait times: snap=1 min 37 sec, exc=1 min 37 sec, total=1 min 37 sec
      wait times: max=infinite, heur=1 min 37 sec
      wait counts: calls=0 os=0
      in_wait=1 iflags=0x5a0
   There are 4 sessions blocked by this session.
   Dumping one waiter:
     inst: 3, sid: 3389, ser: 36254
     wait event: 'enq: CF - contention'
       p1: 'name|mode'=0x43460004
       p2: '0'=0x0
       p3: 'operation'=0x0
     row_wait_obj#: 213, block#: 1969, row#: 0, file# 1
     min_blocked_time: 92 secs, waiter_cache_ver: 59413
   Wait State:
     fixed_waits=0 flags=0x22 boundary=(nil)/-1
  Session Wait History:
      elapsed time of 0.000027 sec since current wait
   0: waited for 'control file parallel write'
      files=0x1, block#=0x29, requests=0x1
      wait_id=4255817 seq_num=61823 snap_id=1
      wait times: snap=0.000462 sec, exc=0.000462 sec, total=0.000462 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000029 sec of elapsed time
   1: waited for 'control file parallel write'
      files=0x1, block#=0x372, requests=0x1
      wait_id=4255816 seq_num=61822 snap_id=1
      wait times: snap=0.000589 sec, exc=0.000589 sec, total=0.000589 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000051 sec of elapsed time
   2: waited for 'control file sequential read'
      file#=0x0, block#=0x371, blocks=0x1
      wait_id=4255815 seq_num=61821 snap_id=1
      wait times: snap=0.000279 sec, exc=0.000279 sec, total=0.000279 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000010 sec of elapsed time
   3: waited for 'control file sequential read'
      file#=0x0, block#=0x2a, blocks=0x1
      wait_id=4255814 seq_num=61820 snap_id=1
      wait times: snap=0.000272 sec, exc=0.000272 sec, total=0.000272 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000004 sec of elapsed time
   4: waited for 'control file sequential read'
      file#=0x0, block#=0x28, blocks=0x1
      wait_id=4255813 seq_num=61819 snap_id=1
      wait times: snap=0.000280 sec, exc=0.000280 sec, total=0.000280 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000009 sec of elapsed time
   5: waited for 'control file sequential read'
      file#=0x0, block#=0x1, blocks=0x1
      wait_id=4255812 seq_num=61818 snap_id=1
      wait times: snap=0.000319 sec, exc=0.000319 sec, total=0.000319 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000070 sec of elapsed time
   6: waited for 'enq: CF - contention'
      name|mode=0x43460005, 0=0x0, operation=0x0
      wait_id=4255811 seq_num=61817 snap_id=1
      wait times: snap=0.000257 sec, exc=0.000257 sec, total=0.000257 sec
      wait times: max=0.000000 sec
      wait counts: calls=3 os=3
      occurred after 0.000800 sec of elapsed time
   7: waited for 'rdbms ipc message'
      timeout=0x12c, =0x0, =0x0
      wait_id=4255810 seq_num=61816 snap_id=1
      wait times: snap=3.002193 sec, exc=3.002193 sec, total=3.002193 sec
      wait times: max=3.000000 sec
      wait counts: calls=1 os=1
      occurred after 0.000124 sec of elapsed time
   8: waited for 'control file sequential read'
      file#=0x0, block#=0x34a, blocks=0x1
      wait_id=4255809 seq_num=61815 snap_id=1
      wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000005 sec of elapsed time
   9: waited for 'control file sequential read'
      file#=0x0, block#=0x2a, blocks=0x1
      wait_id=4255808 seq_num=61814 snap_id=1
      wait times: snap=0.000341 sec, exc=0.000341 sec, total=0.000341 sec
      wait times: max=infinite
      wait counts: calls=0 os=0
      occurred after 0.000006 sec of elapsed time
... ...
*** 2017-11-21T21:56:24.918643+08:00
loadavg : 4.47 2.09 1.21
System user time: 0.00 sys time: 0.00 context switch: 19791
Memory (Avail / Total) = 107148.16M / 257929.37M
Swap (Avail / Total) = 32768.00M / 32768.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle   53295     1  0  80   0 - 19975755 read_e Nov13 ?        00:06:16 ora_ckpt_anbob3
Name:   ora_ckpt_anbob3
State:  S (sleeping)
Tgid:   53295
Pid:    53295
PPid:   1
TracerPid:      0
Uid:    602     602     602     602
Gid:    601     604     604     604
Utrace: 0
FDSize: 512
Groups: 601 602 603 604 606
VmPeak:   79907024 kB
VmSize:   79903020 kB
VmLck:           0 kB
VmHWM:     3446908 kB
VmRSS:     3438240 kB
VmData:      25352 kB
VmStk:         164 kB
VmExe:      342688 kB
VmLib:       30328 kB
VmPTE:       14816 kB
VmSwap:          0 kB
Threads:        1
SigQ:   1/1031607
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000016400207
SigCgt: 00000003c9887cf8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffff,ffffffff
Cpus_allowed_list:      0-55
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        3914186
nonvoluntary_ctxt_switches:     43390
Short stack dump:
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-io_queue_run()+100<-skgfrwat()+196<-ksfdwtio()+306
<-ksfdwat_internal()+234<-kfk_reap_ufs_async_io()+166<-kfk_reap_ios_from_subsys()+1000<-kfk_reap_ios()+409<-kfk_io1()+1449
<-kfk_transitIO()+451<-kfioReapIO()+633<-kfioRequestPriv()+216<-kfioRequest()+337<-ksfdafRequest()+742<-ksfdafWait()+371
<-ksfdwtio()+773<-ksfdwat_internal()+234<-ksfdblock()+325<-ksfdbio()+2910<-ksfdbio()+3035<-kccwbp()+1035<-kccbmp_wdl()+257
<-kccfhb()+144<-kcccmt()+244<-kccecx()+80<-kctmttrest()+303<-kctrcp()+1942<-kcvcca()+189<-ksbcti()+247<-ksbabs()+11524
<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+253
Note:
From the alert log, the CKPT process was hung for more than 70 seconds, so the LMHB process terminated the instance. The call stack in the CKPT trace file strongly suggests the process was stuck in I/O. The database uses the native Linux multipath software with its default configuration. Normally, I/O errors below ASM mirroring and multipathing are transparent to the processes doing I/O above them, but while failed paths are being retried those processes can hang, and a prolonged hang of a critical background process such as CKPT or LMON is fatal to the instance. No directly matching bug or note was found on MOS, but there is a case of an Extended RAC database crash caused by a remote I/O failure; following that case, the multipath configuration parameters were tentatively adjusted as follows:
polling_interval = 10
no_path_retry = 3
/sys/module/scsi_transport_fc/parameters/dev_loss_tmo = 30
polling_interval: Specifies the interval between two path checks in seconds. For properly functioning paths, the interval between checks will gradually increase to (4 * polling_interval). The default value is 5.
no_path_retry: A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.
A value of fail indicates immediate failure, without queueing. A value of queue indicates that queueing should not stop until the path is fixed.
The "dev_loss_tmo" option controls how long the SCSI layer waits, after losing connectivity to a device, before marking the device as failed.
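The first two values would typically go in /etc/multipath.conf. The author's actual file is not shown, so the snippet below is a minimal sketch: the stanza placement (the defaults section) and comments are assumptions, and the values simply mirror the adjustment above. Vendor device stanzas can override these, so check where your storage vendor expects them.

```conf
# /etc/multipath.conf -- minimal sketch; merge into your existing
# defaults section rather than replacing the whole file.
defaults {
    # Check each path every 10 seconds; for healthy paths the checker
    # interval gradually backs off to 4 * polling_interval.
    polling_interval    10

    # Retry a failed path 3 times, then fail the I/O instead of
    # queueing it indefinitely. Indefinite queueing is what can leave
    # processes like CKPT blocked in 'control file parallel write'
    # long enough for LMHB to kill the instance.
    no_path_retry       3
}
```

Depending on the multipath-tools version, the configuration can be reloaded with `multipathd -k"reconfigure"` (or by restarting the multipathd service), and the effective values checked with `multipathd show config`. The dev_loss_tmo value was applied through the sysfs module-parameter path quoted above.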
The OS and multipath logs were not provided, but after the adjustments above were made and the multipath service was restarted, the database instances were observed for a month without another crash.