Troubleshooting RMAN backup hang waiting for Autobackup, DB alert log ORA-00230
昨天遇到一个问题oracle 19c(19.10), 最近一段时间的RMAN备份任务都失败了,错误中有ORA-00230: operation disallowed: snapshot control file enqueue unavailable, 看着是cf的enqueue请求失败。 处理起来比较简单,找到blocker session就可以解决了, 之前的《Troubleshooting performance event ‘enq: CF – contention’》记录过一些分析方法,本次的案例当前还没有匹配的已知bug,这里简单的记录。
RMAN: Backup Fails with ORA-230 Snapshot Controlfile Enqueue Unavailable [ID 145619.1]
分享过一个SQL:
Execute the following query to determine which job is causing the wait
SELECT s.sid, username AS "User", program, module, action, logon_time "Logon", l.* FROM v$session s, v$enqueue_lock l WHERE l.sid = s.sid and l.type = 'CF' AND l.id1 = 0 and l.id2 = 2;
这里确实可以找到是一个M002进程,但表示是什么意思? ID2 = 2是什么意思? 不确认的事情尽量不要操作,我们尝试手动备份一下问题是不是真的存在hang,比如备份一个spfile。
oracle@anbob1:/home/oracle> rman target / debug trace=rman_debug.log RMAN> show all 2> ; using target database control file instead of recovery catalog RMAN configuration parameters for database with db_unique_name PRIBILL are: CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default CONFIGURE BACKUP OPTIMIZATION OFF; CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default CONFIGURE CONTROLFILE AUTOBACKUP ON; # default CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default CONFIGURE DEVICE TYPE DISK PARALLELISM 10 BACKUP TYPE TO BACKUPSET; CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default CONFIGURE MAXSETSIZE TO UNLIMITED; # default CONFIGURE ENCRYPTION FOR DATABASE OFF; # default CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default CONFIGURE RMAN OUTPUT TO KEEP FOR 7 DAYS; # default CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+ARCHDG/xxxxxxx.f'; RMAN> backup spfile format '/home/oracle/spfile_bak_f'; RMAN-03090: Starting backup at 11-APR-22 RMAN-12016: using channel ORA_DISK_1 RMAN-12016: using channel ORA_DISK_2 RMAN-12016: using channel ORA_DISK_3 RMAN-12016: using channel ORA_DISK_4 RMAN-12016: using channel ORA_DISK_5 RMAN-12016: using channel ORA_DISK_6 RMAN-12016: using channel ORA_DISK_7 RMAN-12016: using channel ORA_DISK_8 RMAN-12016: using channel ORA_DISK_9 RMAN-12016: using channel ORA_DISK_10 RMAN-08008: channel ORA_DISK_1: starting full datafile backup set RMAN-08010: channel ORA_DISK_1: specifying datafile(s) in backup set RMAN-08113: including current SPFILE in backup set RMAN-08038: channel ORA_DISK_1: starting piece 1 at 11-APR-22 RMAN-08044: channel ORA_DISK_1: finished piece 1 at 11-APR-22 RMAN-08530: piece handle=/home/oracle/spfile_bak_f tag=TAG20220411T173319 comment=NONE RMAN-08540: channel ORA_DISK_1: backup set complete, elapsed time: 00:00:02 RMAN-03091: Finished backup at 11-APR-22 RMAN-03090: Starting Control File and SPFILE Autobackup at 11-APR-22 -- hang ^C
— debug trace
RMAN-03090: Starting Control File and SPFILE Autobackup at 11-APR-22 DBGMISC: ENTERED krmicomp [17:33:21.598] DBGANY: trying implicit resync before compiling command Control File and SPFILE Autobackup [17:33:21.598] ... DBGPLSQL: the compiled command tree is: [17:33:21.768] (krmicomp) DBGPLSQL: 1 CMD type=Control File and SPFILE Autobackup cmdid=1 status=NOT STARTED DBGPLSQL: 1 STEPstepid=1 cmdid=1 status=NOT STARTED DBGPLSQL: 1 TEXTNOD = declare DBGPLSQL: 2 TEXTNOD = ncopies number; DBGPLSQL: 3 TEXTNOD = copyno number; DBGPLSQL: 4 TEXTNOD = handle varchar2(512); ... DBGPLSQL: 31 TEXTNOD = DBGPLSQL: 32 PRMVAL = rsid := 51096; rsts := 1101749601; DBGPLSQL: 33 TEXTNOD = setBackupParams(FALSE); DBGPLSQL: 34 TEXTNOD = if (krmicd.getParams(1, p2, p3, p4, p5, t1, t2, t3)) then DBGPLSQL: 35 TEXTNOD = p1 := 1; DBGPLSQL: 36 TEXTNOD = end if; DBGPLSQL: 37 TEXTNOD = sys.dbms_backup_restore.setRmanStatusRowId(rsid=>rsid, rsts=>rsts); DBGPLSQL: 38 TEXTNOD = DBGPLSQL: 39 TEXTNOD = <> -- retry on failure to get snapshot enqueue DBGPLSQL: 40 TEXTNOD = begin DBGPLSQL: 41 TEXTNOD = sys.dbms_backup_restore.DoAutobackup(ncopies => ncopies, DBGPLSQL: 42 TEXTNOD = cfaudate => lcfaudate, DBGPLSQL: 43 TEXTNOD = seq => lsequence, DBGPLSQL: 44 TEXTNOD = format => lbautfmt, DBGPLSQL: 45 TEXTNOD = p1 => p1, DBGPLSQL: 46 TEXTNOD = p2 => p2, DBGPLSQL: 47 TEXTNOD = p3 => p3, DBGPLSQL: 48 TEXTNOD = p4 => t1); DBGPLSQL: 49 TEXTNOD = exception DBGPLSQL: 50 TEXTNOD = when sys.dbms_backup_restore.snapshot_enqueue_busy then DBGPLSQL: 51 TEXTNOD = -- DBGPLSQL: 52 TEXTNOD = -- DBGPLSQL: 53 TEXTNOD = -- DBGPLSQL: 54 TEXTNOD = -- DBGPLSQL: 55 TEXTNOD = if busy_retries = 180 then DBGPLSQL: 56 TEXTNOD = krmicd.writeMsg(20029, 'cannot make a snapshot controlfile'); DBGPLSQL: 57 TEXTNOD = raise; DBGPLSQL: 58 TEXTNOD = end if; DBGPLSQL: 59 TEXTNOD = DBGPLSQL: 60 TEXTNOD = busy_retries := busy_retries + 1; .. DBGMISC: Executing command Control File and SPFILE Autobackup [17:33:21.770] (krmiexe) DBGRPC: krmxr - entering DBGRPC: krmxpoq - returning rpc_number: 16 with status: FINISHED131 for channel default DBGRPC: krmxr - channel default has rpc_count: 16 .. DBGRPC: krmxrpc - channel ORA_DISK_1 kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.SETRMANSTATUSROWID excl: 0 DBGRPC: krmxrpc - channel ORA_DISK_1 kpurpc2 err=3123 db=target proc=SYS.DBMS_BACKUP_RESTORE.DOAUTOBACKUP excl: 0 DBGRPC: krmxr - channel ORA_DISK_1 returned from peicnt DBGRPC: krmxpoq - returning rpc_number: 51 with status: STARTED111 for channel ORA_DISK_1 DBGMISC: ENTERED krmchkintrvl [17:33:21.778] DBGMISC: krmsod2s: time is: 11/4/2022 17:33:21 = returning: 1101749601 DBGMISC: last*time_krmxpon initialized ENTERED krmqgns krmqgns: looking for work for channel default (krmqgns) DBGRPC: krmqgns: commands remaining to be executed: (krmqgns) DBGRPC: CMD type=Control File and SPFILE Autobackup cmdid=1 status=STARTED ... DBGMISC: ENTERED krmchkintrvl [17:33:22.781] DBGMISC: krmsod2s: time is: 11/4/2022 17:33:22 = returning: 1101749602 DBGMISC: EXITED krmchkintrvl [17:33:22.781] elapsed time [00:00:00:00.000] DBGRPC: krmxr - channel ORA_DISK_1 calling peicnt DBGRPC: krmxrpc - channel ORA_DISK_1 kpurpc2 err=230 db=target proc=SYS.DBMS_BACKUP_RESTORE.DOAUTOBACKUP excl: 177 DBGRPC: krmxrpc - caloing krmxtrim: with message of length 177: @@@ORA-00230: operation disallowed: snapshot control file enqueue unavailable DBGRPC: ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 6773 DBGRPC: ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 6766
Note:
RMAN发生在CONTROLFILE AUTOBACKUP阶段挂起,如果尝试禁用后AUTOBACKUP后(CONFIGURE CONTROLFILE AUTOBACKUP OFF),备份可以正常,但是db alert log还在一直有其它进程报错ORA-00230。
db alert log
022-04-11T18:11:50.353144+08:00
******************** WARNING **************************
The errors during server control file autobackup are not
fatal, as it is attempted after sucessful completion of
the command. However, it is recomended to take an RMAN
control file backup as soon as possible because the
autobackup failed with the following error:
ORA-00230: operation disallowed: snapshot control file enqueue unavailable
******************** END OF WARNING *******************
2022-04-11T18:11:50.354271+08:00
Errors in file /oracle/app/oracle/diag/rdbms/pribill/bill1/trace/bill1_m001_101668.trc:
ORA-00230: operation disallowed: snapshot control file enqueue unavailable
2022-04-11T18:12:00.235066+08:00
ARC2 (PID:98254): Archived Log entry 101518 added for T-1.S-41046 ID 0x859e4e90 LAD:1
2022-04-11T18:12:07.760192+08:00
Control file autobackup failed with following error
ORA-00230: operation disallowed: snapshot control file enqueue unavailable
2022-04-11T18:12:28.760837+08:00
Control file autobackup failed with following error
ORA-00230: operation disallowed: snapshot control file enqueue unavailable
Note:
说明当前还是有进程一直没有释放Controlfile enqueue, 继续找holder.
SQL> select * from v$lock where TYPE='CF'; ADDR KADDR SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK CON_ID ---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 0000000D27AD0600 0000000D27AD0630 3118 CF 0 2 6 0 3088467 2 0 0000000D279D53D8 0000000D279D5408 8456 CF 0 0 2 0 12673954 2 0 SQL> @lt cf Show lock type info from V$LOCK_TYPE for lock cf TYPE LOCK NAME ID1 MEANING ID2 MEANING LT_ DESCRIPTION ---- -------------------------------- ------------------------- ------------------------- --- ------------------------------------------------------------ CF Controlfile Transaction 0 operation NO Synchronizes accesses to the controlfile SQL> @usid 3118 USERNAME SID AUDSID OSUSER MACHINE PROGRAM SPID OPID CPID SQL_ID HASH_VALUE LASTCALL STATUS SADDR PADDR TADDR LOGON_TIME ----------------------- -------------- ----------- ---------------- ------------------ -------------------- -------------- ------ ------------------------ ------------- ----------- ---------- -------- ---------------- ---------------- ---------------- ------------------- '3118,65101' 0 oracle anbob1 (M002) 91436 443 91436 0 3253702 ACTIVE 0000000D63EF6B00 0000000DC34B3CB8 2022-03-05 02:10:29
Note:
我们从v$lock view可以看到CF当前的以mode 6持有的会话,对应的ID2为2, 记录的是operation, 最开始提到的v$enqueue_lock VIEW,基本上字段和v$lock匹配,ID2=2 猜测是有对controlfile operation=2修改。 mode 6的holder session为3118对应的M002进程。为什么是Mnnn进程?
Mnnn进程是MMON的slave附属进程,从ASH 查看了MMON进程没有异常的wait,和持续执行的action, 我们知道MMON负责AWR相关信息的收集,MMON进程和它的附属进程都可以kill 正常不会导致实例crash, 在之前的《Oracle致命进程》一文有提到, 那为什么是Mnnn进程?有一种可能是如开启了闪回数据库,收集闪回相关的统计信息时因为太大而堵塞, 这套库并未启用闪回数据库, 上面我们注意到在是autobackup controlfile时hang, 那autobackup controlfile是哪个进程负责的呢? 在去年的数据嘉年化的PPT中有分享过,正是Mnnn进程。
所以我们有理由相信是上次M002进程未release CF enqueue.
看下blocker session
SQL> select * from v$session where sid=3118; SADDR SID SERIAL# AUDSID PADDR USER# USERNAME COMMAND OWNERID TADDR LOCKWAIT STATUS SERVER SCHEMA# SCHEMANAME OSUSER PROCESS MACHINE PORT TERMINAL PROGRAM TYPE SQL_ADDRESS SQL_HASH_VALUE SQL_ID SQL_CHILD_NUMBER SQL_EXEC_START SQL_EXEC_ID PREV_SQL_ADDR PREV_HASH_VALUE PREV_SQL_ID PREV_CHILD_NUMBER PREV_EXEC_START PREV_EXEC_ID PLSQL_ENTRY_OBJECT_ID PLSQL_ENTRY_SUBPROGRAM_ID PLSQL_OBJECT_ID ---------------- ---------- ---------- ---------- ---------------- ---------- -------------------------------------------------------------------------------------------------------------------------------- ---------- ---------- ---------------- ---------------- -------- --------- ---------- -------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------- ------------------------ ---------------------------------------------------------------- ---------- ------------------------------ ------------------------------------------------ ---------- ---------------- -------------- ------------- ---------------- ------------------- ----------- ---------------- --------------- ------------- ----------------- ------------------- ------------ --------------------- ------------------------- --------------- PLSQL_SUBPROGRAM_ID MODULE MODULE_HASH ACTION ACTION_HASH CLIENT_INFO FIXED_TABLE_SEQUENCE ROW_WAIT_OBJ# ROW_WAIT_FILE# ROW_WAIT_BLOCK# ROW_WAIT_ROW# TOP_LEVEL_CALL# LOGON_TIME LAST_CALL_ET PDM FAILOVER_TYPE FAILOVER_M FAI RESOURCE_CONSUMER_GROUP PDML_STA PDDL_STA PQ_STATU CURRENT_QUEUE_DURATION CLIENT_IDENTIFIER BLOCKING_SE BLOCKING_INSTANCE BLOCKING_SESSION FINAL_BLOCK FINAL_BLOCKING_INSTANCE FINAL_BLOCKING_SESSION SEQ# EVENT# EVENT P1TEXT P1 P1RAW P2TEXT P2 P2RAW P3TEXT ------------------- ---------------------------------------------------------------- ----------- ---------------------------------------------------------------- ----------- ---------------------------------------------------------------- -------------------- ------------- -------------- --------------- ------------- --------------- ------------------- ------------ --- ------------- ---------- --- -------------------------------- -------- -------- -------- ---------------------- ---------------------------------------------------------------- ----------- ----------------- ---------------- ----------- ----------------------- ---------------------- ---------- ---------- ---------------------------------------------------------------- ---------------------------------------------------------------- ---------- ---------------- ---------------------------------------------------------------- ---------- ---------------- ---------------------------------------------------------------- P3 P3RAW WAIT_CLASS_ID WAIT_CLASS# WAIT_CLASS WAIT_TIME SECONDS_IN_WAIT STATE WAIT_TIME_MICRO TIME_REMAINING_MICRO TIME_SINCE_LAST_WAIT_MICRO SERVICE_NAME SQL_TRAC SQL_T SQL_T SQL_TRACE_ SESSION_EDITION_ID CREATOR_ADDR CREATOR_SERIAL# ECID SQL_TRANSLATION_PROFILE_ID PGA_TUNABLE_MEM SHARD_DD CON_ID ---------- ---------------- ------------- ----------- ---------------------------------------------------------------- ---------- --------------- ------------------- --------------- -------------------- -------------------------- ---------------------------------------------------------------- -------- ----- ----- ---------- ------------------ ---------------- --------------- ---------------------------------------------------------------- -------------------------- --------------- -------- ---------- EXTERNAL_NAME --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- PLSQL ----- 0000000D63EF6B00 3118 65101 0 0000000DC34B3CB8 0 0 2147483644 ACTIVE DEDICATED 0 SYS oracle 91436 anbob1 0 UNKNOWN oracle@anbob1 (M002) BACKGROUND 00 0 0000000B3F936118 124107020 gjaap3w3qbf8c 0 2022-04-11 17:58:19 17191814 MMON_SLAVE 152940696 Intensive AutoTask Dispatcher 2862313749 3459593550 22 1 20515 0 59 2022-03-05 02:10:29 3253763 NO NONE NONE NO ORA$AUTOTASK DISABLED ENABLED ENABLED 0 NO HOLDER NO HOLDER 14519 75 class slave wait slave id 0 00 0 00 0 00 2723168908 6 Idle 0 33 WAITING 32503246 -1 0 SYS$BACKGROUND DISABLED FALSE FALSE FIRST EXEC 133 0000000DC34B3CB8 1443 0 0 DISABLED 0 FALSE SQL> @sw 3118 SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3 P1TRANSL ------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- ------------------- ------------------------------------------ 3118 WAITING class slave wait 14519 51 slave id= 0 0 0 SQL> @s 3118 SID SQLID_AND_CHILD STATUS STATE EVENT SEQ# SEC_IN_WAIT BLOCKING_SID P1 P2 P3 P1TRANSL ------- -------------------- -------- ------- ---------------------------------------- ---------- ----------- ------------ ------------------ ------------------ ------------------- ------------------------------------------ 3118 ACTIVE WAITING class slave wait 14519 58 NO HOLDER slave id= 0 0 0 SQL> @usid 3118 USERNAME SID AUDSID OSUSER MACHINE PROGRAM SPID OPID CPID SQL_ID HASH_VALUE LASTCALL STATUS SADDR PADDR TADDR LOGON_TIME ----------------------- -------------- ----------- ---------------- ------------------ -------------------- -------------- ------ ------------------------ ------------- ----------- ---------- -------- ---------------- ---------------- ---------------- ------------------- '3118,65101' 0 oracle anbob1 (M002) 91436 443 91436 0 3253816 ACTIVE 0000000D63EF6B00 0000000DC34B3CB8 2022-03-05 02:10:29 SQL> oradebug setorapid 443 Oracle pid: 443, Unix process pid: 91436, image: oracle@anbob1 (M002) SQL> oradebug short_stack; ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2192<-kslwaitctx()+200<-ksvrdp_int()+3412<-opirip()+583<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245 SQL> oradebug short_stack; ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2192<-kslwaitctx()+200<-ksvrdp_int()+3412<-opirip()+583<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245 SQL> oradebug short_stack; ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-semtimedop()+10<-skgpwwait()+192<-ksliwat()+2192<-kslwaitctx()+200<-ksvrdp_int()+3412<-opirip()+583<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245 SQL> oradebug hanganalyze 3 Hang Analysis in /oracle/app/oracle/diag/rdbms/pribill/bill1/trace/bill1_ora_118663.trc
Note:
M002进程的ACTION为Intensive AutoTask Dispatcher , 长时间处理active,函数调用如下,也未发现hang 链条,所以虽然数据库默认有kill block机制也未触发kill .
Parsed short stack: semtimedop()+10 SYSCALL: semaphore timed operation skgpwwait()+192 operating system dependent kernel generic post/wait wait ksliwat()+2192 kernel service latching and post-wait inner wait function; setup a wait that times out kslwaitctx()+200 kernel service latching and post-wait wait for n centi-seconds or until posted wait context; wait until timeout ksvrdp_int() kernel service slave management run generic detached slave process internal
既然没有自动KILL, 我们可以先手动KILL处理。
oracle@anbob1:/home/oracle> ps -ef|grep 91436 oracle 91436 1 0 Mar05 ? 00:40:39 ora_m002_bill1 oracle@anbob1:/home/oracle> kill -9 91436 SQL> select * from gv$lock where TYPE='CF'; INST_ID ADDR KADDR SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK CON_ID ---------- ---------------- ---------------- ------- -- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 2 0000000D279D4C88 0000000D279D4CB8 8456 CF 0 0 2 0 12674744 2 0 2 0000000D27AEEDB8 0000000D27AEEDE8 11665 CF 0 2 6 0 316 2 0 1 0000000D279D53D8 0000000D279D5408 8456 CF 0 0 2 0 12674750 2 0 # node2 SQL> @usid 11665 USERNAME SID AUDSID OSUSER MACHINE PROGRAM SPID OPID CPID SQL_ID HASH_VALUE LASTCALL STATUS SADDR PADDR TADDR LOGON_TIME ----------------------- -------------- ----------- ---------------- ------------------ -------------------- -------------- ------ ------------------------------------------------ -------------------------- ----------- ---------- ---------------- ---------------- ---------------- -------------------------------- ------------------- '11665,31627' 0 oracle ldjfa2 (M001) 82268 131 82268 oracle@anbob2:/home/oracle> kill -9 82268
Note:
实例2的M001同样的现象,一并kill.
再次测试备份
RMAN> backup spfile format '/home/oracle/spfile_bak_f3';
Starting backup at 11-APR-22
using channel ORA_DISK_1
..
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 11-APR-22
channel ORA_DISK_1: finished piece 1 at 11-APR-22
piece handle=/home/oracle/spfile_bak_f3 tag=TAG20220411T181021 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 11-APR-22
Starting Control File and SPFILE Autobackup at 11-APR-22
piece handle=/oracle/app/oracle/product/19c/db_1/dbs/c-2179257193-20220411-00 comment=NONE
Finished Control File and SPFILE Autobackup at 11-APR-22
Note:
自动备份成功,同是db alert log中不再提示ora-230错误。
— over —
对不起,这篇文章暂时关闭评论。