Troubleshooting Oracle Wait Chains ‘buffer busy wait’、’log file sync’、 ‘enq: CF – contention’ during RMAN backup
最近一个客户的数据库出现了短暂1分钟左右的高负载,环境为oracle 11.2.0.4 2nodes RAC , 看当时的top event主要是buffer busy wait和log file sync, 关于buffer busy wait 在Troubleshooting Oracle wait event ‘buffer busy wait’ 有记录过分析方法,是一种hot block现象, 而log file sync又是提交类的,当两者同时发生时,可能是后者影响前者,首先建议排查I/O, 而另1个节点此时正在做RMAN增量备份,简单记录该问题.
ASH 报告
— Instance 2
Top User Events
Event | Event Class | % Event | Avg Active Sessions |
---|---|---|---|
buffer busy waits | Concurrency | 32.78 | 11.27 |
log file sync | Commit | 27.64 | 9.50 |
CPU + Wait for CPU | CPU | 27.22 | 9.36 |
db file sequential read | User I/O | 3.12 | 1.07 |
enq: TX – row lock contention | Application | 1.87 | 0.64 |
Top Background Events
Event | Event Class | % Activity | Avg Active Sessions |
---|---|---|---|
Streams AQ: qmn coordinator waiting for slave to start | Other | 1.12 | 0.39 |
Top Event P1/P2/P3 Values
Event | % Event | P1 Value, P2 Value, P3 Value | % Activity | Parameter 1 | Parameter 2 | Parameter 3 |
---|---|---|---|---|---|---|
buffer busy waits | 32.78 | “1”,”59943″,”1″ | 3.01 | file# | block# | class# |
buffer busy waits
|
32.78243590364949352978238743759996122716
|
“1”,”62769″,”1″ | 2.73 |
file#
|
block#
|
class#
|
buffer busy waits
|
32.78243590364949352978238743759996122716
|
“1”,”228″,”1″ | 2.73 |
file#
|
block#
|
class#
|
log file sync | 27.66 | “4294967295”,”3388657110″,”0″ | 0.15 | buffer# | sync scn | NOT DEFINED |
Top Blocking Sessions
Blocking Sid (Inst) | % Activity | Event Caused | % Event | User | Program | # Samples Active | XIDs |
---|---|---|---|---|---|---|---|
4348, 1( 2) | 27.98 | log file sync | 27.66 | SYS | oracle@anbob2 (LGWR) | 248/1,200 [ 21%] | 0 |
4479,20405( 2) | 2.09 | buffer busy waits | 2.09 | USER1 | proc1#svr31 | 14/1,200 [ 1%] | 1 |
1148,53443( 2) | 1.66 | buffer busy waits | 1.66 | USER1 | proc1#svr25 | 18/1,200 [ 2%] | 0 |
5868,21505( 2) | 1.50 | buffer busy waits | 1.50 | USER1 | proc1#svr26 | 21/1,200 [ 2%] | 0 |
3663, 7499( 2) | 1.41 | buffer busy waits | 1.41 | USER1 | proc1#svr29 | 17/1,200 [ 1%] | 1 |
— Instance 1
Top User Events
Event | Event Class | % Event | Avg Active Sessions |
---|---|---|---|
CPU + Wait for CPU | CPU | 6.67 | 0.26 |
gc buffer busy acquire | Cluster | 3.18 | 0.12 |
gc cr block busy | Cluster | 1.18 | 0.05 |
Top Background Events
Event | Event Class | % Activity | Avg Active Sessions |
---|---|---|---|
Backup: MML write backup piece | Administrative | 62.90 | 2.40 |
CPU + Wait for CPU | CPU | 12.93 | 0.49 |
enq: CF – contention | Other | 2.62 | 0.10 |
RMAN backup & recovery I/O | System I/O | 1.94 | 0.07 |
ASM file metadata operation | Other | 1.35 | 0.05 |
Top Event P1/P2/P3 Values
Event | % Event | P1 Value, P2 Value, P3 Value | % Activity | Parameter 1 | Parameter 2 | Parameter 3 |
---|---|---|---|---|---|---|
gc buffer busy acquire | 3.18 | “22”,”177″,”1″ | 1.55 | file# | block# | class# |
gc buffer busy acquire
|
3.18290821887944190102463483758447787225
|
“22”,”174″,”1″ | 1.53 |
file#
|
block#
|
class#
|
enq: CF – contention | 2.64 | “1128660996”,”0″,”0″ | 2.62 | name|mode | 0 | operation |
RMAN backup & recovery I/O | 1.94 | “1”,”256″,”2147483647″ | 1.90 | count | intr | timeout |
ASM file metadata operation | 1.55 | “6268”,”9″,”0″ | 0.78 | msgop | locn | NOT DEFINED |
log file sequential read | 1.33 | “1”,”1″,”1″ | 0.15 | log# | block# | blocks |
Top Blocking Sessions
Blocking Sid (Inst) | % Activity | Event Caused | % Event | User | Program | # Samples Active | XIDs |
---|---|---|---|---|---|---|---|
5307,19313( 1) | 2.62 | enq: CF – contention | 2.62 | SYS | rman@anbob1(TNS V1-V3) | 48/1,200 [ 4%] | 0 |
2650,45619( 1) | 1.57 | gc buffer busy acquire | 1.57 | USER1 | JDBC Thin Client | 28/1,200 [ 2%] | 0 |
1906,10651( 1) | 1.53 | gc buffer busy acquire | 1.53 | USER1 | JDBC Thin Client | 20/1,200 [ 2%] | 0 |
诊断脚本
当然可以使用ASH,DASH,根据sample_time或snap_id再结合blocking_session和blocking_inst_id手动递归的查等待链条,Top 1 Event并不一定是问题的根源,而是现象。
使用 Tanel’s ash_wait_chains.sql
@ash_wait_chains <grouping_cols> <filters> <fromtime> <totime> @ash_wait_chains username||':'||program2||event2 session_type='FOREGROUND' "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')" @ash_wait_chains username||':'||event 1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')" @ash_wait_chains instance_number||':'||session_id 1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"
通过ASH
-- ASH TEMP table create table system.ash_20250307 as select * from gv$active_session_history where sample_time between to_date('20250307 13:40','yyyymmdd hh24:mi') and to_date('20250307 13:50','yyyymmdd hh24:mi'); -- Diag wait chains with t as (select * from system.ash_20250307), t2 as (select sample_time ts , level lv, session_id, session_serial#, program, event, connect_by_iscycle iscycle, connect_by_isleaf leaf, LEVEL, SYS_CONNECT_BY_PATH('[' || program || ']' || '(' || nvl(event, session_state) || ')','->') wait_chain from t connect by NOCYCLE prior blocking_session = session_id and prior blocking_session_serial# = session_serial# and prior blocking_inst_id = inst_id and prior to_char(sample_time, 'hh24miss') = to_char(sample_time, 'hh24miss') start with session_state = 'WAITING') select wait_chain, count(*), max(iscycle) iscycle, TO_CHAR(MIN(ts), 'YYYY-MM-DD HH24:MI:SS') first_seen , TO_CHAR(MAX(ts), 'YYYY-MM-DD HH24:MI:SS') last_seen from t2 where leaf = 1 AND LV > 1 group by wait_chain order by count(*) desc;
Note:
返回的堵塞链条
->[proc1#SRV31](buffer busy waits)->[proc1#SRV21](buffer busy waits)->[proc2#svr31](log file sync)->[oracle@anbob2 (LGWR)](enq: CF – contention)->[rman@anbob1 (TNS V1-V3)](ON CPU)
目前可以看到最终是有RMAN进程备份时,持有的Control File 队列,堵塞了另一个节点的BG进程LGWR,又导致提前的FG进程logfile sync,和未提交FG进程的buffer busy waits。
buffer busy waits ===> log file sync ===> enq: CF - contention [lgwr] ===> RMAN
网上一个案例,因为应用调用sequence比较多,TOP 1 是enq: SQ – contention, 可能给人的错觉是seq出了问题,比如cache不足等,但那不是问题的根源。
@dash_wait_chains username||':'||event 1=1 1596 1596 %This SECONDS WAIT_CHAIN ------ ---------- ---------------------------------------------------------------------------------------------------------------------------------------------------- 70% 105023520 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:control file sequential read 9% 13698720 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:Disk file Mirror Read
另一案例
SQL> select * from ( 2 with sf_sql as 3 (select distinct sql_id, plan_hash_value from v$sql_plan where operation = 'TEMP TABLE TRANSFORMATION' 4 union 5 select distinct sql_id, plan_hash_value from dba_hist_sql_plan where operation = 'TEMP TABLE TRANSFORMATION') 6 select PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities') AS CF_ENQUEUE_REASON, count(*) 7 from v$active_session_history ash, sf_sql p 8 where EVENT in ('enq: CF - contention', 9 'control file sequential read') 10 and SESSION_STATE = 'WAITING' 11 and ash.SQL_ID = p.sql_id(+) 12 and ash.SQL_PLAN_HASH_VALUE = p.plan_hash_value(+) 13 group by PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities') 14 order by count(*) desc 15 ) where rownum <= 5 16 / PROGRAM EVENT SQL_ID CF_ENQUEUE_REASON COUNT(*) ------------------------------------------------ ----------------------------- ------------- ------------------------- ---------- JDBC Thin Client enq: CF - contention 0cufxa3h8bp1b TEMP TABLE TRANSFORMATION 1132 JDBC Thin Client enq: CF - contention 60tftk50vu96h TEMP TABLE TRANSFORMATION 266
检查control file sequential read等待分布
@event_hist control.*read session_id=1262 1596 1596 Wait time Num ASH Estimated Estimated % Event Estimated Wait Event bucket ms+ Samples Total Waits Total Sec Time Time Graph ------------------------------ ----------- ---------- ----------- ------------ ---------- ------------ control file sequential read < 1 17 66149.3 66.1 88.3 |######### | < 16 2 171.8 2.7 3.6 | | < 32 3 128.2 4.1 5.5 |# | < 128 1 15.6 2.0 2.7 | |
关于 ‘enq: CF – contention’的等待可以去我另1个之前另一篇《 Troubleshooting performance event ‘enq: CF – contention’》,CF lock通常发生在:
— checkpointing
— switching logfiles
— archiving redologs
— performing crash recovery
— logfile manipulation
— begin/end hot backup
— DML access for NOLOGGING objects
持有CF lock的常见进程
SQL> select p.program, s.event from v$lock l, v$session s, v$process p where l.sid = s.sid and s.paddr = p.addr and l.type='CF' and l.lmode >= 5; oracle@hostname.ru (CKPT) control file sequential read oracle@hostname.ru (ARC2) control file sequential read oracle@hostname.ru (LGWR) control file sequential read
这个案例继续可以排查control file 大小,和当时的备份进程状态,如是否因为RMAN备份磁带问题而挂起,是否当时ASM层有问题? 是否控制文件读取较慢,尝试asmcmd cp CF文件。
目前这篇文章还没有评论(Rss)