首页 » ORACLE 9i-23ai » Troubleshooting Oracle Wait Chains ‘buffer busy wait’、’log file sync’、 ‘enq: CF – contention’ during RMAN backup

Troubleshooting Oracle Wait Chains ‘buffer busy wait’、’log file sync’、 ‘enq: CF – contention’ during RMAN backup

最近一个客户的数据库出现了短暂1分钟左右的高负载,环境为oracle 11.2.0.4 2nodes RAC , 看当时的top event主要是buffer busy wait和log file sync, 关于buffer busy wait 在Troubleshooting Oracle wait event ‘buffer busy wait’ 有记录过分析方法,是一种hot block现象, 而log file sync又是提交类的,当两者同时发生时,可能是后者影响前者,首先建议排查I/O, 而另1个节点此时正在做RMAN增量备份,简单记录该问题.

ASH 报告

— Instance 2

Top User Events

Event Event Class % Event Avg Active Sessions
buffer busy waits Concurrency 32.78 11.27
log file sync Commit 27.64 9.50
CPU + Wait for CPU CPU 27.22 9.36
db file sequential read User I/O 3.12 1.07
enq: TX – row lock contention Application 1.87 0.64

Top Background Events

Event Event Class % Activity Avg Active Sessions
Streams AQ: qmn coordinator waiting for slave to start Other 1.12 0.39

Top Event P1/P2/P3 Values

Event % Event P1 Value, P2 Value, P3 Value % Activity Parameter 1 Parameter 2 Parameter 3
buffer busy waits 32.78 “1”,”59943″,”1″ 3.01 file# block# class#
“1”,”62769″,”1″ 2.73
“1”,”228″,”1″ 2.73
log file sync 27.66 “4294967295”,”3388657110″,”0″ 0.15 buffer# sync scn NOT DEFINED

Top Blocking Sessions

Blocking Sid (Inst) % Activity Event Caused % Event User Program # Samples Active XIDs
4348, 1( 2) 27.98 log file sync 27.66 SYS oracle@anbob2 (LGWR) 248/1,200 [ 21%] 0
4479,20405( 2) 2.09 buffer busy waits 2.09 USER1 proc1#svr31 14/1,200 [ 1%] 1
1148,53443( 2) 1.66 buffer busy waits 1.66 USER1 proc1#svr25 18/1,200 [ 2%] 0
5868,21505( 2) 1.50 buffer busy waits 1.50 USER1 proc1#svr26 21/1,200 [ 2%] 0
3663, 7499( 2) 1.41 buffer busy waits 1.41 USER1 proc1#svr29 17/1,200 [ 1%] 1

— Instance 1

Top User Events

Event Event Class % Event Avg Active Sessions
CPU + Wait for CPU CPU 6.67 0.26
gc buffer busy acquire Cluster 3.18 0.12
gc cr block busy Cluster 1.18 0.05

Top Background Events

Event Event Class % Activity Avg Active Sessions
Backup: MML write backup piece Administrative 62.90 2.40
CPU + Wait for CPU CPU 12.93 0.49
enq: CF – contention Other 2.62 0.10
RMAN backup & recovery I/O System I/O 1.94 0.07
ASM file metadata operation Other 1.35 0.05

Top Event P1/P2/P3 Values

Event % Event P1 Value, P2 Value, P3 Value % Activity Parameter 1 Parameter 2 Parameter 3
gc buffer busy acquire 3.18 “22”,”177″,”1″ 1.55 file# block# class#
“22”,”174″,”1″ 1.53
enq: CF – contention 2.64 “1128660996”,”0″,”0″ 2.62 name|mode 0 operation
RMAN backup & recovery I/O 1.94 “1”,”256″,”2147483647″ 1.90 count intr timeout
ASM file metadata operation 1.55 “6268”,”9″,”0″ 0.78 msgop locn NOT DEFINED
log file sequential read 1.33 “1”,”1″,”1″ 0.15 log# block# blocks

Top Blocking Sessions

Blocking Sid (Inst) % Activity Event Caused % Event User Program # Samples Active XIDs
5307,19313( 1) 2.62 enq: CF – contention 2.62 SYS rman@anbob1(TNS V1-V3) 48/1,200 [ 4%] 0
2650,45619( 1) 1.57 gc buffer busy acquire 1.57 USER1 JDBC Thin Client 28/1,200 [ 2%] 0
1906,10651( 1) 1.53 gc buffer busy acquire 1.53 USER1 JDBC Thin Client 20/1,200 [ 2%] 0

诊断脚本

当然可以使用ASH,DASH,根据sample_time或snap_id再结合blocking_session和blocking_inst_id手动递归的查等待链条,Top 1 Event并不一定是问题的根源,而是现象。

使用  Tanel’s ash_wait_chains.sql

@ash_wait_chains <grouping_cols> <filters> <fromtime> <totime>

@ash_wait_chains username||':'||program2||event2 session_type='FOREGROUND'  "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"

@ash_wait_chains username||':'||event  1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"

@ash_wait_chains instance_number||':'||session_id  1=1 "to_date('20250307 13:40','yyyymmdd hh24:mi')" "to_date('20250307 13:50','yyyymmdd hh24:mi')"

通过ASH

-- ASH TEMP table
create table system.ash_20250307  as
select * from gv$active_session_history 
where sample_time between to_date('20250307 13:40','yyyymmdd hh24:mi') and to_date('20250307 13:50','yyyymmdd hh24:mi'); 

-- Diag wait chains
with t as
    (select * from system.ash_20250307),
   t2 as
    (select 
	 sample_time ts
      , level lv,
            session_id,
            session_serial#,
            program,
            event,
            connect_by_iscycle iscycle,
            connect_by_isleaf leaf,
            LEVEL,
            SYS_CONNECT_BY_PATH('[' || program || ']' || '(' || nvl(event, session_state) || ')','->') wait_chain
       from t
     connect by NOCYCLE prior blocking_session = session_id
            and prior blocking_session_serial# = session_serial#
			and prior blocking_inst_id = inst_id
            and prior to_char(sample_time, 'hh24miss') =
                 to_char(sample_time, 'hh24miss')
      start with session_state = 'WAITING')
   select wait_chain, count(*), max(iscycle) iscycle, TO_CHAR(MIN(ts), 'YYYY-MM-DD HH24:MI:SS') first_seen
      , TO_CHAR(MAX(ts), 'YYYY-MM-DD HH24:MI:SS') last_seen
     from t2
    where leaf = 1   AND LV > 1
    group by wait_chain
    order by count(*) desc;

Note:

返回的堵塞链条

->[proc1#SRV31](buffer busy waits)->[proc1#SRV21](buffer busy waits)->[proc2#svr31](log file sync)->[oracle@anbob2 (LGWR)](enq: CF – contention)->[rman@anbob1 (TNS V1-V3)](ON CPU)

目前可以看到最终是有RMAN进程备份时,持有的Control File 队列,堵塞了另一个节点的BG进程LGWR,又导致提前的FG进程logfile sync,和未提交FG进程的buffer busy waits。

buffer busy waits
   ===>
      log file sync
                  ===>
                     enq: CF - contention [lgwr]
                                         ===>
                                             RMAN

网上一个案例,因为应用调用sequence比较多,TOP 1 是enq: SQ – contention, 可能给人的错觉是seq出了问题,比如cache不足等,但那不是问题的根源。

@dash_wait_chains username||':'||event  1=1 1596 1596

%This     SECONDS WAIT_CHAIN
------ ---------- ----------------------------------------------------------------------------------------------------------------------------------------------------
  70%   105023520 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:control file sequential read
   9%    13698720 -> SYS:enq: SQ - contention -> SYS:row cache lock -> SYS:log file switch (archiving needed) -> SYS:Disk file Mirror Read

另一案例

SQL> select * from (
2  with sf_sql as
3   (select distinct sql_id, plan_hash_value from v$sql_plan where operation = 'TEMP TABLE TRANSFORMATION'
4    union
5    select distinct sql_id, plan_hash_value from dba_hist_sql_plan where operation = 'TEMP TABLE TRANSFORMATION')
6  select PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities') AS CF_ENQUEUE_REASON, count(*)
7    from v$active_session_history ash, sf_sql p
8   where EVENT in ('enq: CF - contention',
9                   'control file sequential read')
10     and SESSION_STATE = 'WAITING'
11     and ash.SQL_ID = p.sql_id(+)
12     and ash.SQL_PLAN_HASH_VALUE = p.plan_hash_value(+)
13   group by PROGRAM, EVENT, ash.sql_id, nvl2(p.sql_id, 'TEMP TABLE TRANSFORMATION', 'Other CF activities')
14   order by count(*) desc
15  ) where rownum <= 5
16  /
 
PROGRAM                                          EVENT                         SQL_ID        CF_ENQUEUE_REASON           COUNT(*)
------------------------------------------------ ----------------------------- ------------- ------------------------- ----------
JDBC Thin Client                                 enq: CF - contention          0cufxa3h8bp1b TEMP TABLE TRANSFORMATION       1132
JDBC Thin Client                                 enq: CF - contention          60tftk50vu96h TEMP TABLE TRANSFORMATION        266

检查control file sequential read等待分布

@event_hist  control.*read session_id=1262 1596 1596
                                 Wait time    Num ASH   Estimated    Estimated    % Event  Estimated
Wait Event                      bucket ms+    Samples Total Waits    Total Sec       Time  Time Graph
------------------------------ ----------- ---------- ----------- ------------ ---------- ------------
control file sequential read           < 1         17     66149.3         66.1       88.3 |######### |
                                      < 16          2       171.8          2.7        3.6 |          |
                                      < 32          3       128.2          4.1        5.5 |#         |
                                     < 128          1        15.6          2.0        2.7 |          |

关于 ‘enq: CF – contention’的等待可以去我另1个之前另一篇《 Troubleshooting performance event ‘enq: CF – contention’》,CF lock通常发生在:

— checkpointing
— switching logfiles
— archiving redologs
— performing crash recovery
— logfile manipulation
— begin/end hot backup
— DML access for NOLOGGING objects

持有CF lock的常见进程

SQL> select p.program, s.event
 from v$lock l, v$session s, v$process p
 where l.sid = s.sid
 and s.paddr = p.addr
 and l.type='CF'
 and l.lmode >= 5;
 
oracle@hostname.ru (CKPT)    control file sequential read
oracle@hostname.ru (ARC2)    control file sequential read
oracle@hostname.ru (LGWR)    control file sequential read

这个案例继续可以排查control file 大小,和当时的备份进程状态,如是否因为RMAN备份磁带问题而挂起,是否当时ASM层有问题? 是否控制文件读取较慢,尝试asmcmd cp CF文件。

打赏

, ,

目前这篇文章还没有评论(Rss)

我要评论