Troubleshooting Oracle RAC high active sessions wait log file sync &buffer busy waits & gc xx events due to Interconnect latency
最近有家半导体客户的oracle RAC间隔性出现了2次活动会话高负载的堵塞现象,找到我协助分析, 当时的主要活动会等待是buffer busy wait, enq: tx … 的hot object现象,同时有较高的log file sync 单次延时,当分析数据库性能问题时,从top event入手没错,但是也要防止被带偏,也可能是因为其它的影响,简单记录这个事件。
AWR profile
Per Second | Per Transaction | Per Exec | Per Call | |
---|---|---|---|---|
DB Time(s): | 215.8 | 1.1 | 0.08 | 0.02 |
DB CPU(s): | 4.1 | 0.0 | 0.00 | 0.00 |
Redo size (bytes): | 3,217,300.8 | 15,586.6 | ||
Logical read (blocks): | 219,035.8 | 1,061.2 | ||
Block changes: | 12,901.5 | 62.5 | ||
Physical read (blocks): | 6,808.4 | 33.0 | ||
Physical write (blocks): | 679.3 | 3.3 | ||
Read IO requests: | 97.5 | 0.5 | ||
Write IO requests: | 116.5 | 0.6 | ||
Read IO (MB): | 53.2 | 0.3 | ||
Write IO (MB): | 5.3 | 0.0 | ||
Global Cache blocks received: | 6.8 | 0.0 | ||
Global Cache blocks served: | 211.6 | 1.0 | ||
User calls: | 10,975.3 | 53.2 | ||
Parses (SQL): | 4,106.6 | 19.9 | ||
Hard parses (SQL): | 0.4 | 0.0 | ||
SQL Work Area (MB): | 26.3 | 0.1 | ||
Logons: | 4.9 | 0.0 | ||
Executes (SQL): | 2,758.7 | 13.4 | ||
Rollbacks: | 12.0 | 0.1 | ||
Transactions: | 206.4 |
Top 10 Foreground Events by Total Wait Time
Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
---|---|---|---|---|---|
buffer busy waits | 283,804 | 212.4K | 748 | 27.2 | Concurrency |
enq: TX – row lock contention | 377,781 | 177K | 469 | 22.7 | Application |
log file sync | 740,503 | 103.5K | 140 | 13.3 | Commit |
enq: TX – index contention | 176,726 | 84.6K | 479 | 10.9 | Concurrency |
rdbms ipc reply | 198,695 | 81.7K | 411 | 10.5 | Other |
enq: TA – contention | 1,260 | 23.3K | 18505 | 3.0 | Other |
enq: US – contention | 3,420 | 15.1K | 4421 | 1.9 | Other |
DB CPU | 14.9K | 1.9 | |||
KJC: Wait for msg sends to complete | 5,491 | 10.1K | 1848 | 1.3 | Other |
DFS lock handle | 21,161 | 9986.8 | 472 | 1.3 | Other |
Note:
主要的wait是buffer busy热块或enq : tx。。行锁, 及log file sync, 看上去像是高并发,但并不一定,注意log file sync wait avg 140ms, 上面的事务并没有多大量级。 想象一下 commit时会做哪个操作? 持久化或广播不是吗?
Background Wait Events
Event | Waits | %Time -outs | Total Wait Time (s) | Avg wait (ms) | Waits /txn | % bg time |
---|---|---|---|---|---|---|
DFS lock handle | 20,163 | 100 | 1,349 | 67 | 0.03 | 30.20 |
log file parallel write | 694,521 | 0 | 590 | 1 | 0.93 | 13.20 |
wait for scn ack | 727 | 0 | 441 | 607 | 0.00 | 9.88 |
gcs log flush sync | 35,982 | 10 | 182 | 5 | 0.05 | 4.07 |
buffer busy waits | 86 | 0 | 100 | 1164 | 0.00 | 2.24 |
enq: CO – master slave det | 1,053 | 100 | 90 | 85 | 0.00 | 2.01 |
Note:
log file parallel write正常范围,但scn ask 异常高。
DFS lock handle
SQL> select instance_number,p1,count(*) from anbob.ASH_DBA20240802 where sample_time between to_date('20240802 15:20','yyyymmdd hh24:mi') and to_date('20240802 15:50','yyyymmdd hh24:mi') and event='DFS lock handle' group by instance_number,p1 2 ; INSTANCE_NUMBER P1 COUNT(*) --------------- ---------- ---------- 2 1111621637 3 2 1146617861 4 2 1413545989 3 1 1111621637 127 1 1413545989 794 1 1146617861 389 SQL> select chr(bitand(&p1,-16777216)/16777215) || chr(bitand(&p1,16711680)/65535) type, mod(&p1, 16) md from dual ; 2 3 4 5 Enter value for p1: 1413545989 TYPE MD ---- ---------- TA 5 SQL> / Enter value for p1: 1146617861 TYPE MD ---- ---------- DX 5 SQL> select * from v$lock_type where type='DX'; TYPE NAME ID1_TAG ID2_TAG IS_USE DESCRIPTION ---------- ------------------------------ ---------- ---------- ------ ---------------------------------------- DX Distributed Transaction transactio 0 NO Serializes tightly coupled distributed t n entry # ransaction branches SQL> select * from v$lock_type where type='TA' 2 ; TYPE NAME ID1_TAG ID2_TAG IS_USE DESCRIPTION ---------- ------------------------------ ---------- ---------- ------ ---------------------------------------- TA Instance Undo operation undo segme NO Serializes operations on undo segments a nt # / oth nd undo tablespaces er
ASH wait chain
SQL> @mydash program2||event2 session_type='FOREGROUND' "to_date('20240802 15:28','yyyymmdd hh24:mi')" "to_date('20240802 15:36','yyyymmdd hh24:mi')" -- Display ASH Wait Chain Signatures script v0.6 BETA by Tanel Poder ( https://tanelpoder.com ) %This SECONDS AAS #Blkrs WAIT_CHAIN FIRST_SEEN LAST_SEEN ------ ---------- ---------- ------ ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -------------------------------------- -------------------------------------- 10% 39960 83.3 4 -> (pro1) rdbms ipc reply -> (GTXn) DFS lock handle 2024-08-02 15:29:21 2024-08-02 15:35:24 9% 39410 82.1 36 -> (pro2) buffer busy waits [data block] -> (pro2) gc current grant 2-way 2024-08-02 15:29:31 2024-08-02 15:35:24 4% 17430 36.3 27 -> (pro2) enq: TX - row lock contention [mode=6] -> (pro2) buffer busy waits [data block] -> (pro2) gc current grant 2-way 2024-08-02 15:29:31 2024-08-02 15:35:24 4% 15260 31.8 17 -> (pro3) enq: TX - row lock contention [mode=6] -> (pro3) gc current grant 2-way 2024-08-02 15:29:31 2024-08-02 15:35:24 3% 11770 24.5 1 -> (pro3) log file sync -> (LGWR) wait for scn ack 2024-08-02 15:30:34 2024-08-02 15:35:24 3% 11170 23.3 1 -> (pro2) log file sync -> (LGWR) wait for scn ack 2024-08-02 15:30:34 2024-08-02 15:35:24 2% 7430 15.5 1 -> (pro2) log file sync -> (LGWR) wait for scn ack -> [idle blocker 2,2458,1] 2024-08-02 15:29:21 2024-08-02 15:30:23 1% 6150 12.8 5 -> (pro2) enq: TX - index contention [mode=4] -> (pro2) enq: FB - contention [mode=6] 2024-08-02 15:30:12 2024-08-02 15:34:21 1% 5240 10.9 4 -> (pro3) enq: TX - row lock contention [mode=6] -> (pro3) buffer busy waits [data block] -> (pro3) gc current grant 2-way 2024-08-02 15:30:12 2024-08-02 15:33:07 1% 5220 10.9 1 -> (pro3) log file sync -> (LGWR) wait for scn ack -> [idle blocker 2,2458,1] 2024-08-02 15:29:21 2024-08-02 15:30:23 1% 4340 9 42 -> (extract) control file sequential read 2024-08-02 15:29:12 2024-08-02 15:35:59 1% 4320 9 3 -> (pro2) enq: TX - index contention [mode=4] -> (pro2) buffer busy waits [data block] -> (pro2) buffer busy waits [data block] -> (pro2) gc current grant 2-way
Other Instance Activity Stats
Statistic | Total | per Second | per Trans |
---|---|---|---|
redo write broadcast ack count | 55,254 | 15.29 | 0.07 |
redo write broadcast ack time | 2,555,084,270 | 707,047.46 | 3,425.38 |
redo write broadcast lgwr post count | 33 | 0.01 | 0.00 |
Host CPU
CPUs | Cores | Sockets | Load Average Begin | Load Average End | %User | %System | %WIO | %Idle |
---|---|---|---|---|---|---|---|---|
32 | 32 | 4 | 0.14 | 0.15 | 13.3 | 2.6 | 0.0 | 84.1 |
Global Cache Load Profile
Per Second | Per Transaction | |
---|---|---|
Global Cache blocks received: | 6.81 | 0.03 |
Global Cache blocks served: | 211.59 | 1.03 |
GCS/GES messages received: | 498.81 | 2.42 |
GCS/GES messages sent: | 487.26 | 2.36 |
DBWR Fusion writes: | 7.49 | 0.04 |
Estd Interconnect traffic (KB) | 1,939.76 |
Global Cache and Enqueue Services – Workload Characteristics
Avg global enqueue get time (ms): | 49.2 |
Avg global cache cr block receive time (ms): | 551.0 |
Avg global cache current block receive time (ms): | 157.4 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 2.6 |
Avg global cache cr block flush time (ms): | 15.8 |
Avg global cache current block pin time (ms): | 0.2 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 0.0 |
Avg global cache current block flush time (ms): | 30.6 |
Global Cache and Enqueue Services – Messaging Statistics
Avg message sent queue time (ms): | 0.1 |
Avg message sent queue time on ksxp (ms): | 91.1 |
Avg message received queue time (ms): | 0.0 |
Avg GCS message process time (ms): | 0.0 |
Avg GES message process time (ms): | 0.0 |
% of direct sent messages: | 67.04 |
% of indirect sent messages: | 31.78 |
% of flow controlled messages: | 1.17 |
Global Messaging Statistics
Statistic | Total | per Second | per Trans |
---|---|---|---|
acks for commit broadcast(actual) | 43,823 | 12.13 | 0.06 |
acks for commit broadcast(logical) | 48,723 | 13.48 | 0.07 |
broadcast msgs on commit(actual) | 662,139 | 183.23 | 0.89 |
broadcast msgs on commit(logical) | 688,529 | 190.53 | 0.92 |
broadcast msgs on commit(wasted) | 21,462 | 5.94 | 0.03 |
dynamically allocated gcs resources | 0 | 0.00 | 0.00 |
Interconnect Device Statistics
Device Name | IP Address | Public | Source | Send Mbytes/sec | Send Errors | Send Dropped | Send Buffer Overrun | Send Carrier Lost | Receive Mbytes/sec | Receive Errors | Receive Dropped | Receive Buffer Overrun | Receive Frame Errors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lan901:801 | 169.254.118.232 | NO | 0.00 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 |
Interconnect Ping Latency Stats –问题时间点
Target Instance | 500B Ping Count | Avg Latency 500B msg | Stddev 500B msg | 8K Ping Count | Avg Latency 8K msg | Stddev 8K msg |
---|---|---|---|---|---|---|
1 | 288 | 0.15 | 0.04 | 288 | 0.14 | 0.04 |
2 | 287 | 456.64 | 1,664.70 | 287 | 458.05 | 1,666.25 |
Interconnect Ping Latency Stats –其他正常时间点
Target Instance | 500B Ping Count | Avg Latency 500B msg | Stddev 500B msg | 8K Ping Count | Avg Latency 8K msg | Stddev 8K msg |
---|---|---|---|---|---|---|
1 | 285 | 0.15 | 0.04 | 285 | 0.14 | 0.04 |
2 | 285 | 0.27 | 0.23 | 285 | 0.39 | 0.26 |
小结:
TOP event 基本是被跨实例调用相关的等待,如DFS LOCK和GC ,而log file sync是等scn ask也是私网广播。而RAC的统计信息发现网络延迟过高,是RAC无法接受的,因为该环境无OSW, 但目前的数据基本怀疑是interconnect 延时过高导致。
如果您有相关问题 ,联系www.anbob.com 首页的联系方式
对不起,这篇文章暂时关闭评论。