Troubleshooting Wait Event: enq: HW – contention (Analysis Approach)
This is an 11g database. The front-line DBA reported that the number of active sessions was very high, so we asked for an AWR report. It showed that the top events on this instance were enq: HW – contention and enq: TX – row lock contention, while the other RAC node showed a large amount of direct path read.
DB Name | DB Id | Instance | Inst num | Startup Time | Release | RAC |
---|---|---|---|---|---|---|
ANBOB | 3391174999 | anbob1 | 1 | 03-May-20 18:05 | 11.2.0.4.0 | YES |
Host Name | Platform | CPUs | Cores | Sockets | Memory (GB) |
---|---|---|---|---|---|
anbob.com1 | Solaris[tm] OE (64-bit) | 512 | 64 | 4 | 1023.00 |
 | Snap Id | Snap Time | Sessions | Cursors/Session | Instances |
---|---|---|---|---|---|
Begin Snap: | 88631 | 03-Jun-20 20:00:17 | 549 | .6 | 2 |
End Snap: | 88633 | 03-Jun-20 21:00:36 | 2293 | 2.7 | 2 |
Elapsed: | 60.30 (mins) | ||||
DB Time: | 47,824.25 (mins) |
Report Summary
Load Profile
 | Per Second | Per Transaction | Per Exec | Per Call |
---|---|---|---|---|
DB Time(s): | 793.1 | 11.6 | 0.48 | 0.21 |
DB CPU(s): | 21.5 | 0.3 | 0.01 | 0.01 |
Redo size (bytes): | 397,274.5 | 5,831.3 | ||
Logical read (blocks): | 1,254,716.0 | 18,417.0 | ||
Block changes: | 2,235.9 | 32.8 | ||
Physical read (blocks): | 576.7 | 8.5 | ||
Physical write (blocks): | 107.4 | 1.6 | ||
Read IO requests: | 223.2 | 3.3 | ||
Write IO requests: | 79.5 | 1.2 | ||
Read IO (MB): | 4.5 | 0.1 | ||
Write IO (MB): | 0.8 | 0.0 | ||
Global Cache blocks received: | 290.0 | 4.3 | ||
Global Cache blocks served: | 205.6 | 3.0 | ||
User calls: | 3,830.6 | 56.2 | ||
Parses (SQL): | 1,433.6 | 21.0 | ||
Hard parses (SQL): | 4.1 | 0.1 | ||
SQL Work Area (MB): | 11.2 | 0.2 | ||
Logons: | 1.9 | 0.0 | ||
Executes (SQL): | 1,651.4 | 24.2 | ||
Rollbacks: | 22.1 | 0.3 | ||
Transactions: | 68.1 |
Instance Efficiency Percentages (Target 100%)
Buffer Nowait %: | 99.98 | Redo NoWait %: | 99.99 |
Buffer Hit %: | 99.98 | In-memory Sort %: | 100.00 |
Library Hit %: | 99.65 | Soft Parse %: | 99.71 |
Execute to Parse %: | 13.19 | Latch Hit %: | 98.14 |
Parse CPU to Parse Elapsd %: | 85.74 | % Non-Parse CPU: | 99.55 |
Top 10 Foreground Events by Total Wait Time
Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
---|---|---|---|---|---|
enq: HW – contention | 1,409 | 773.6K | 549069 | 27.0 | Configuration |
enq: TX – row lock contention | 4,853 | 600.8K | 123792 | 20.9 | Application |
row cache lock | 183,475 | 207.7K | 1132 | 7.2 | Concurrency |
gc buffer busy acquire | 386,790 | 169.4K | 438 | 5.9 | Cluster |
db file sequential read | 737,067 | 161.2K | 219 | 5.6 | User I/O |
gc buffer busy release | 178,230 | 139.9K | 785 | 4.9 | Cluster |
read by other session | 185,264 | 123.8K | 668 | 4.3 | User I/O |
enq: US – contention | 87,502 | 83.3K | 951 | 2.9 | Other |
DB CPU | 77.7K | 2.7 | |||
log file sync | 181,402 | 73.7K | 406 | 2.6 | Commit |
From the above, within the one-hour window the session count grew from 549 to 2,293. DB Time per second is 793, far above the 512 CPUs, so the system load is very high. Logical reads are about 1.25 million per second, DB IOPS only around 300, I/O throughput just a few MB/s, and the transaction and SQL execution rates are not particularly high either. % Non-Parse CPU is above 99%, so on this instance the load is not driven by parsing or transaction volume.
Looking at the top events, enq: HW – contention and enq: TX – row lock contention account for about 50% of foreground DB time, with average wait times so large that they are clearly distorted; even the I/O events further down average around 200 ms.
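As a quick cross-check of the AWR picture, the current non-idle waits can be sampled directly from gv$session; a minimal sketch, assuming a DBA account on either node:

-- current non-idle waits per instance, to confirm the AWR top events
SELECT inst_id, event, COUNT(*) AS sessions
  FROM gv$session
 WHERE status = 'ACTIVE'
   AND wait_class <> 'Idle'
 GROUP BY inst_id, event
 ORDER BY sessions DESC;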
Operating System Statistics – Detail
Snap Time | Load | %busy | %user | %sys | %idle | %iowait |
---|---|---|---|---|---|---|
03-Jun 20:00:17 | 13.59 | |||||
03-Jun 20:30:40 | 6.61 | 7.23 | 6.26 | 0.97 | 92.77 | 0.00 |
03-Jun 21:00:36 | 13.77 | 2.99 | 2.11 | 0.88 | 97.01 | 0.00 |
Foreground Wait Class
- s – second, ms – millisecond – 1000th of a second
- ordered by wait time desc, waits desc
- %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0
- Captured Time accounts for 87.1% of Total DB time 2,869,455.22 (s)
- Total FG Wait Time: 2,421,211.45 (s) DB CPU time: 77,748.59 (s)
Wait Class | Waits | %Time -outs | Total Wait Time (s) | Avg wait (ms) | %DB time |
---|---|---|---|---|---|
Configuration | 4,316 | 34 | 777,663 | 180181 | 27.10 |
Application | 16,739 | 0 | 600,782 | 35891 | 20.94 |
Cluster | 1,900,124 | 4 | 340,004 | 179 | 11.85 |
User I/O | 1,148,504 | 0 | 296,017 | 258 | 10.32 |
Concurrency | 983,527 | 6 | 245,365 | 249 | 8.55 |
Other | 414,877 | 70 | 83,754 | 202 | 2.92 |
DB CPU | 77,749 | 2.71 | |||
Commit | 181,402 | 0 | 73,654 | 406 | 2.57 |
System I/O | 32,000 | 0 | 3,941 | 123 | 0.14 |
Network | 9,675,267 | 0 | 32 | 0 | 0.00 |
The database load (AAS) is high while CPU usage is not; the largest contributor is the Configuration wait class, i.e. enq: HW – contention.
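To see how extreme the individual waits are, the wait-time distribution of the HW enqueue can be read from the event histogram views; a sketch:

SELECT inst_id, event, wait_time_milli, wait_count
  FROM gv$event_histogram
 WHERE event = 'enq: HW - contention'
 ORDER BY inst_id, wait_time_milli;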
Background Wait Events
- ordered by wait time desc, waits desc (idle events last)
- Only events with Total Wait Time (s) >= .001 are shown
- %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0
Event | Waits | %Time -outs | Total Wait Time (s) | Avg wait (ms) | Waits /txn | % bg time |
---|---|---|---|---|---|---|
db file parallel write | 223,056 | 0 | 18,135 | 81 | 0.90 | 39.93 |
gcs log flush sync | 560,810 | 45 | 8,156 | 15 | 2.28 | 17.96 |
control file sequential read | 15,802 | 0 | 3,755 | 238 | 0.06 | 8.27 |
log file parallel write | 276,181 | 0 | 1,470 | 5 | 1.12 | 3.24 |
log file sync | 488 | 0 | 1,302 | 2668 | 0.00 | 2.87 |
db file sequential read | 3,224 | 0 | 1,160 | 360 | 0.01 | 2.55 |
control file parallel write | 4,928 | 0 | 594 | 121 | 0.02 | 1.31 |
Background I/O against the data files and control files is not good either.
Global Cache Load Profile
 | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 290.03 | 4.26 |
Global Cache blocks served: | 205.60 | 3.02 |
GCS/GES messages received: | 655.58 | 9.62 |
GCS/GES messages sent: | 959.28 | 14.08 |
DBWR Fusion writes: | 13.02 | 0.19 |
Estd Interconnect traffic (KB) | 4,280.46 |
Global Cache and Enqueue Services – Workload Characteristics
Avg global enqueue get time (ms): | 47.8 |
Avg global cache cr block receive time (ms): | 79.1 |
Avg global cache current block receive time (ms): | 18.6 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 16.1 |
Avg global cache cr block flush time (ms): | 349.8 |
Avg global cache current block pin time (ms): | 0.9 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 1.3 |
Avg global cache current block flush time (ms): | 1,395.7 |
Global Cache and Enqueue Services – Messaging Statistics
Avg message sent queue time (ms): | 0.1 |
Avg message sent queue time on ksxp (ms): | 0.6 |
Avg message received queue time (ms): | 0.0 |
Avg GCS message process time (ms): | 0.1 |
Avg GES message process time (ms): | 0.0 |
% of direct sent messages: | 45.61 |
% of indirect sent messages: | 52.52 |
% of flow controlled messages: | 1.86 |
Global CURRENT Served Stats
- Pins = CURRENT Block Pin Operations
- Flushes = Redo Flush before CURRENT Block Served Operations
- Writes = CURRENT Block Fusion Write Operations
Statistic | Total | % <1ms | % <10ms | % <100ms | % <1s | % <10s |
---|---|---|---|---|---|---|
Pins | 1,912 | 47.59 | 34.21 | 7.79 | 4.55 | 5.86 |
Flushes | 6,047 | 29.67 | 37.95 | 1.03 | 2.40 | 28.96 |
Writes | 48,372 | 36.00 | 56.30 | 0.61 | 1.56 | 5.53 |
Note:
Interconnect bandwidth usage is only about 4 MB/s, yet the cr/current block receive times are not ideal and the cr/current block flush times show a serious problem. "Flush" means that before a dirty block can be shipped out by the LMS process, its related redo must first have been flushed to disk by LGWR. For this case, we found the INSERT statement in "SQL ordered by Elapsed Time", and the target table does indeed contain a LOB column. During the same period, the direct path read on the other instance came from a full table scan of a table tens of GB in size.
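For reference, the same top SQL can be pulled from the AWR repository with a query like the sketch below, using the snapshot ids from the report header (88631 to 88633) and instance 1:

SELECT *
  FROM (SELECT sql_id,
               ROUND(SUM(elapsed_time_delta) / 1e6) AS elapsed_seconds,
               SUM(executions_delta)                AS executions
          FROM dba_hist_sqlstat
         WHERE snap_id > 88631
           AND snap_id <= 88633
           AND instance_number = 1
         GROUP BY sql_id
         ORDER BY 2 DESC)
 WHERE ROWNUM <= 10;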
What is an enqueue?
Enqueues are shared memory structures (locks) that serialize access to database resources. They can be associated with a session or transaction.
What is high water mark (HW)
The boundary between used and unused space in a segment.
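As a side note, DBMS_SPACE.UNUSED_SPACE reports how many blocks of a segment sit above the high water mark; a minimal sketch, assuming a hypothetical table ANBOB.TABXXX:

DECLARE
  l_total_blocks  NUMBER;  l_total_bytes  NUMBER;
  l_unused_blocks NUMBER;  l_unused_bytes NUMBER;
  l_lu_file_id    NUMBER;  l_lu_block_id  NUMBER;  l_last_used NUMBER;
BEGIN
  DBMS_SPACE.UNUSED_SPACE(
    segment_owner             => 'ANBOB',   -- hypothetical owner
    segment_name              => 'TABXXX',  -- hypothetical segment
    segment_type              => 'TABLE',
    total_blocks              => l_total_blocks,
    total_bytes               => l_total_bytes,
    unused_blocks             => l_unused_blocks,  -- blocks above the HWM
    unused_bytes              => l_unused_bytes,
    last_used_extent_file_id  => l_lu_file_id,
    last_used_extent_block_id => l_lu_block_id,
    last_used_block           => l_last_used);
  DBMS_OUTPUT.PUT_LINE('total blocks='||l_total_blocks||', blocks above HWM='||l_unused_blocks);
END;
/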
enq: HW – contention
In Oracle, the High-water Mark (HWM) of a segment is a pointer to the data block up to which blocks are formatted and available for inserting new data. If data is inserted at a high rate, new blocks may have to be made available when a search of the freelists or L1 bitmap blocks fails to return any space. This involves formatting the blocks, inserting them into a segment header or bitmap block, and pushing up the HWM. This has to happen while holding the HW enqueue and acquiring exclusive access to the new blocks. Normally, a batch of blocks with contiguous block addresses is "newed" and the HW enqueue is not released until the end of the operation.
Sessions may therefore wait both on the HW enqueue and on current access to the new blocks: the former is a consequence of the serialization on the High-water Mark enqueue (HW), and the latter relates to the fact that current access to the new data blocks is required for the "newing" operation. In a RAC environment, the length of this space management operation is proportional to the time it takes to acquire the HW enqueue and the time it takes to acquire global locks for all the new blocks. This time is small under normal circumstances because there will never be any access conflict for the new blocks.
Waits on the High Water enqueue occur when allocating space beyond the high water mark of a segment during insert operations. This wait also occurs when reclaiming free space in LOB segments after delete or update operations: when LOB data is deleted or updated, the space it used is not immediately freed, so that consistent reads can still be satisfied. When more space is needed, the HW enqueue is acquired to reclaim the free space left by previous deletes or updates. The PCTVERSION or RETENTION settings on the LOB segment control how frequently that space is reclaimed.
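When a LOB segment is the suspect, its current PCTVERSION/RETENTION and storage settings can be checked in dba_lobs; a sketch (the schema name is just a placeholder):

SELECT owner, table_name, column_name, segment_name,
       chunk, pctversion, retention, cache, securefile
  FROM dba_lobs
 WHERE owner = 'ANBOB';   -- placeholder schema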
To find the root cause of the problem, determine which segment is growing fast, either by extrapolating from the rates at which data is inserted (V$SQLAREA) or by capturing the id1 and id2 of the HW enqueue being waited for: id1 is the tablespace number in which the segment resides and id2 is the block address of the segment header.
Parameters:
- P1 = name|mode
- P2 = tablespace#
- P3 = block: the block number if the tablespace is a BIGFILE tablespace, otherwise an RDBA (see below)
You can get historical samples of the event from dba_hist_active_sess_history:
SELECT event, p1, p2, COUNT(*)
  FROM dba_hist_active_sess_history
 WHERE event_id = 1645217925   -- enq: HW - contention
 GROUP BY event, p1, p2;
tablespace#
This is the tablespace id for the segment we want an HW lock on.
You can find the tablespace name using P2 thus:
SELECT * FROM dba_tablespaces WHERE tablespace_name = ( select name from sys.ts$ where ts#=&P2 ) ;
block (RDBA)
This parameter depends on whether the tablespace is a BIGFILE tablespace or not:
If the tablespace is BIGFILE then P3 is the block number in the file for that tablespace.
If the tablespace is NOT a BIGFILE tablespace then P3 is a relative DBA (RDBA) rather than a block#. The value encodes the relative file# and block# of the segment header that the HW lock is requested for in a 4-byte value. You can convert P3 to a relative file# and block# using DBMS_UTILITY:
SELECT DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE(rdba) RELFILE#, DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK(rdba) BLOCK# FROM (select &P3 rdba from dual) ;
Now we can use this information to find which segment it belongs to; in this example the relative file# is 14 and the block# is 266:
SELECT owner, segment_type, segment_name FROM dba_extents WHERE file_id = 14 AND 266 BETWEEN block_id AND block_id + blocks - 1;
If it is a LOB segment, we can find out which table and column it belongs to by checking dba_lobs:
SELECT owner, table_name, column_name, chunk, securefile FROM dba_lobs WHERE segment_name = 'SYS_LOB00000xxxx'
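If it is a LOB segment, DBMS_SPACE.SPACE_USAGE can also show how its ASSM space bitmap looks (how many blocks are full versus reclaimable); a sketch, with the segment name kept as the placeholder from the query above:

DECLARE
  l_unf NUMBER;  l_unf_b NUMBER;
  l_fs1 NUMBER;  l_fs1_b NUMBER;
  l_fs2 NUMBER;  l_fs2_b NUMBER;
  l_fs3 NUMBER;  l_fs3_b NUMBER;
  l_fs4 NUMBER;  l_fs4_b NUMBER;
  l_full NUMBER; l_full_b NUMBER;
BEGIN
  DBMS_SPACE.SPACE_USAGE(
    segment_owner      => 'ANBOB',             -- placeholder owner
    segment_name       => 'SYS_LOB00000xxxx',  -- placeholder LOB segment name
    segment_type       => 'LOB',
    unformatted_blocks => l_unf,  unformatted_bytes => l_unf_b,
    fs1_blocks => l_fs1, fs1_bytes => l_fs1_b,
    fs2_blocks => l_fs2, fs2_bytes => l_fs2_b,
    fs3_blocks => l_fs3, fs3_bytes => l_fs3_b,
    fs4_blocks => l_fs4, fs4_bytes => l_fs4_b,
    full_blocks => l_full, full_bytes => l_full_b);
  DBMS_OUTPUT.PUT_LINE('unformatted='||l_unf||' full='||l_full||
                       ' fs1='||l_fs1||' fs2='||l_fs2||' fs3='||l_fs3||' fs4='||l_fs4);
END;
/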
Finding Blockers:
You can find current blockers by querying GV$LOCK like this:
SELECT DISTINCT w.hw, w.p2 object_id, l.inst_id, l.sid, l.lmode, l.request
  FROM (SELECT p2, p3,
               'HW-'||SUBSTR(p2raw,-8)||'-'||LPAD(p3,8,'0') hw
          FROM v$session_wait
         WHERE event = 'enq: HW - contention'
           AND state = 'WAITING') w,
       gv$lock l
 WHERE l.type(+) = 'HW'
   AND l.id1(+)  = w.p2
   AND l.id2(+)  = w.p3
 ORDER BY hw, lmode DESC, request DESC;
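As a follow-up sketch, the sessions holding the HW lock can be joined back to gv$session to see what they are executing:

SELECT s.inst_id, s.sid, s.serial#, s.sql_id, s.event, l.lmode
  FROM gv$session s, gv$lock l
 WHERE s.inst_id = l.inst_id
   AND s.sid     = l.sid
   AND l.type    = 'HW'
   AND l.lmode   > 0;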
There are 10 bugs listed.
NB | Prob | Bug | Fixed | Description
---|---|---|---|---
 | | 28067846 | 19.1 | System state did not dump lock elements which was considered busy
 | III | 27986817 | 12.1.0.2.190416, 19.1 | L1 Validation Waits With High Concurrent Workload
 | III | 20687474 | 12.2.0.1 | Excessive “enq: HW – contention” / “latch: cache buffers chains” inserting into very large non-partitioned table
 | II | 18221857 | 11.2.0.4.4.SAPEXADBBP, 11.2.0.4.BP15, 12.1.0.2, 12.2.0.1 | “enq: HW – contention” on large segments with lots of extents and high insert rate
 | IIII | 7319120 | | HW contention for BasicFile LOBs – special workaround
 | III | 13357632 | 11.2.0.4, 12.1.0.1 | High ST contention from parallel direct load & MERGE
 | III | 9801919 | 11.2.0.1.BP08, 11.2.0.2.10, 11.2.0.2.BP20, 11.2.0.3, 12.1.0.1 | “enq: HW – contention” against segments that add an extent frequently during high concurrency
D | III | 6413373 | 10.2.0.5, 11.1.0.7, 11.2.0.1 | Lob HW enqueue contention in RAC environments with ASSM space management
D | IIII | 6376915 | 10.2.0.4, 11.1.0.7, 11.2.0.1 | HW enqueue contention for ASSM LOB segments
 | II | 4867884 | 10.2.0.3, 11.1.0.6 | Lob HW lock contention with space reclaimation
The HW enqueue exists to serialize movement of a segment's high water mark and the reclamation of space in LOB segments. The root causes of HW enqueue contention vary, but ultimately the HW enqueue is only held when a segment's high water mark needs to be moved.
Common causes:
1. Frequent inserts combined with small uniform extents
2. Poor I/O performance
3. LOB space reclamation
4. LOB-related bugs
Common remedies:
1. Move the objects to an ASSM-managed tablespace.
2. Pre-allocate space manually with ALLOCATE EXTENT, or use a larger extent size (small uniform extents that have to be allocated frequently aggravate HW enqueue contention). For example:
alter table anbob.tabxxx allocate extent(size 1g);
3. Find the SQL with high physical reads, or look for storage performance problems.
4. If the LOB is updated frequently, consider raising LOB PCTVERSION to 20% to keep more space for old LOB versions, and consider partitioning the LOB or increasing the LOB segment chunk size (see the sketch after this list).
5. ASSM LOB space reclamation acquires only one block at a time, which causes HW contention. Apply the fix for Bug 6376915 to enable batch reclamation; with this fix, ASSM LOBs reclaim a minimum number of blocks based on the value of event 44951 (capped at 1024):
alter system set events '44951 trace name context forever, level 1024';
6. On 11g and later, migrate BasicFile LOBs to SecureFile LOBs, which perform better under concurrency (see the sketch after this list).
7. Partition the table, with a finer partition granularity, to spread the inserts across more segments.
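A hedged sketch of remedies 4 and 6 above, using the hypothetical table anbob.tabxxx and a hypothetical LOB column lob_col; substitute the real table and column identified from dba_lobs:

-- remedy 4: keep more old LOB versions so space is reclaimed less aggressively (BasicFile LOB)
alter table anbob.tabxxx modify lob (lob_col) (pctversion 20);

-- remedy 6: rebuild the BasicFile LOB as a SecureFile LOB (11g+, rebuilds the LOB segment)
alter table anbob.tabxxx move lob (lob_col) store as securefile (retention auto);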
How to debug
From Oracle version 10g and above, x$ksqst is externalized as v$enqueue_statistics.
However, these statistics are at instance level, so if we use them to measure HW enqueue activity we need to make sure that no other session is acquiring HW enqueues at the same time. Event 10704 can be used to trace enqueues; every call to get an enqueue prints a few lines in the trace file. The SQL statement to dump this information to the trace file is:
alter session set events '10704 trace name context forever, level 15';
Event 10704 is documented as below:
10704, 00000, "Print out information about what enqueues are being obtained"
// *Cause: When enabled, prints out arguments to calls to ksqcmi and
//         ksqlrl and the return values.
// *Action: Level indicates details:
//   Level: 1-4: print out basic info for ksqlrl, ksqcmi
//          5-9: also print out stuff in callbacks: ksqlac, ksqlop
//          10+: also print out time for each line
In the trace file, ksq is the internal Oracle module prefix for enqueues, and ksqgtl is the function that gets a lock on a resource. From the ksqgtl lines we can see that the HW enqueue is acquired in mode 6, i.e. exclusive mode. Timestamps are also printed because the event was enabled at level 15. To count the number of HW enqueue gets, count the occurrences of the "HW-" string in the trace file.
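A minimal end-to-end sketch of such a trace session, assuming an 11g+ session that can reproduce the inserts (the trace file location comes from v$diag_info):

alter session set events '10704 trace name context forever, level 15';

-- locate this session's trace file
SELECT value AS trace_file
  FROM v$diag_info
 WHERE name = 'Default Trace File';

-- ... run the insert workload that reproduces the HW waits ...

alter session set events '10704 trace name context off';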