latch: undo global data(ktudba: KSLBEGIN) wait event when insert value
前几日有11.2.0.3套库出现了性能问题,这里简单记录, 当时只是表现几个会话的同一条INSERT SQL语句出现了较高的latch: undo global data 等待,latch miss是 ktudba: KSLBEGIN ,同时还有BBS 等待。
SQL> @snapper ash 5 1 all Sampling SID all with interval 5 seconds, taking 1 snapshots... -- Session Snapper v4.22 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :) --------------------------------------------------------------------------------------------------------------- ActSes %Thread | INST | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS --------------------------------------------------------------------------------------------------------------- 44.00 (4400%) | 1 | 4rwaxk5gtg1t7 | 2 | latch: undo global data | Other 38.60 (3860%) | 1 | 4rwaxk5gtg1t7 | 0 | latch: undo global data | Other 14.00 (1400%) | 1 | 4rwaxk5gtg1t7 | 2 | buffer busy waits | Concurrency 12.40 (1240%) | 1 | 4rwaxk5gtg1t7 | 0 | buffer busy waits | Concurrency 6.00 (600%) | 1 | 4rwaxk5gtg1t7 | 2 | ON CPU | ON CPU 5.60 (560%) | 1 | 4dxyn82n0zspx | 0 | ON CPU | ON CPU 4.80 (480%) | 1 | 4rwaxk5gtg1t7 | 0 | ON CPU | ON CPU 3.40 (340%) | 1 | 4dxyn82n0zspx | 0 | latch: cache buffers chains | Concurrency 3.20 (320%) | 1 | | | ON CPU | ON CPU 1.40 (140%) | 1 | | | log file sync | Commit -- End of ASH snap 1, end=2016-06-28 10:14:13, seconds=6, samples_taken=5, AAS=177.6 SQL> @sqlid 4rwaxk5gtg1t7 % Show SQL text, child cursors and execution stats for SQLID 4rwaxk5gtg1t7 child % HASH_VALUE PLAN_HASH_VALUE CH# SQL_TEXT ---------- --------------- ---- ---------------------------------------------------------------------------- 1603766055 0 0 INSERT INTO T (REGION, TELNUM, ...... UDPTYPE) VALUES ( :REGION, :TELNUM, .... :UDPTYPE) CH# PARENT_HANDLE OBJECT_HANDLE PLAN_HASH PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED ROWS_PER_FETCH CPU_SEC CPU_SEC_EXEC ELA_SEC LIOS PIOS SORTS USERS_EXECUTING ---- ---------------- ---------------- ---------- ---------- ---------- ---------- ---------- -------------- -------------- ---------- ------------ ---------- ---------- ---------- ---------- --------------- 0 C0000014D7859A70 C0000014D786C268 0 58 26 12772 0 12772 9.32 .000729721 28.339461 246717 4523 0 0
TIP:
这里使用了Poder的脚本, 从AWR当时也是latch: undo global data 为数据库TOP 1 wait event, 查看了该语句就是一个insert values 简单语句,执行频率从AWR 得到的是半小时27000次,执行速度还可以,但是为什么会这么多的latch undo 呢?
LATCH: UNDO GLOBAL DATA
This latch serializes the access to the Undo (aka Rollback) segment information in the SGA.
Every time a session wants to know about the state of the Undo Segments, it has to get this latch.
该事件确认是与undo 相关,通常会伴有enq: US – contention 或 latch undo 的location是ktusm_stealext_2或ktusm_stealext: KSLBEGIN,对于stealext是因查找可用的segment 未发现offline 或过期的从未过期的窃取空间的调用, 那样增加UNDO表空间就可以得到缓解。
下面确认当时的情况是不是这种现象。
SQL> select * from v$latch_misses where SLEEP_COUNT>0 and parent_name like 'undo global data%' order by sleep_count;
PARENT_NAME WHERE NWFAIL_COUNT SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT LOCATION
---------------------- ------------------------ ------------ ----------- ------------- -------------- ----------------------
undo global data kturar: KSLBEGIN 0 1 11 0 kturar: KSLBEGIN
undo global data ktugnb: KSLBEGIN 0 1 4 0 ktugnb: KSLBEGIN
undo global data ktusmasp: ktugd_tuux 0 3 28 0 ktusmasp: ktugd_tuux
undo global data kturdc: KSLBEGIN 0 3 0 0 kturdc: KSLBEGIN
undo global data ktur set recov bit 0 4 26 0 ktur set recov bit
undo global data ktusmofxu_1: kslgetl 0 8 0 0 ktusmofxu_1: kslgetl
undo global data ktuswr_3 0 36 4 0 ktuswr_3
undo global data ktudax:child 0 36 16 0 ktudax:child
undo global data ktubnd:child 0 106 108 0 ktubnd:child
undo global data ktufrbs_2 0 571 1292 0 ktufrbs_2
undo global data ktudnx:child 0 1125 1044 0 ktudnx:child
undo global data ktucof: at start 0 2003 0 0 ktucof: at start
undo global data ktusmupst: KSLBEGIN 0 2118 9203 0 ktusmupst: KSLBEGIN
undo global data kturax 0 5107 0 0 kturax
undo global data ktudba: KSLBEGIN 0 43637217 43636922 0 ktudba: KSLBEGIN
NOTE:
这里latch: undo global data 的location是ktudba: KSLBEGIN 和上面所说的情况不太一致, 后来我也尝试增加了一些UNDO 结果一样问题没有什么缓解。 要先搞明白ktudba: KSLBEGIN是什么?
KSLBEGIN 是一种get latch时的非常快的宏,在取得latch后调用KSLEND;
ktudba 像是一种从usn 转换data block address的宏
@ Latch Miss Sources
@ ——————
@ undo global data ktudba: KSLBEGIN 0 117,271,003 115,396,045
@ .
@ .
@ ktu.h:
@ ——
@ #define ktudba(usn, dbap, scnp, arspp) \
@ ktuGetUsegDba((usn), (dbap), (scnp), (arspp), \
@ NULLP(ub2), NULLP(ksqn), NULLP(uword))
@ .
@ ktuGetUsegDba – Kernel Transaction Undo convert from usn to DBA
检查当时的数据库参数
SQL> @p undo NAME VALUE ---------------------------------------- ---------------------------------------- _lm_spare_undo 0 _flush_undo_after_tx_recovery TRUE _gc_undo_affinity FALSE _gc_undo_block_disk_reads TRUE _gc_tsn_undo_affinity TRUE _undo_block_compression TRUE undo_management AUTO undo_tablespace undotbs1 _collect_undo_stats TRUE _undo_debug_mode 0 _verify_undo_quota FALSE _in_memory_undo TRUE <<<<<<<<<<<<<<< _undo_autotune FALSE <<<<<<<<<<<<<<< _highthreshold_undoretention 4294967294 <<<<<<<<<<<<<<< undo_retention 7200 <<<<<<<<<<<<<<< _undo_debug_usage 0 _disable_undo_tablespace_alerts FALSE _smon_undo_seg_rescan_limit 10 _optimizer_undo_cost_change 11.2.0.3 _optimizer_undo_changes FALSE _enable_default_undo_threshold TRUE
Note:
As the “_undo_autotune” instance parameter is set to FALSE, the undo retention is not automatically calculated based on the instance activity
UNDO 问题通常在_undo_autotune= false 和_highthreshold_undoretention value large时问题会较明显, 下面检查undo segment的使用情况.
SQL> col status format a20 SQL> col cnt format 999,999,999 head "How Many?" SQL> SQL> select status, count(*) cnt from dba_rollback_segs 2 group by status 3 / STATUS How Many? -------------------- ------------ OFFLINE 2,053 ONLINE 1,126 SQL> select tablespace_name,status,sum(bytes)/1024/1024 mb from DBA_UNDO_EXTENTS group by tablespace_name,status; TABLESPACE_NAME STATUS MB ------------------------------ -------------------- ---------- UNDOTBS1 ACTIVE 34585.125 <<<<<<<<<<<<<< UNDOTBS1 EXPIRED 13843.75 UNDOTBS1 UNEXPIRED 12030.375 UNDOTBS2 ACTIVE 72.5625 UNDOTBS2 EXPIRED 3000.25 UNDOTBS2 UNEXPIRED 7369.5625 SQL> @a A-Script: Display active sessions... COUNT(*) SQL_ID STATE EVENT ---------- ------------- ------- ---------------------------------------------------------------- 63 4rwaxk5gtg1t7 WAITING latch: undo global data <<<<<<<<<<<< 14 4rwaxk5gtg1t7 ON CPU On CPU / runqueue 13 4rwaxk5gtg1t7 WAITING buffer busy waits 12 4dxyn82n0zspx WAITING latch: cache buffers chains 7 4dxyn82n0zspx ON CPU On CPU / runqueue 2 0cgk3a3kjpg8m WAITING db file sequential read 2 5upb68p4xd0h4 WAITING SQL*Net message from dblink 1 4rwaxk5gtg1t7 WAITING latch: cache buffers chains 1 5dt8ha6p0uuc4 WAITING read by other session
Note:
当时的actve undo seg 使用30多G引起注意,这里oltp类型库,平时都是些小事务,当时的业务量也不高, 查询undo是被哪个用了
SQL>select b.sid,--会话编号
b.SERIAL#,
b.USERNAME,
b.MACHINE,
b.sql_id,
a.START_TIME,--事务启动时间
a.USED_UBLK, --使用的UNDO块数
a.USED_UREC, --使用的UNDO记录条数
a.START_UBAFIL, --使用的UNDO文件号
a.START_UBABLK --使用的UNDO起始块号
from v$transaction a, v$session b
where a.ses_addr = b.saddr;
SID SERIAL# USERNAME MACHINE SQL_ID START_TIME USED_UBLK USED_UREC START_UBAFIL START_UBABLK
------ ---------- --------- ---------- ------------- -------------------- ---------- ---------- ------------ ------------
10881 55703 ANBOB kinjk1 gd29uxmzbmj2g 06/28/16 11:48:30 1 9 250 1504704
9065 60993 S111008 HBYDYK\WIN 4abmpbbmt8fv5 06/28/16 11:43:50 1 1 248 1573967
7626 52347 ANBOB qmweb81 06/28/16 10:48:00 1 1 245 84451
11357 46961 COMMON qdyya1 0jfw9hdzvvk9t 06/28/16 11:48:14 107 2416 586 6550
6731 20111 ANBOB qdtza1 0pymq63h93cr7 06/27/16 14:56:29 2290118 26084282 248 650766 <<<<<< 16101 36593 ANBOB qmweb54 06/28/16 09:45:03 1 1 250 620376 12821 24875 ANBOB qmyy31401 fd72s60zy0dk9 06/28/16 11:48:30 1 17 250 1504600 20855 2013 ANBOB qmyy31401 06/28/16 11:48:32 1 1 245 1249799 4705 53047 ANBOB kmyy17 0cgk3a3kjpg8m 06/28/16 11:48:32 2 49 248 700088 15569 32497 SONGZHZH HBYDYK\WIN 06/28/16 10:46:57 1 1 248 237094 10910 55127 SHILQ HBYDYK\WIN 06/28/16 11:07:14 1 1 250 1285765 2986 32313 ANBOB qmweb53 06/28/16 11:37:08 1 1 250 354863 50 rows selected. SQL> @usid 6731
USERNAME SID AUDSID OSUSER MACHINE PROGRAM SPID OPID CPID SQL_ID HASH_VALUE LASTCALL STATUS SADDR PADDR TADDR LOGON_TIME
----------------------- -------------- ----------- ---------------- ------------------ -------------------- -------------- ------ ------------------------ ------------- ----------- ---------- -------- ---------------- ---------------- ---------------- -----------------
ANBOB '6731,20111' 936346265 grid qdtza1 (TNS V1-V3) 11564 4478 17667 0pymq63h93cr7 3767644903 75139 ACTIVE C000001604A7DFB8 C0000016138BF9C0 C0000015C3F1EA20 20160627 14:56:29
SQL> @xi 0pymq63h93cr7 %
eXplain the execution plan for sqlid 0pymq63h93cr7 child %...
PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------
SQL_ID 0pymq63h93cr7, child number 0
DELETE FROM "ANBOB"."T" "A1"
Note:
现在已经明确是sid=6731的会话做了一个delete 全表的操作, 该表正是上面insert的表,这个事务目前使用了2290118个undo block,
当然后期我们确认了是个人用户通过DBLINK 执行的该表删除的误操作,确认后杀掉了该会话,估算了回滚剩余时间。上面的latch: undo的等待事件消失。
SET SERVEROUT ON SQL> declare 2 l_start number; 3 l_end number; 4 begin 5 select sum(ktuxesiz) into l_start from x$ktuxe where KTUXECFL ='DEAD'; 6 dbms_lock.sleep(60); 7 select sum(ktuxesiz) into l_end from x$ktuxe where KTUXECFL ='DEAD'; 8 dbms_output.put_line('time est Day:'|| round(l_end/(l_start -l_end)/60/24,2)); 9 end; 10 / PL/SQL procedure successfully completed. SQL> set serveroutput on SQL> / time est Day:.04 SQL> select tablespace_name,status,sum(bytes)/1024/1024 mb from DBA_UNDO_EXTENTS group by tablespace_name,status; TABLESPACE_NAME STATUS MB ------------------------------ --------- ---------- UNDOTBS1 ACTIVE 98.0625 UNDOTBS1 EXPIRED 14017.8125 UNDOTBS1 UNEXPIRED 45236.3125 UNDOTBS2 ACTIVE 78.5625 UNDOTBS2 EXPIRED 3022 UNDOTBS2 UNEXPIRED 6706.8125
对不起,这篇文章暂时关闭评论。