案例: RAC Hang wait ‘library cache lock’ & ‘latch: row cache objects’ 在做了大量的表分区DDL后
数据库中做分区表DDL是较为常见的维护,如运营商中大量以月份为分区键的分区表,前不久就在例行了大量的分区表几个小时后数据库双实例hang死,下面分享这个案例希望有相同环境的DBA们早日预防。
环境是11.2.0.3.7的2Nodes RAC ON HPUX 11.31, 当晚在2节点做了几万次的分区维护但是在2:00前就已做完,还好在营业厅开门前,7:00有人反应数据库实例无法连接,开始是发现节点2连接数满,后维护人员KILL了所有的前台进程反应当时全是Library cache lock等待事件,但是KILL前没有解决问题,不久节点1出现了同样的症状。我登录上查看节点1的活动会话已接近4000,几乎全是Library cache lock的会话,blocker session 在等待DFS lock handle , DFS lock handle 是一种OPS时分布式文件系统的一种等待事件通常是RAC相关,可以转换为具体的event.
为了尽快恢复业务,开始KILL 节点1 的会话,但是会话变成KILLED状态未释放,随后决定重启该实例1,shutdown immediate无法停止,alert log 一直提示“SHUTDOWN: waiting for active calls”,等待了10多分钟后决定kill pmon 强行终止(不到最后非常不建议这种手段), 实例重启后数据库一切恢复正常。 下面只能使用日志和ASH信息分析原因。
-- node1 SQL> select * from ( 2 select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime desc) rn 3 from ( 4 select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*)*10 dbtime 5 from dbmt.dash1027 6 where INSTANCE_NUMBER=1 7 and sample_time between to_date('2017-10-27 07:00:00','yyyy-mm-dd hh24:mi:ss') AND to_date('2017-10-27 08:50:00','yyyy-mm-dd hh24:mi:ss') 8 group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event 9 ) 10 ) where rn<=5; ETIME EVENTS DBTIME PCT RN -------------- ------------------------------ ----------- ------------------------------ ------------------------------ 20171027 07:00 enq: SQ - contention 147580 45.71 1 20171027 07:00 library cache lock 134870 41.77 2 20171027 07:00 on cpu 11810 3.66 3 20171027 07:00 gc buffer busy acquire 8210 2.54 4 20171027 07:00 db file sequential read 4070 1.26 5 20171027 07:10 on cpu 13110 42.45 1 20171027 07:10 library cache lock 6580 21.31 2 20171027 07:10 db file sequential read 5230 16.94 3 20171027 07:10 db file scattered read 1660 5.38 4 20171027 07:10 read by other session 800 2.59 5 20171027 07:20 enq: SQ - contention 69190 44.18 1 20171027 07:20 library cache lock 62140 39.68 2 20171027 07:20 on cpu 12660 8.08 3 20171027 07:20 db file sequential read 4330 2.76 4 20171027 07:20 db file scattered read 1960 1.25 5 --node 2 SQL> select * from ( 2 select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime desc) rn 3 from ( 4 select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*)*10 dbtime 5 from dbmt.dash1027 6 where INSTANCE_NUMBER=2 7 and sample_time between to_date('2017-10-27 07:00:00','yyyy-mm-dd hh24:mi:ss') AND to_date('2017-10-27 08:50:00','yyyy-mm-dd hh24:mi:ss') 8 group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event 9 ) 10 ) where rn<=5; ETIME EVENTS DBTIME PCT RN -------------- ---------------------------------------------------- ------------------------------ ------------------------------ 20171027 07:00 latch: row cache objects 27250 23.92 1 20171027 07:00 enq: SQ - contention 27140 23.82 2 20171027 07:00 library cache load lock 18460 16.2 3 20171027 07:00 gc buffer busy release 11270 9.89 4 20171027 07:00 on cpu 8130 7.14 5 20171027 07:10 on cpu 6720 34.62 1 20171027 07:10 Backup: MML write backup piece 2620 13.5 2 20171027 07:10 SQL*Net message from dblink 2390 12.31 3 20171027 07:10 db file sequential read 1700 8.76 4 20171027 07:10 latch: row cache objects 1430 7.37 5 20171027 07:20 enq: SQ - contention 16230 34.24 1 20171027 07:20 latch: row cache objects 10240 21.6 2 20171027 07:20 on cpu 5910 12.47 3 20171027 07:20 library cache: mutex X 3410 7.19 4 20171027 07:20 Backup: MML write backup piece 2490 5.25 5 SQL> select INSTANCE_NUMBER,sql_id,p2, 2 --blocking_session, 3 blocking_inst_id, 4 count(*) 5 from dbmt.dash1027 where 6 sample_time between to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND to_date('2017-10-27 07:10:00','yyyy-mm-dd hh24:mi:ss') 7 and event='enq: SQ - contention' group by INSTANCE_NUMBER,sql_id,p2 8 --,blocking_session 9 ,blocking_inst_id 10 order by 5 desc; INSTANCE_NUMBER SQL_ID P2 BLOCKING_INST_ID COUNT(*) ------------------------------ ------------- ------------------------------ ------------------------------ ------------------------------ 1 2783h5bwb20f8 8260071 1 11718 2 362 2 2706 1 4rhpkb0fh39nc 8260071 1 2509 1 08324f71s5na3 8260071 1 386 1 6fa71k5cafbh4 8260071 1 77 SQL> @oid 8260071 owner object_name object_type SUBOBJECT_NAME CREATED LAST_DDL_TIME status DATA_OBJECT_ID ------------------------- ------------------------------ ------------------ ------------------------------ ----------------- ----------------- --------- ------------------------------ ANBOB SEQ_OID SEQUENCE 20150324 02:47:42 20150807 15:08:24 VALID 1 row selected. SQL> @seq anbob.SEQ_OID SEQUENCE_OWNER SEQUENCE_NAME MIN_VALUE MAX_VALUE INCREMENT_BY C O CACHE_SIZE LAST_NUMBER ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ - - ------------------------------ ------------------------------ ANBOB SEQ_OID 88100000000000 1000000000000000000000000000 4 N N 10000 88507311739938 SQL> @sqlid 2783h5bwb20f8 % Show SQL text, child cursors and execution stats for SQLID 2783h5bwb20f8 child % HASH_VALUE PLAN_HASH_VALUE CH# SQL_TEXT ------------------------------ ------------------------------ ---- ------------------------------------------------------------------------------------------------------------------------------------------------------ 4172349896 207917406 0 select to_char(seq_oid.nextval) oid from dual 1 row selected. CH# PARENT_HANDLE OBJECT_HANDLE PLAN_HASH PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED ROWS_PER_FETCH CPU_SEC CPU_SEC_EXEC ELA_SEC LIOS PIOS SORTS USERS_EXECUTING ---- ---------------- ---------------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ 0 C00000153DA00458 C00000153D9FF2F8 207917406 5891 1 12580228 12580235 12580073 .99998712265708867918604064233 1174.11 .00009332978702770728797602078 1362.309233 9239 0 0 0
Note:
当时的ENQ:SQ 事件也较为严重,结合AWR,根据SQL 和sequnece相关业务和执行频率无异常,以当前的SEQ调用频率,从AWR “Wait Event Histogram Detail”查看enq:sq每次平均wait时间太长,sq问题应该发生在row cache填充时, 判断当时shared pool 应该存在性能问题。
下面查询当时的session wait chains
SQL> select INSTANCE_NUMBER,sql_id,p1, 2 blocking_session,blocking_session_serial#, 3 blocking_inst_id, 4 count(*) 5 from dbmt.dash1027 where 6 sample_time between to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND to_date('2017-10-27 07:10:00','yyyy-mm-dd hh24:mi:ss') 7 and event='library cache lock' group by INSTANCE_NUMBER,sql_id,p1 8 ,blocking_session,blocking_session_serial# 9 ,blocking_inst_id 10 order by 7 desc; INSTANCE_NUMBER SQL_ID P1 BLOCKING_SESSION BLOCKING_SESSION_SERIAL# BLOCKING_INST_ID COUNT(*) ---------------- ------------- ---------------------- ------------------ ------------------------------ ----------------- -------- 1 13835058141231219944 8502 6515 1 13372 2 13835058145195890800 131 1 13835058141231219944 52 1 13835058141231219944 9694 58887 1 20 1 13835058141231219944 13307 40389 1 13 1 13835058141231219944 7143 57041 1 10 1 13835058141231219944 14432 64427 1 8 SQL> select sample_time,program,machine,sql_id,event,p1text,p1,p2text,p2,p3text,p3,session_state, blocking_session,blocking_session_serial#,blocking_inst_id,current_obj#,delta_time from dbmt.dash1027 where sample_time between to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND to_date('2017-10-27 07:12:00','yyyy-mm-dd hh24:mi:ss') and session_id=8502 and session_serial#=6515 and instance_number=1 order by 1; SAMPLE_TIME PROGRAM MACHINE SQL_ID EVENT P1TEXT P1 P2TEXT BLOCKING_SESSION BS_SERIAL# BLOCKING_INST_ID ------------------------------ -------------------------- --------- ------------- ------------------- ----------- -------------- ------- ----------------- -------- ---------------- 27-OCT-17 07.03.35.009 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 27-OCT-17 07.03.45.459 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.03.56.219 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.04.06.639 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.04.17.168 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.04.27.678 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 ... ... have truncated ... 27-OCT-17 07.06.47.259 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.06.58.509 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.07.09.389 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.07.20.399 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 27-OCT-17 07.07.31.429 AM Deal@anbob (TNS V1-V3) anbob DFS lock handle type|mode 1230372869 id1 10717 1 2 SQL> select chr(bitand(&&p11,-16777216)/16777215) || 2 chr(bitand(&&p11,16711680)/65535) type, 3 mod(&&p11, 16) md 4 from dual; Enter value for p11: 1230372869 TY MD -- ------------------------------ IV 5 SQL> select * from v$lock_type where type='IV'; TYPE NAME ID1_TAG ID2_TAG IS_ ---------------------------------------------------------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ---------------------------------------------------------------- --- DESCRIPTION ------------------------------------------------------------------------------------------------------------------------------------------------------------------- IV Library Cache Invalidation object # time stamp NO Synchronizes library cache object invalidations across instances SQL> select sample_time,program,machine,sql_id,event,p1text,p1,p2text,p2,p3text,p3,session_state,blocking_session,blocking_session_serial#,blocking_inst_id,current_obj#,delta_time 2 from dbmt.dash1027 where 3 sample_time between to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND to_date('2017-10-27 07:12:00','yyyy-mm-dd hh24:mi:ss') 4 and session_id=10717 and session_serial#=1 and instance_number=2 order by 1; SAMPLE_TIME PROGRAM MACHINE SQL_ID EVENT BLOCKING_SESSION BS_SERIAL# BS_INST_ID ----------------------------------- ----------------------------- --------------- ------------- ------------------------------ ---------------- ---------- ---------- 27-OCT-17 07.00.05.868 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 11596 14855 2 27-OCT-17 07.00.26.527 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 2329 15239 2 27-OCT-17 07.00.47.258 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 3132 7895 2 27-OCT-17 07.01.49.888 AM oracle@anbob2 (LCK0) anbob2 27-OCT-17 07.02.00.367 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 7055 45921 2 27-OCT-17 07.02.10.667 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 6283 57835 2 27-OCT-17 07.02.21.028 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 2329 15239 2 27-OCT-17 07.02.52.059 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 13863 58359 2 27-OCT-17 07.03.02.540 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 6283 57835 2 27-OCT-17 07.03.23.199 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 5987 63625 2 27-OCT-17 07.03.33.569 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.03.43.870 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.03.54.239 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.04.04.719 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 11303 22599 2 27-OCT-17 07.04.15.039 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.04.25.429 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 9384 46843 2 27-OCT-17 07.04.35.820 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 17485 3 2 27-OCT-17 07.04.46.199 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 17485 3 2 27-OCT-17 07.04.56.519 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.05.07.030 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.05.17.419 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.05.27.740 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.05.38.140 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.05.48.589 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 5925 61 2 27-OCT-17 07.05.58.979 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 5925 61 2 27-OCT-17 07.06.09.609 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.06.20.060 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 7339 3825 2 27-OCT-17 07.06.30.529 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 3692 39497 2 27-OCT-17 07.06.40.930 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.06.51.380 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.07.02.030 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 12996 63675 2 27-OCT-17 07.07.12.452 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.07.22.920 AM oracle@anbob2 (LCK0) anbob2 latch: row cache objects 27-OCT-17 07.10.20.190 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 27-OCT-17 07.11.22.280 AM oracle@anbob2 (LCK0) anbob2 latch: shared pool 11584 41211 2 35 rows selected.
Note:
堵塞链条chains: 大部分session等待事件library cache lock
==>1:8502等待事件等待事件 “DFS lock handle”是在等待其它实例invalidation lock,也就是释放锁资源,用p1的值可以转换成具体的enq, 这里是一种IV, 一种跨实例的串行使library cache object InValidations的Library cache锁等待,通常发生在当发生了DDL或收集了统计信息后使相关的cursor失效。(FG process)
==> 2:10717 (LCK BG process) 此进程是lck0的后台进程,LCK进程出现latch:row cache objects的时间段,正是2个实例第一个负载高峰的时间段,Row cache 也叫dictionary cache,存储的是表、列、索引、用户等元数据。Lck 的wait通常都是很短暂,从LCK所全时段的wait event分析, 两次负载高峰时间段lck进程都是在等待latch:row cache objects, 是在加载或清理row cache object时申请的latch.
分析row cache objects
当lck0等待latch: row cache objects时,P1的值为13835058147926472600,因为篇幅原因不再附上
SQL> @dec 13835058147926472600 DEC HEX ----------------------------------- -------------------- 13835058147926472600.000000 C00000159207F798 SQL> select latch#,child# ,name,gets,misses from v$latch_children where addr like 'C00000159207F798'; LATCH# CHILD# NAME GETS MISSES --------------- --------------- ---------------------------------------------------------------- --------------- --------------- 280 13 row cache objects 2458650848 8927867 SQL> select s.kqrstcln latch#,kqrstcid cache#,kqrsttxt name from x$kqrst s where kqrstcln=13; LATCH# CACHE# NAME --------------- --------------- -------------------------------- 13 16 dc_histogram_defs 13 16 dc_histogram_data 13 16 dc_histogram_data
TIP:
根据P1的值可以判断当时LCK进程等待持有latch对象是dc_histogram_*, 与计算表列上的柱状图相关。
从ASH后来的时间发现lck0出现了row cache cleaup, 用于清理row cache中的对象,结合IV enq,可以判断是在清理InValidations的对象,通常发现在当一个节点做了DDL或者DCL或shared pool内存紧张时清理row cache object,其它实例同样也需要清理与其相关的在row cache中的对象。
另外从DB ALERT LOG中发现该时间段出现过ora-4031
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_26816.trc (incident=1565538): ORA-04031: unable to allocate 20504 bytes of shared memory ("shared pool","unknown object","sga heap(4,0)","KTI SGA freea") Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_1565538/anbob2_ora_26816_i1565538.trc -- anbob2_ora_26816.trc ============================== Memory Utilization of Subpool 4 ================================ Allocation Name Size ___________________________ ____________ "free memory " 223004992 "miscellaneous " 1024 ...
但从当时的trace文件中显示当时Subpool 4还有200MB左右的空闲空间,所以当时应该是shared pool中应该是存在碎片.同时从AWR的历史记录看shared pool中的较大的组件”KQR L PO”从晚2点做的分区维护以后一直从300MB增长到出问题时1200MB,最后成为shared pool中最大的资源内存区。同时对比前两个月例行分区维护时该组件不存在该现象最大到700M 然后逐渐释放,以上说明本月的row cache的内存管理出现问题。
KQR ==> [K]ernel [Q]uery Layer [R]ow Cache Management 是指ROW Cache Object管理的相关资源.
RAC中每个节点可能都有某张表引用信息在ROW(dictionary) cache中,当任何一个节点上修改了表的结构,都需要把所有节点上把与该表相关的对象置为invaild, 所以在传统的library cache lock外每个节点上的lck0 进程会对本 instance library cache中的对象加一个,shared-mode的iv(invalidation)instance lock,如果想修改对象定义,必须获取一个exclusive的 iv lock,会通知本地的lck0释放shared-mode lock,本地lck0在释放这个lock之前,会通知其它节点上的lck0,其它节点上的lck0收到这个消息后,就会将本地library cache中的所有相关对象置为invalidate属于广播机制有进程lmd 完成。所以oracle对这些资源采取了广播机制,每个节点上的变化都通知所有节点。
查看当时LCK进程的trace 文件
loadavg : 0.09 0.12 0.17 Swapinfo : Avail = 629429.25Mb Used = 205304.09Mb Swap free = 424074.81Mb Kernel rsvd = 27521.14Mb Free Mem = 316882.34Mb F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD 3401 S oracle 10708 1 0 178 20 e0000012722b7700 92499 e0000013cad17240 Jul 13 ? 1108:05 ora_lck0_tbcsb2 Short stack dump: ksedsts()+544<-ksdxfstk()+48<-ksdxcb()+3216<-sspuser()+688<-<-_pw_wait()+48<-pw_wait()+112<-sskgpwwait()+432<-skgpwwait()+320<-kslges()+2208<-kslgetl()+784 <-kqrbip()+1136<-kqrbfr()+1232<-kqrbgl()+368<-ksbabs()+1008<-kclabs()+400<-ksbrdp()+2736<-opirip()+1296<-opidrv()+1152 <-sou2o()+256<-opimai_real()+352<-ssthrdmain()+576<-main()+336<-main_opd_entry()+80 SO: 0xc000001624721918, type: 4, owner: 0xc000001642e31a08, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc000001642e31a08, name=session, file=ksu.h LINE:12624 ID:, pg=0 (session) sid: 10717 ser: 1 trans: 0x0000000000000000, creator: 0xc000001642e31a08 flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x409) -/-/INC DID: , short-term DID: txn branch: 0x0000000000000000 oct: 0, prv: 0, sql: 0x0000000000000000, psql: 0x0000000000000000, user: 0/SYS ksuxds FALSE at location: 0 service name: SYS$BACKGROUND Current Wait Stack: 1: waiting for 'latch: row cache objects' address=0xc00000158216b238, number=0x118, tries=0x0 wait_id=2575238331 seq_num=6566 snap_id=1 wait times: snap=8.825235 sec, exc=8.825235 sec, total=8.825235 sec wait times: max=infinite, heur=8.825235 sec wait counts: calls=0 os=0 in_wait=1 iflags=0x520 0: waiting for 'row cache cleanup' =0x0, =0x0, =0x0 wait_id=2575235501 seq_num=6565 snap_id=2830 wait times: snap=0.000000 sec, exc=0.491436 sec, total=12 min 50 sec wait times: max=infinite, heur=12 min 50 sec wait counts: calls=0 os=0 in_wait=1 iflags=0x15a2 There are 2618 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 9422, ser: 23985 wait event: 'DFS lock handle' p1: 'type|mode'=0x49560005 p2: 'id1'=0x4c4f434b p3: 'id2'=0x21 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 631 secs, waiter_cache_ver: 447 Wait State: fixed_waits=0 flags=0x22 boundary=0x0000000000000000/-1 Session Wait History: elapsed time of 0.000000 sec since current wait 0: waited for 'row cache cleanup' =0x0, =0x0, =0x0 wait_id=2575235501 seq_num=6565 snap_id=2830 wait times: snap=0.000126 sec, exc=0.491436 sec, total=12 min 41 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 1: waited for 'latch: shared pool' address=0xc00000006fee4438, number=0x133, tries=0x0 wait_id=2575238330 seq_num=6564 snap_id=1 wait times: snap=0.058322 sec, exc=0.058322 sec, total=0.058322 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 2: waited for 'row cache cleanup' =0x0, =0x0, =0x0 wait_id=2575235501 seq_num=6563 snap_id=2829 wait times: snap=0.000074 sec, exc=0.491310 sec, total=12 min 41 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 3: waited for 'latch: row cache objects' address=0xc00000158216b238, number=0x118, tries=0x0 wait_id=2575238329 seq_num=6562 snap_id=1 wait times: snap=14.639462 sec, exc=14.639462 sec, total=14.639462 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 4: waited for 'row cache cleanup' =0x0, =0x0, =0x0 wait_id=2575235501 seq_num=6561 snap_id=2828 wait times: snap=0.002836 sec, exc=0.491236 sec, total=12 min 27 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 5: waited for 'latch: shared pool' address=0xc00000006fee4438, number=0x133, tries=0x0 wait_id=2575238328 seq_num=6560 snap_id=1 wait times: snap=0.073927 sec, exc=0.073927 sec, total=0.073927 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 6: waited for 'row cache cleanup' =0x0, =0x0, =0x0 wait_id=2575235501 seq_num=6559 snap_id=2827 wait times: snap=0.000105 sec, exc=0.488400 sec, total=12 min 27 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 7: waited for 'latch: shared pool' address=0xc00000006fee4438, number=0x133, tries=0x0 wait_id=2575238327 seq_num=6558 snap_id=1 wait times: snap=0.029721 sec, exc=0.029721 sec, total=0.029721 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 8: waited for 'row cache cleanup' =0x0, =0x0, =0x0 wait_id=2575235501 seq_num=6557 snap_id=2826 wait times: snap=0.000074 sec, exc=0.488295 sec, total=12 min 27 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 9: waited for 'latch: row cache objects' address=0xc00000158216b238, number=0x118, tries=0x0 wait_id=2575238326 seq_num=6556 snap_id=1 wait times: snap=7.106811 sec, exc=7.106811 sec, total=7.106811 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time *** 2017-10-27 07:03:44.749 LCK0 (ospid: 10708) has not moved for 35 sec (1509059024.1509058989) -- *** 2017-10-27 07:04:04.789 LCK0 (ospid: 10708) has not moved for 55 sec (1509059044.1509058989) -- *** 2017-10-27 07:04:24.829 LCK0 (ospid: 10708) has not moved for 74 sec (1509059063.1509058989) -- *** 2017-10-27 07:04:44.869 LCK0 (ospid: 10708) has not moved for 95 sec (1509059084.1509058989) -- *** 2017-10-27 07:05:04.909 LCK0 (ospid: 10708) has not moved for 115 sec (1509059104.1509058989) -- ... -- *** 2017-10-27 07:06:45.110 LCK0 (ospid: 10708) has not moved for 215 sec (1509059204.1509058989) -- *** 2017-10-27 07:07:05.150 LCK0 (ospid: 10708) has not moved for 235 sec (1509059224.1509058989) -- *** 2017-10-27 07:07:25.190 LCK0 (ospid: 10708) has not moved for 255 sec (1509059244.1509058989) -- *** 2017-10-27 07:22:52.475 LCK0 (ospid: 10708) has not moved for 39 sec (1509060172.1509060133) -- *** 2017-10-27 07:26:14.056 LCK0 (ospid: 10708) has not moved for 25 sec (1509060373.1509060348) -- *** 2017-10-27 07:26:34.096 LCK0 (ospid: 10708) has not moved for 45 sec (1509060393.1509060348) -- *** 2017-10-27 07:26:54.167 ... ... ... *** 2017-10-27 08:07:20.780 LCK0 (ospid: 10708) has not moved for 1689 sec (1509062840.1509061151) -- *** 2017-10-27 08:07:40.820 LCK0 (ospid: 10708) has not moved for 1709 sec (1509062860.1509061151)
Note:
期间连续累积的等待时间(已截取了部分输出)可以看出其间lck0进程出现了2次 not moved 超过100秒的时间段,和数据库负载高峰对应,最后一次是从7:26到8:07之间半小时左右lck0 都未活动。
LCK: This lock process manages requests that are not cache-fusion requests. Requests like row cache requests and library cache requests. Only a single LCK process is allowed for each instance.
PMON is waiting for an LCK process to cleanup the lock context after a foreground process died while doing a global cache lock operation.
The shared pool is stressed and memory need to be freed for the new cursors. As a consequence, the dictionary cache is reduced in size by the LCK process causing a temporal hang of the instance since the LCK can’t do other activity during that time. Since the dictionary cache is a memory area protected clusterwide in RAC, the LCK is responsible to free it in collaboration with the dictionary cache users (the sessions using cursors referenced in the dictionary cache). This process can be time consuming when the dictionary cache is big.
当shared pool里的对象需要为新的对象释放空间时如sql cursor, LCK进程降低Row Cache 大小期间使数据库临时hang, 因为在RAC环境中LCK进程负责释放持有row cache的用户进程协调工作及Library cache 的请求, 如果LCK出现性能问题也就会导致library cache object无法请求和会话补KILL后的释放row cache堵塞。Pmon 进程也需要等待LCK进程清理died进程后才能清理锁上下文。这个是为什么大量的等待library cache lock, library cache object load的事件和kill 会话无法使问题得到缓解的原因。 甚至因为中间件的session 重连机制,在kill 会话后大批量的新会话请求 也会加剧shared pool 占用的原因,在原资源未释放,新资源再大量请求时, shared pool内存管理出现问题 就更容易出现ora-4031。而且如要row cache越大后期LCK进程释放资源时就需要消耗更长的时间(每个instance 只有一个lck进程)。
该类现象相关的BUG:
Bug 8215444 Many processes hang in RAC with wait event “DFS lock handle”
Bug 16529874 Frequent “LCK(ospid: ****) has not moved for 20 sec” trace messages
BUG:8666117 – LCK0 PROCESS STUCK AT WAITING FOR “LATCH: ROW CACHE OBJECTS”
Bug 18199537 – RAC database becomes almost hung when large amount of row cache are used in shared pool
Bug 13814739 Excessive KQR X PO” allocations in a RAC environment (can cause ORA-4031)
只有18199537和当前的版本符合,同时也存在16529874提到的现象,开SR后也同样确认是该BUG。在Oracle最佳实践中也建议安装该Oneoff patch。
解决方案:
1,安装 one off patch 18199537 or upgrade 11.2.0.4.4 or above
Event 18199537 can be set to level 10 or more to reduce contention during row cache cleanup if it is not resolved with default value 1.
Event can also be set to 100 or 200 if there are too many concurrent sessions involved.
SQL> alter system set events ‘18199537 trace name context forever, level 100’;
2,同时建议修复16529874
— update 20240507 —
前台进程出现大量latch: row cache objects, LCK1进程出row cache object free或row cache cleanup等待132s, 系统hang了2分钟后自动恢复, 因为没有做systemstate dump和hanganalyze, 版本是12.1.0.2也在这个bug范围内。
RAC Database Hang Due To Row Cache Objects Latch Contention While LCK0 Waiting For Row Cache Cleanup (Doc ID 2491737.1)
LCK0:
ksedsts <- ksdxfstk <- ksdxcb <- sspuser <- __sighndlr <- call_user_handler <- sigacthandler <- _syscall6 <- sskgpwwait <- skgpwwait <- kslges <- kghfrunp <- kghfnd <- kghalo <- kghgex <- kghfnd <- kghalo <- kghgex <- kghfnd <- kghprmalo <- kghalp <- parchk <- pkaidl <- pkamh2 <- pdybF00_Init <- pdy1F01_Driver <- pdw0F82_Run_Code_Gen <- pdw0F01_Code_Gen <- phpcmp <- pcicogForeground: ksedsts <- ksdxfstk <- ksdxcb <- sspuser <- __sighndlr <- call_user_handler <- sigacthandler <- _syscall6 <- sskgpwwait <- skgpwwait <- kslges <- kqrbip <- kqrbfr <- kqrbgl <- ksbabs <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main
对不起,这篇文章暂时关闭评论。