首页 » ORACLE 9i-23ai » 案例: RAC Hang wait ‘library cache lock’ & ‘latch: row cache objects’ 在做了大量的表分区DDL后

案例: RAC Hang wait ‘library cache lock’ & ‘latch: row cache objects’ 在做了大量的表分区DDL后

数据库中做分区表DDL是较为常见的维护,如运营商中大量以月份为分区键的分区表,前不久就在例行了大量的分区表几个小时后数据库双实例hang死,下面分享这个案例希望有相同环境的DBA们早日预防。

环境是11.2.0.3.7的2Nodes RAC  ON HPUX 11.31, 当晚在2节点做了几万次的分区维护但是在2:00前就已做完,还好在营业厅开门前,7:00有人反应数据库实例无法连接,开始是发现节点2连接数满,后维护人员KILL了所有的前台进程反应当时全是Library cache lock等待事件,但是KILL前没有解决问题,不久节点1出现了同样的症状。我登录上查看节点1的活动会话已接近4000,几乎全是Library cache lock的会话,blocker session 在等待DFS lock handle , DFS lock handle 是一种OPS时分布式文件系统的一种等待事件通常是RAC相关,可以转换为具体的event.

为了尽快恢复业务,开始KILL 节点1 的会话,但是会话变成KILLED状态未释放,随后决定重启该实例1,shutdown immediate无法停止,alert log 一直提示“SHUTDOWN: waiting for active calls”,等待了10多分钟后决定kill pmon 强行终止(不到最后非常不建议这种手段), 实例重启后数据库一切恢复正常。 下面只能使用日志和ASH信息分析原因。

-- node1 
SQL> select * from (
  2      select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime  desc) rn
  3   from (
  4  select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*)*10 dbtime
  5   from dbmt.dash1027
  6   where INSTANCE_NUMBER=1
  7   and sample_time between  to_date('2017-10-27 07:00:00','yyyy-mm-dd hh24:mi:ss') AND  to_date('2017-10-27 08:50:00','yyyy-mm-dd hh24:mi:ss')
  8   group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event
  9  )
 10  ) where rn<=5;

ETIME          EVENTS                              DBTIME                            PCT                             RN
-------------- ------------------------------ ----------- ------------------------------ ------------------------------
20171027 07:00 enq: SQ - contention                147580                          45.71                              1
20171027 07:00 library cache lock                  134870                          41.77                              2
20171027 07:00 on cpu                               11810                           3.66                              3
20171027 07:00 gc buffer busy acquire                8210                           2.54                              4
20171027 07:00 db file sequential read               4070                           1.26                              5
20171027 07:10 on cpu                               13110                          42.45                              1
20171027 07:10 library cache lock                    6580                          21.31                              2
20171027 07:10 db file sequential read               5230                          16.94                              3
20171027 07:10 db file scattered read                1660                           5.38                              4
20171027 07:10 read by other session                  800                           2.59                              5
20171027 07:20 enq: SQ - contention                 69190                          44.18                              1
20171027 07:20 library cache lock                   62140                          39.68                              2
20171027 07:20 on cpu                               12660                           8.08                              3
20171027 07:20 db file sequential read               4330                           2.76                              4
20171027 07:20 db file scattered read                1960                           1.25                              5

--node 2
SQL>  select * from (
  2      select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime  desc) rn
  3   from (
  4  select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*)*10 dbtime
  5   from dbmt.dash1027
  6   where INSTANCE_NUMBER=2
  7   and sample_time between  to_date('2017-10-27 07:00:00','yyyy-mm-dd hh24:mi:ss') AND  to_date('2017-10-27 08:50:00','yyyy-mm-dd hh24:mi:ss')
  8   group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event
  9  )
 10  ) where rn<=5;

ETIME          EVENTS                                        DBTIME                            PCT                             RN
-------------- ---------------------------------------------------- ------------------------------ ------------------------------
20171027 07:00 latch: row cache objects                       27250                          23.92                              1
20171027 07:00 enq: SQ - contention                           27140                          23.82                              2
20171027 07:00 library cache load lock                        18460                           16.2                              3
20171027 07:00 gc buffer busy release                         11270                           9.89                              4
20171027 07:00 on cpu                                          8130                           7.14                              5
20171027 07:10 on cpu                                          6720                          34.62                              1
20171027 07:10 Backup: MML write backup piece                  2620                           13.5                              2
20171027 07:10 SQL*Net message from dblink                     2390                          12.31                              3
20171027 07:10 db file sequential read                         1700                           8.76                              4
20171027 07:10 latch: row cache objects                        1430                           7.37                              5
20171027 07:20 enq: SQ - contention                           16230                          34.24                              1
20171027 07:20 latch: row cache objects                       10240                           21.6                              2
20171027 07:20 on cpu                                          5910                          12.47                              3
20171027 07:20 library cache: mutex X                          3410                           7.19                              4
20171027 07:20 Backup: MML write backup piece                  2490                           5.25                              5

SQL> select INSTANCE_NUMBER,sql_id,p2,
  2  --blocking_session,
  3  blocking_inst_id,
  4  count(*)
  5  from dbmt.dash1027 where
  6    sample_time between  to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND  to_date('2017-10-27 07:10:00','yyyy-mm-dd hh24:mi:ss')
  7   and event='enq: SQ - contention' group by INSTANCE_NUMBER,sql_id,p2
  8   --,blocking_session
  9   ,blocking_inst_id
 10   order by 5 desc;

               INSTANCE_NUMBER SQL_ID                                    P2               BLOCKING_INST_ID                       COUNT(*)
------------------------------ ------------- ------------------------------ ------------------------------ ------------------------------
                             1 2783h5bwb20f8                        8260071                              1                          11718
                             2                                          362                              2                           2706
                             1 4rhpkb0fh39nc                        8260071                              1                           2509
                             1 08324f71s5na3                        8260071                              1                            386
                             1 6fa71k5cafbh4                        8260071                              1                             77

SQL> @oid 8260071

owner                     object_name                    object_type        SUBOBJECT_NAME                 CREATED           LAST_DDL_TIME     status                    DATA_OBJECT_ID
------------------------- ------------------------------ ------------------ ------------------------------ ----------------- ----------------- --------- ------------------------------
ANBOB                      SEQ_OID                        SEQUENCE                                          20150324 02:47:42 20150807 15:08:24 VALID

1 row selected.
SQL> @seq anbob.SEQ_OID

SEQUENCE_OWNER                 SEQUENCE_NAME                                       MIN_VALUE                      MAX_VALUE                   INCREMENT_BY C O                     CACHE_SIZE                    LAST_NUMBER
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ - - ------------------------------ ------------------------------
ANBOB                           SEQ_OID                                        88100000000000   1000000000000000000000000000                              4 N N                          10000                 88507311739938

SQL> @sqlid 2783h5bwb20f8 %
Show SQL text, child cursors and execution stats for SQLID 2783h5bwb20f8 child %

                    HASH_VALUE                PLAN_HASH_VALUE  CH# SQL_TEXT
------------------------------ ------------------------------ ---- ------------------------------------------------------------------------------------------------------------------------------------------------------
                    4172349896                      207917406    0 select to_char(seq_oid.nextval) oid from dual

1 row selected.


 CH# PARENT_HANDLE    OBJECT_HANDLE                         PLAN_HASH                         PARSES                       H_PARSES                     EXECUTIONS                        FETCHES                 ROWS_PROCESSED                 ROWS_PER_FETCH                        CPU_SEC                   CPU_SEC_EXEC                        ELA_SEC                           LIOS                           PIOS                          SORTS                USERS_EXECUTING
---- ---------------- ---------------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
   0 C00000153DA00458 C00000153D9FF2F8                      207917406                           5891                              1                       12580228                       12580235                       12580073 .99998712265708867918604064233                        1174.11 .00009332978702770728797602078                    1362.309233                           9239                              0                              0                              0

Note:
当时的ENQ:SQ 事件也较为严重,结合AWR,根据SQL 和sequnece相关业务和执行频率无异常,以当前的SEQ调用频率,从AWR “Wait Event Histogram Detail”查看enq:sq每次平均wait时间太长,sq问题应该发生在row cache填充时, 判断当时shared pool 应该存在性能问题。

下面查询当时的session wait chains

SQL> select INSTANCE_NUMBER,sql_id,p1,
  2  blocking_session,blocking_session_serial#,
  3  blocking_inst_id,
  4  count(*)
  5  from dbmt.dash1027 where
  6    sample_time between  to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND  to_date('2017-10-27 07:10:00','yyyy-mm-dd hh24:mi:ss')
  7   and event='library cache lock' group by INSTANCE_NUMBER,sql_id,p1
  8  ,blocking_session,blocking_session_serial#
  9   ,blocking_inst_id
 10   order by 7 desc;

 INSTANCE_NUMBER SQL_ID                            P1   BLOCKING_SESSION       BLOCKING_SESSION_SERIAL#  BLOCKING_INST_ID COUNT(*)
---------------- ------------- ---------------------- ------------------ ------------------------------ ----------------- --------
               1                 13835058141231219944               8502                           6515                 1    13372
               2                 13835058145195890800                                                                          131
               1                 13835058141231219944                                                                           52
               1                 13835058141231219944               9694                          58887                 1       20
               1                 13835058141231219944              13307                          40389                 1       13
               1                 13835058141231219944               7143                          57041                 1       10
               1                 13835058141231219944              14432                          64427                 1        8

SQL> select sample_time,program,machine,sql_id,event,p1text,p1,p2text,p2,p3text,p3,session_state,
blocking_session,blocking_session_serial#,blocking_inst_id,current_obj#,delta_time
    from dbmt.dash1027 
	where  sample_time between  to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND  to_date('2017-10-27 07:12:00','yyyy-mm-dd hh24:mi:ss')
     and  session_id=8502 and session_serial#=6515 and instance_number=1 order by 1;
	 

SAMPLE_TIME                    PROGRAM                    MACHINE   SQL_ID        EVENT               P1TEXT                  P1 P2TEXT   BLOCKING_SESSION BS_SERIAL# BLOCKING_INST_ID 
------------------------------ -------------------------- --------- ------------- ------------------- ----------- -------------- ------- -----------------  -------- ---------------- 
27-OCT-17 07.03.35.009 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                                                   
27-OCT-17 07.03.45.459 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.03.56.219 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.04.06.639 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.04.17.168 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.04.27.678 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
...
... have truncated
...
27-OCT-17 07.06.47.259 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.06.58.509 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.07.09.389 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.07.20.399 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 
27-OCT-17 07.07.31.429 AM      Deal@anbob (TNS V1-V3)     anbob                   DFS lock handle     type|mode       1230372869 id1                  10717         1                2 


SQL> select chr(bitand(&&p11,-16777216)/16777215) ||
  2         chr(bitand(&&p11,16711680)/65535) type,
  3         mod(&&p11, 16) md
  4         from dual;
Enter value for p11: 1230372869

TY                             MD
-- ------------------------------
IV                              5

SQL> select * from v$lock_type where type='IV';

TYPE                                                             NAME                                                             ID1_TAG                                                          ID2_TAG                                                          IS_
---------------------------------------------------------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ---
DESCRIPTION
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
IV                                                               Library Cache Invalidation                                       object #                                                         time stamp                                                       NO
Synchronizes library cache object invalidations across instances

SQL> select sample_time,program,machine,sql_id,event,p1text,p1,p2text,p2,p3text,p3,session_state,blocking_session,blocking_session_serial#,blocking_inst_id,current_obj#,delta_time
  2      from dbmt.dash1027 where
  3        sample_time between  to_date('2017-10-27 7:00:00','yyyy-mm-dd hh24:mi:ss') AND  to_date('2017-10-27 07:12:00','yyyy-mm-dd hh24:mi:ss')
  4     and session_id=10717 and session_serial#=1 and instance_number=2 order by 1;

SAMPLE_TIME                         PROGRAM                       MACHINE         SQL_ID        EVENT                          BLOCKING_SESSION BS_SERIAL# BS_INST_ID
----------------------------------- ----------------------------- --------------- ------------- ------------------------------ ---------------- ---------- ----------
27-OCT-17 07.00.05.868 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                        11596      14855          2
27-OCT-17 07.00.26.527 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                         2329      15239          2
27-OCT-17 07.00.47.258 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                         3132       7895          2
27-OCT-17 07.01.49.888 AM           oracle@anbob2 (LCK0)          anbob2                                                                                             
27-OCT-17 07.02.00.367 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                         7055      45921          2
27-OCT-17 07.02.10.667 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   6283      57835          2
27-OCT-17 07.02.21.028 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                         2329      15239          2
27-OCT-17 07.02.52.059 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                        13863      58359          2
27-OCT-17 07.03.02.540 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   6283      57835          2
27-OCT-17 07.03.23.199 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                         5987      63625          2
27-OCT-17 07.03.33.569 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.03.43.870 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.03.54.239 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.04.04.719 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                  11303      22599          2
27-OCT-17 07.04.15.039 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.04.25.429 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   9384      46843          2
27-OCT-17 07.04.35.820 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                  17485          3          2
27-OCT-17 07.04.46.199 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                  17485          3          2
27-OCT-17 07.04.56.519 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.05.07.030 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.05.17.419 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.05.27.740 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.05.38.140 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.05.48.589 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   5925         61          2
27-OCT-17 07.05.58.979 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   5925         61          2
27-OCT-17 07.06.09.609 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.06.20.060 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   7339       3825          2
27-OCT-17 07.06.30.529 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                   3692      39497          2
27-OCT-17 07.06.40.930 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.06.51.380 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.07.02.030 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                  12996      63675          2
27-OCT-17 07.07.12.452 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.07.22.920 AM           oracle@anbob2 (LCK0)          anbob2                        latch: row cache objects                                             
27-OCT-17 07.10.20.190 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                                                   
27-OCT-17 07.11.22.280 AM           oracle@anbob2 (LCK0)          anbob2                        latch: shared pool                        11584      41211          2
                                                                                                                               
35 rows selected.

Note:
堵塞链条chains: 大部分session等待事件library cache lock
==>1:8502等待事件等待事件 “DFS lock handle”是在等待其它实例invalidation lock,也就是释放锁资源,用p1的值可以转换成具体的enq, 这里是一种IV, 一种跨实例的串行使library cache object InValidations的Library cache锁等待,通常发生在当发生了DDL或收集了统计信息后使相关的cursor失效。(FG process)
==> 2:10717 (LCK BG process) 此进程是lck0的后台进程,LCK进程出现latch:row cache objects的时间段,正是2个实例第一个负载高峰的时间段,Row cache 也叫dictionary cache,存储的是表、列、索引、用户等元数据。Lck 的wait通常都是很短暂,从LCK所全时段的wait event分析, 两次负载高峰时间段lck进程都是在等待latch:row cache objects, 是在加载或清理row cache object时申请的latch.

分析row cache objects
当lck0等待latch: row cache objects时,P1的值为13835058147926472600,因为篇幅原因不再附上

SQL> @dec 13835058147926472600

                                DEC                  HEX
----------------------------------- --------------------
        13835058147926472600.000000     C00000159207F798

SQL> select latch#,child# ,name,gets,misses from  v$latch_children where addr like 'C00000159207F798';

         LATCH#          CHILD# NAME                                                                        GETS          MISSES
--------------- --------------- ---------------------------------------------------------------- --------------- ---------------
            280              13 row cache objects                                                     2458650848         8927867
            
SQL>  select s.kqrstcln latch#,kqrstcid cache#,kqrsttxt name from x$kqrst s where kqrstcln=13;

         LATCH#          CACHE# NAME
--------------- --------------- --------------------------------
             13              16 dc_histogram_defs
             13              16 dc_histogram_data
             13              16 dc_histogram_data

TIP:
根据P1的值可以判断当时LCK进程等待持有latch对象是dc_histogram_*, 与计算表列上的柱状图相关。
从ASH后来的时间发现lck0出现了row cache cleaup, 用于清理row cache中的对象,结合IV enq,可以判断是在清理InValidations的对象,通常发现在当一个节点做了DDL或者DCL或shared pool内存紧张时清理row cache object,其它实例同样也需要清理与其相关的在row cache中的对象。

另外从DB ALERT LOG中发现该时间段出现过ora-4031

Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_26816.trc  (incident=1565538):
ORA-04031: unable to allocate 20504 bytes of shared memory ("shared pool","unknown object","sga heap(4,0)","KTI SGA freea")
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_1565538/anbob2_ora_26816_i1565538.trc


-- anbob2_ora_26816.trc
==============================
Memory Utilization of Subpool 4
================================
      Allocation Name            Size
___________________________  ____________
"free memory              "     223004992
"miscellaneous            "          1024
...

但从当时的trace文件中显示当时Subpool 4还有200MB左右的空闲空间,所以当时应该是shared pool中应该是存在碎片.同时从AWR的历史记录看shared pool中的较大的组件”KQR L PO”从晚2点做的分区维护以后一直从300MB增长到出问题时1200MB,最后成为shared pool中最大的资源内存区。同时对比前两个月例行分区维护时该组件不存在该现象最大到700M 然后逐渐释放,以上说明本月的row cache的内存管理出现问题。
KQR ==> [K]ernel [Q]uery Layer [R]ow Cache Management 是指ROW Cache Object管理的相关资源.

RAC中每个节点可能都有某张表引用信息在ROW(dictionary) cache中,当任何一个节点上修改了表的结构,都需要把所有节点上把与该表相关的对象置为invaild, 所以在传统的library cache lock外每个节点上的lck0 进程会对本 instance library cache中的对象加一个,shared-mode的iv(invalidation)instance lock,如果想修改对象定义,必须获取一个exclusive的 iv lock,会通知本地的lck0释放shared-mode lock,本地lck0在释放这个lock之前,会通知其它节点上的lck0,其它节点上的lck0收到这个消息后,就会将本地library cache中的所有相关对象置为invalidate属于广播机制有进程lmd 完成。所以oracle对这些资源采取了广播机制,每个节点上的变化都通知所有节点。

查看当时LCK进程的trace 文件

loadavg : 0.09 0.12 0.17
Swapinfo :
        Avail = 629429.25Mb Used = 205304.09Mb
        Swap free = 424074.81Mb Kernel rsvd = 27521.14Mb
        Free Mem  = 316882.34Mb
  F S      UID   PID  PPID  C PRI NI             ADDR   SZ            WCHAN    STIME TTY       TIME COMD
3401 S   oracle 10708     1  0 178 20 e0000012722b7700 92499 e0000013cad17240  Jul 13  ?        1108:05 ora_lck0_tbcsb2
Short stack dump:
ksedsts()+544<-ksdxfstk()+48<-ksdxcb()+3216<-sspuser()+688<-<-_pw_wait()+48<-pw_wait()+112<-sskgpwwait()+432<-skgpwwait()+320<-kslges()+2208<-kslgetl()+784
<-kqrbip()+1136<-kqrbfr()+1232<-kqrbgl()+368<-ksbabs()+1008<-kclabs()+400<-ksbrdp()+2736<-opirip()+1296<-opidrv()+1152
<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+576<-main()+336<-main_opd_entry()+80

 
 SO: 0xc000001624721918, type: 4, owner: 0xc000001642e31a08, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0xc000001642e31a08, name=session, file=ksu.h LINE:12624 ID:, pg=0
    (session) sid: 10717 ser: 1 trans: 0x0000000000000000, creator: 0xc000001642e31a08
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x409) -/-/INC
              DID: , short-term DID:
              txn branch: 0x0000000000000000
              oct: 0, prv: 0, sql: 0x0000000000000000, psql: 0x0000000000000000, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
     1: waiting for 'latch: row cache objects'
        address=0xc00000158216b238, number=0x118, tries=0x0
        wait_id=2575238331 seq_num=6566 snap_id=1
        wait times: snap=8.825235 sec, exc=8.825235 sec, total=8.825235 sec
        wait times: max=infinite, heur=8.825235 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x520
     0: waiting for 'row cache cleanup'
        =0x0, =0x0, =0x0
        wait_id=2575235501 seq_num=6565 snap_id=2830
        wait times: snap=0.000000 sec, exc=0.491436 sec, total=12 min 50 sec
        wait times: max=infinite, heur=12 min 50 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x15a2
    There are 2618 sessions blocked by this session.
    Dumping one waiter:
      inst: 1, sid: 9422, ser: 23985
      wait event: 'DFS lock handle'
        p1: 'type|mode'=0x49560005
        p2: 'id1'=0x4c4f434b
        p3: 'id2'=0x21
      row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
      min_blocked_time: 631 secs, waiter_cache_ver: 447
    Wait State:
      fixed_waits=0 flags=0x22 boundary=0x0000000000000000/-1
    Session Wait History:
        elapsed time of 0.000000 sec since current wait
     0: waited for 'row cache cleanup'
        =0x0, =0x0, =0x0
        wait_id=2575235501 seq_num=6565 snap_id=2830
        wait times: snap=0.000126 sec, exc=0.491436 sec, total=12 min 41 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     1: waited for 'latch: shared pool'
        address=0xc00000006fee4438, number=0x133, tries=0x0
        wait_id=2575238330 seq_num=6564 snap_id=1
        wait times: snap=0.058322 sec, exc=0.058322 sec, total=0.058322 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     2: waited for 'row cache cleanup'
        =0x0, =0x0, =0x0
        wait_id=2575235501 seq_num=6563 snap_id=2829
        wait times: snap=0.000074 sec, exc=0.491310 sec, total=12 min 41 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     3: waited for 'latch: row cache objects'
        address=0xc00000158216b238, number=0x118, tries=0x0
        wait_id=2575238329 seq_num=6562 snap_id=1
        wait times: snap=14.639462 sec, exc=14.639462 sec, total=14.639462 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     4: waited for 'row cache cleanup'
        =0x0, =0x0, =0x0
        wait_id=2575235501 seq_num=6561 snap_id=2828
        wait times: snap=0.002836 sec, exc=0.491236 sec, total=12 min 27 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     5: waited for 'latch: shared pool'
        address=0xc00000006fee4438, number=0x133, tries=0x0
        wait_id=2575238328 seq_num=6560 snap_id=1
        wait times: snap=0.073927 sec, exc=0.073927 sec, total=0.073927 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     6: waited for 'row cache cleanup'
        =0x0, =0x0, =0x0
        wait_id=2575235501 seq_num=6559 snap_id=2827
        wait times: snap=0.000105 sec, exc=0.488400 sec, total=12 min 27 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     7: waited for 'latch: shared pool'
        address=0xc00000006fee4438, number=0x133, tries=0x0
        wait_id=2575238327 seq_num=6558 snap_id=1
        wait times: snap=0.029721 sec, exc=0.029721 sec, total=0.029721 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     8: waited for 'row cache cleanup'
        =0x0, =0x0, =0x0
        wait_id=2575235501 seq_num=6557 snap_id=2826
        wait times: snap=0.000074 sec, exc=0.488295 sec, total=12 min 27 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time
     9: waited for 'latch: row cache objects'
        address=0xc00000158216b238, number=0x118, tries=0x0
        wait_id=2575238326 seq_num=6556 snap_id=1
        wait times: snap=7.106811 sec, exc=7.106811 sec, total=7.106811 sec
        wait times: max=infinite
        wait counts: calls=0 os=0
        occurred after 0.000000 sec of elapsed time

*** 2017-10-27 07:03:44.749
LCK0 (ospid: 10708) has not moved for 35 sec (1509059024.1509058989)
--
*** 2017-10-27 07:04:04.789
LCK0 (ospid: 10708) has not moved for 55 sec (1509059044.1509058989)
--
*** 2017-10-27 07:04:24.829
LCK0 (ospid: 10708) has not moved for 74 sec (1509059063.1509058989)
--
*** 2017-10-27 07:04:44.869
LCK0 (ospid: 10708) has not moved for 95 sec (1509059084.1509058989)
--
*** 2017-10-27 07:05:04.909
LCK0 (ospid: 10708) has not moved for 115 sec (1509059104.1509058989)
--
...
--
*** 2017-10-27 07:06:45.110
LCK0 (ospid: 10708) has not moved for 215 sec (1509059204.1509058989)
--
*** 2017-10-27 07:07:05.150
LCK0 (ospid: 10708) has not moved for 235 sec (1509059224.1509058989)
--
*** 2017-10-27 07:07:25.190
LCK0 (ospid: 10708) has not moved for 255 sec (1509059244.1509058989)
--
*** 2017-10-27 07:22:52.475
LCK0 (ospid: 10708) has not moved for 39 sec (1509060172.1509060133)
--
*** 2017-10-27 07:26:14.056
LCK0 (ospid: 10708) has not moved for 25 sec (1509060373.1509060348)
--
*** 2017-10-27 07:26:34.096
LCK0 (ospid: 10708) has not moved for 45 sec (1509060393.1509060348)
--
*** 2017-10-27 07:26:54.167
...
...
...
*** 2017-10-27 08:07:20.780
LCK0 (ospid: 10708) has not moved for 1689 sec (1509062840.1509061151)
--
*** 2017-10-27 08:07:40.820
LCK0 (ospid: 10708) has not moved for 1709 sec (1509062860.1509061151)

Note:
期间连续累积的等待时间(已截取了部分输出)可以看出其间lck0进程出现了2次 not moved 超过100秒的时间段,和数据库负载高峰对应,最后一次是从7:26到8:07之间半小时左右lck0 都未活动。

LCK: This lock process manages requests that are not cache-fusion requests. Requests like row cache requests and library cache requests. Only a single LCK process is allowed for each instance.
PMON is waiting for an LCK process to cleanup the lock context after a foreground process died while doing a global cache lock operation.
The shared pool is stressed and memory need to be freed for the new cursors. As a consequence, the dictionary cache is reduced in size by the LCK process causing a temporal hang of the instance since the LCK can’t do other activity during that time. Since the dictionary cache is a memory area protected clusterwide in RAC, the LCK is responsible to free it in collaboration with the dictionary cache users (the sessions using cursors referenced in the dictionary cache). This process can be time consuming when the dictionary cache is big.

当shared pool里的对象需要为新的对象释放空间时如sql cursor, LCK进程降低Row Cache 大小期间使数据库临时hang, 因为在RAC环境中LCK进程负责释放持有row cache的用户进程协调工作及Library cache 的请求, 如果LCK出现性能问题也就会导致library cache object无法请求和会话补KILL后的释放row cache堵塞。Pmon 进程也需要等待LCK进程清理died进程后才能清理锁上下文。这个是为什么大量的等待library cache lock, library cache object load的事件和kill 会话无法使问题得到缓解的原因。 甚至因为中间件的session 重连机制,在kill 会话后大批量的新会话请求 也会加剧shared pool 占用的原因,在原资源未释放,新资源再大量请求时, shared pool内存管理出现问题 就更容易出现ora-4031。而且如要row cache越大后期LCK进程释放资源时就需要消耗更长的时间(每个instance 只有一个lck进程)。

该类现象相关的BUG:
Bug 8215444 Many processes hang in RAC with wait event “DFS lock handle”
Bug 16529874 Frequent “LCK(ospid: ****) has not moved for 20 sec” trace messages
BUG:8666117 – LCK0 PROCESS STUCK AT WAITING FOR “LATCH: ROW CACHE OBJECTS”
Bug 18199537 – RAC database becomes almost hung when large amount of row cache are used in shared pool
Bug 13814739 Excessive KQR X PO” allocations in a RAC environment (can cause ORA-4031)

只有18199537和当前的版本符合,同时也存在16529874提到的现象,开SR后也同样确认是该BUG。在Oracle最佳实践中也建议安装该Oneoff patch。

解决方案:
1,安装 one off patch 18199537 or upgrade 11.2.0.4.4 or above
Event 18199537 can be set to level 10 or more to reduce contention during row cache cleanup if it is not resolved with default value 1.
Event can also be set to 100 or 200 if there are too many concurrent sessions involved.

SQL> alter system set events ‘18199537 trace name context forever, level 100’;

2,同时建议修复16529874

— update 20240507 —
前台进程出现大量latch: row cache objects, LCK1进程出row cache object free或row cache cleanup等待132s, 系统hang了2分钟后自动恢复, 因为没有做systemstate dump和hanganalyze, 版本是12.1.0.2也在这个bug范围内。

 

RAC Database Hang Due To Row Cache Objects Latch Contention While LCK0 Waiting For Row Cache Cleanup (Doc ID 2491737.1)

LCK0:

ksedsts <- ksdxfstk <- ksdxcb <- sspuser <- __sighndlr
<- call_user_handler <- sigacthandler <- _syscall6 <- sskgpwwait <- skgpwwait
<- kslges <- kghfrunp <- kghfnd <- kghalo <- kghgex
<- kghfnd <- kghalo <- kghgex <- kghfnd <- kghprmalo
<- kghalp <- parchk <- pkaidl <- pkamh2 <- pdybF00_Init
<- pdy1F01_Driver <- pdw0F82_Run_Code_Gen <- pdw0F01_Code_Gen <- phpcmp <- pcicogForeground:
ksedsts <- ksdxfstk <- ksdxcb <- sspuser <- __sighndlr
<- call_user_handler <- sigacthandler <- _syscall6 <- sskgpwwait <- skgpwwait
<- kslges <- kqrbip <- kqrbfr <- kqrbgl <- ksbabs
<- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real
<- ssthrdmain <- main 
 
打赏

, , ,

对不起,这篇文章暂时关闭评论。