Performance Tunning: cursor: mutex X & cursor: mutex S high wait after 12.2.0.1
朋友那一套12C R2多组户架构3节点RAC数据库反应数据库响应缓慢,找到我们帮忙分析,因远程不方便,只拿到事后一份AWR,数据库出现较高的cursor: mutex X 和cursor: mutex S, 12.2目前的资料较少,这个问题个人感觉BUG的可能性大,但MOS中未发现匹配的NOTE.这里仅记录一下。
DB Name DB Id Unique Name Role Edition Release RAC CDB ANBOB 3056822707 ANBOB PRIMARY EE 12.2.0.1.0 YES YES Snap Id Snap Time Sessions Cursors/Session Instances Pluggable Databases Open Begin Snap: 3264 03-Aug-17 11:00:16 230 1.9 3 4 End Snap: 3265 03-Aug-17 12:00:04 1034 2.1 3 4 Elapsed: 59.79 (mins) DB Time: 57,548.34 (mins) Per Second Per Transaction Per Exec Per Call DB Time(s): 962.4 248.5 9.51 9.73 DB CPU(s): 5.1 1.3 0.05 0.05 Background CPU(s): 0.6 0.1 0.01 0.00 Redo size (bytes): 190,081.8 49,081.9 Logical read (blocks): 369,483.3 95,406.1 Block changes: 1,458.0 376.5 Physical read (blocks): 3,537.2 913.4 Physical write (blocks): 1,765.4 455.8 Read IO requests: 126.6 32.7 Write IO requests: 89.7 23.2 Read IO (MB): 27.6 7.1 Write IO (MB): 13.8 3.6 IM scan rows: 0.0 0.0 Session Logical Read IM: 0.0 0.0 Global Cache blocks received: 3,082.4 795.9 Global Cache blocks served: 2,789.8 720.4 User calls: 98.9 25.5 Parses (SQL): 53.2 13.7 Hard parses (SQL): 10.3 2.7 SQL Work Area (MB): 11.0 2.9 Logons: 1.7 0.4 Executes (SQL): 101.2 26.1 Rollbacks: 0.1 0.0 Transactions: 3.9
Event | Waits | Total Wait Time (sec) | Avg Wait | % DB time | Wait Class |
cursor: mutex X | 20,558,713 | 3.3M | 159.05ms | 94.7 | Concurrency |
cursor: mutex S | 14,188,123 | 85.5K | 6.03ms | 2.5 | Concurrency |
enq: TX – row lock contention | 218 | 29.9K | 137.16 s | 0.9 | Application |
DB CPU | 18.4K | 0.5 | |||
gc buffer busy acquire | 4,853,464 | 3703.3 | 763.02us | 0.1 | Cluster |
gc cr block busy | 2,036,172 | 2445.8 | 1.20ms | 0.1 | Cluster |
gc cr block 2-way | 4,170,540 | 1718.3 | 412.01us | 0 | Cluster |
inactive session | 1,379 | 1379.8 | 1000.60ms | 0 | Other |
gc cr block 3-way | 2,071,664 | 1102.1 | 531.98us | 0 | Cluster |
gc cr multi block mixed | 551,888 | 554.5 | 1.00ms | 0 | Cluster |
time model
Statistic Name | Time (s) | % of DB Time | % of Total CPU Time |
parse time elapsed | 3,401,559.92 | 98.51 | |
failed parse elapsed time | 2,578,581.09 | 74.68 | |
sql execute elapsed time | 49,749.93 | 1.44 | |
DB CPU | 18,373.74 | 0.53 | 90.26 |
connection management call elapsed time | 393.19 | 0.01 | |
hard parse elapsed time | 233.07 | 0.01 | |
hard parse (sharing criteria) elapsed time | 75.11 | 0 | |
PL/SQL compilation elapsed time | 11.99 | 0 | |
sequence load elapsed time | 4.25 | 0 |
SQL ordered by Version Count
- Only Statements with Version Count greater than 20 are displayed
Version Count | Executions | SQL Id | SQL Module | PDB Name | SQL Text |
---|---|---|---|---|---|
7,873 | 20 | f0h5rpzmhju11 | QZDB3PDB | select SYS_CONTEXT(‘USERENV’, … | |
3,225 | awxuv71j1wzbh | XXHCPDB | select distinct s.sowner, s.vn… | ||
2,666 | f0h5rpzmhju11 | XXHCPDB | select SYS_CONTEXT(‘USERENV’, … | ||
2,187 | gtuz08rcbk69y | XXHCPDB | select distinct s.sowner, s.vn… | ||
1,466 | 7n9pz49qw4f39 | XXHCPDB | select decode(upper(failover_m… | ||
1,038 | 9zg9qd9bm4spu | XXHCPDB | update user$ set spare6=DECODE… | ||
686 | 6nas5twtqzkk0 | CDB$ROOT | select /*+ no_monitor no_state… | ||
207 | 25 | 121ffmrc95v7g | CDB$ROOT | select i.obj#, i.ts#, i.file#,… | |
93 | 6 | a1xgxtssv5rrp | sqlplus@node1 (TNS V1-V3) | QZDB3PDB | select sum(used_blocks), ts.ts… |
93 | 6 | a1xgxtssv5rrp | sqlplus@node1 (TNS V1-V3) | XXHCPDB | select sum(used_blocks), ts.ts… |
93 | 6 | a1xgxtssv5rrp | sqlplus@node1 (TNS V1-V3) | AJZPDB | select sum(used_blocks), ts.ts… |
Mutex Sleep Summary
ordered by number of sleeps desc
Mutex Type | Location | Sleeps | Wait Time (ms) |
Cursor Parent | kkscsAddChildNode [KKSPRTLOC28] | 334,812,274 | 2,336,374,941 |
Cursor Parent | kkscsPruneChild [KKSPRTLOC27] | 81,186,150 | 30,246,997 |
Library Cache | kgllkc1 57 | 722,195 | 110,128 |
Library Cache | kgllkdl1 85 | 211,650 | 31,718 |
Cursor Pin | kkslce [KKSCHLPIN2] | 75,333 | 63,530 |
Cursor Parent | kksheqd [KKSPRTLOC12] | 1,966 | 12,282 |
Cursor Parent | kksfbc [KKSPRTLOC3] | 1,214 | 5,765 |
Library Cache | kglReleaseHandleReference 124 | 1,114 | 148 |
f0h5rpzmhju11 select SYS_CONTEXT('USERENV', 'SERVER_HOST'), SYS_CONTEXT('USERENV', 'DB_UNIQUE_NAME'), SYS_CONTEXT('USERENV', 'INSTANCE_NAME'), SYS_CONTEXT('USERENV', 'INSTANCE'), STARTUP_TIME, SYS_CONTEXT('USERENV', 'DB_DOMAIN'), SYS_CONTEXT('USERENV', 'SERVICE_NAME') from v$instance awxuv71j1wzbh select distinct s.sowner, s.vname from sys.snap$ s, sys.snap_reftime$ r where r.masobj# in (73811) and s.sowner = r.sowner and s.vname = r.vname and bitand(s.flag3, 67108864) = 67108864 gtuz08rcbk69y select distinct s.sowner, s.vname from sys.snap$ s, sys.snap_reftime$ r where r.masobj# in (73811) and s.sowner = r.sowner and s.vname = r.vname and s.mowner = r.mowner and s.master = r.master and bitand(s.flag3, 67108864) = 67108864
note:
从AWR可以看出数据库的解析出现了严重的问题,cursor: mutex X占了94%的DB TIME同时cursor: mutex S伴随出现。同时mutex 函数kkscsAddChildNode [KKSPRTLOC28]和kkscsPruneChild [KKSPRTLOC27] 增加和裁剪子游标时产生了问题,数据库存在high version SQL, version高于200的都是数据库的递归SQL,最大的达到了7000多。同时ASH 中显示也是该上面的3个SQL.
关于MUTEX
mutex很像latch,mutex 是一种低级的粒度串行机制控制访问SGA中的共享数据结构, 但是相比mutex更小更轻量并且更快, get时少几倍的指令调用, 在10g r2 版本引入,属于KGX (Kernel Generic Mutex Module). 每个结构对象都有自己的mutex保护,父游标和子游标都有各自的mutex; 每个结构又被多个的mutex保护,在11G R2中有超过60种原因会导致不能重用已存在的child cursor,hard parse产生新的child cursor;soft parse选择已存在的child cursor,查找已存在的child cursor时,持有的是cursor: mutex S, 而增加新的Child Cursor持有的是cursor: mutex X.
分析此类问题需要收集的信息:
• AWR Report
• ASH Report
• Hanganalyze trace
• Systemstate dump
What are Wait Events (on Mutexes)?
•Somebody requests a Mutex
•Can not get it by spinning
•And thus, sleeps
•Sleeping is recorded as wait time
•Spinning is not recorded as wait-, but as CPU time
Cursor: Mutex X情景:
Wants: Exclusive mode on Mutex of Parent / Child
• Building a new cursor under a parent
• Although this operation is cheaper, building many cursors under a parent cursor is not recommended.
• Capture SQL bind data(PEEK)
• Build or Update statistics blocks,Modify cursor-related statistics
• Mutex is in the parent cursor.
Cursor: mutex S情景:
Wants: Shared mode on Mutex of Parent / Child
• Change the reference count (“in flux”)= “new guy is interested / spinning”
• Parent examination
• When finding a cursor to execute, the parent must be examined. The examination of the parent is performed using the mutex, cursor: mutex S.
• When the parent cursor has many child cursors involved, this waits will come as the server process has to traverse the entire list of child cursors under the parent to find a match.
• Mutex is in the parent cursor.
Complex – cursor: mutex S/X情景:
•Root-caused by invalidated child cursor(s) => Too many cursor objects in Library Cache
Diagnosis:
•Oracle Wait Interface
•10046 Level 12 session trace (=sql_trace event) 根据bind oacdty检查绑定变量的类型是否存在bind mismatch
•v$sql_shared_cursor plus cursor trace [296377.1]
SQL不能共享的原因:
select * from (select sql_id, nonshared_reason, count(*) from v$sql_shared_cursor unpivot (nonshared_value for nonshared_reason in ( UNBOUND_CURSOR as 'UNBOUND_CURSOR', SQL_TYPE_MISMATCH as 'SQL_TYPE_MISMATCH', OPTIMIZER_MISMATCH as 'OPTIMIZER_MISMATCH', OUTLINE_MISMATCH as 'OUTLINE_MISMATCH', STATS_ROW_MISMATCH as 'STATS_ROW_MISMATCH', LITERAL_MISMATCH as 'LITERAL_MISMATCH', FORCE_HARD_PARSE as 'FORCE_HARD_PARSE', EXPLAIN_PLAN_CURSOR as 'EXPLAIN_PLAN_CURSOR', BUFFERED_DML_MISMATCH as 'BUFFERED_DML_MISMATCH', PDML_ENV_MISMATCH as 'PDML_ENV_MISMATCH', INST_DRTLD_MISMATCH as 'INST_DRTLD_MISMATCH', SLAVE_QC_MISMATCH as 'SLAVE_QC_MISMATCH', TYPECHECK_MISMATCH as 'TYPECHECK_MISMATCH', AUTH_CHECK_MISMATCH as 'AUTH_CHECK_MISMATCH', BIND_MISMATCH as 'BIND_MISMATCH', DESCRIBE_MISMATCH as 'DESCRIBE_MISMATCH', LANGUAGE_MISMATCH as 'LANGUAGE_MISMATCH', TRANSLATION_MISMATCH as 'TRANSLATION_MISMATCH', BIND_EQUIV_FAILURE as 'BIND_EQUIV_FAILURE', INSUFF_PRIVS as 'INSUFF_PRIVS', INSUFF_PRIVS_REM as 'INSUFF_PRIVS_REM', REMOTE_TRANS_MISMATCH as 'REMOTE_TRANS_MISMATCH', LOGMINER_SESSION_MISMATCH as 'LOGMINER_SESSION_MISMATCH', INCOMP_LTRL_MISMATCH as 'INCOMP_LTRL_MISMATCH', OVERLAP_TIME_MISMATCH as 'OVERLAP_TIME_MISMATCH', EDITION_MISMATCH as 'EDITION_MISMATCH', MV_QUERY_GEN_MISMATCH as 'MV_QUERY_GEN_MISMATCH', USER_BIND_PEEK_MISMATCH as 'USER_BIND_PEEK_MISMATCH', TYPCHK_DEP_MISMATCH as 'TYPCHK_DEP_MISMATCH', NO_TRIGGER_MISMATCH as 'NO_TRIGGER_MISMATCH', FLASHBACK_CURSOR as 'FLASHBACK_CURSOR', ANYDATA_TRANSFORMATION as 'ANYDATA_TRANSFORMATION', PDDL_ENV_MISMATCH as 'PDDL_ENV_MISMATCH', TOP_LEVEL_RPI_CURSOR as 'TOP_LEVEL_RPI_CURSOR', DIFFERENT_LONG_LENGTH as 'DIFFERENT_LONG_LENGTH', LOGICAL_STANDBY_APPLY as 'LOGICAL_STANDBY_APPLY', DIFF_CALL_DURN as 'DIFF_CALL_DURN', BIND_UACS_DIFF as 'BIND_UACS_DIFF', PLSQL_CMP_SWITCHS_DIFF as 'PLSQL_CMP_SWITCHS_DIFF', CURSOR_PARTS_MISMATCH as 'CURSOR_PARTS_MISMATCH', STB_OBJECT_MISMATCH as 'STB_OBJECT_MISMATCH', CROSSEDITION_TRIGGER_MISMATCH as 'CROSSEDITION_TRIGGER_MISMATCH', PQ_SLAVE_MISMATCH as 'PQ_SLAVE_MISMATCH', TOP_LEVEL_DDL_MISMATCH as 'TOP_LEVEL_DDL_MISMATCH', MULTI_PX_MISMATCH as 'MULTI_PX_MISMATCH', BIND_PEEKED_PQ_MISMATCH as 'BIND_PEEKED_PQ_MISMATCH', MV_REWRITE_MISMATCH as 'MV_REWRITE_MISMATCH', ROLL_INVALID_MISMATCH as 'ROLL_INVALID_MISMATCH', OPTIMIZER_MODE_MISMATCH as 'OPTIMIZER_MODE_MISMATCH', PX_MISMATCH as 'PX_MISMATCH', MV_STALEOBJ_MISMATCH as 'MV_STALEOBJ_MISMATCH', FLASHBACK_TABLE_MISMATCH as 'FLASHBACK_TABLE_MISMATCH', LITREP_COMP_MISMATCH as 'LITREP_COMP_MISMATCH', PLSQL_DEBUG as 'PLSQL_DEBUG', LOAD_OPTIMIZER_STATS as 'LOAD_OPTIMIZER_STATS', ACL_MISMATCH as 'ACL_MISMATCH', FLASHBACK_ARCHIVE_MISMATCH as 'FLASHBACK_ARCHIVE_MISMATCH', LOCK_USER_SCHEMA_FAILED as 'LOCK_USER_SCHEMA_FAILED', REMOTE_MAPPING_MISMATCH as 'REMOTE_MAPPING_MISMATCH', LOAD_RUNTIME_HEAP_FAILED as 'LOAD_RUNTIME_HEAP_FAILED', HASH_MATCH_FAILED as 'HASH_MATCH_FAILED', PURGED_CURSOR as 'PURGED_CURSOR', BIND_LENGTH_UPGRADEABLE as 'BIND_LENGTH_UPGRADEABLE', USE_FEEDBACK_STATS as 'USE_FEEDBACK_STATS' )) where nonshared_value = 'Y' group by sql_id, nonshared_reason ) where sql_id = '&sqlid' ;
模拟一种 cursor: mutex S/X wait 情况:
•Generating 64 child cursors for one SQL_ID
•Accessing them 20x parallel
➔ Delay to create new children (X)
➔ Delay to select good child (S)
Similar Problem with CHAR binds
•Bind buffers are size 32, 128, 2000 or 4000 bytes
•Changing CHAR bind length invalidates
•Reason BIND_LENGTH_UPGRADEABLE= 4^n cursor versions
情况较复杂应该根据自己的实际情况分析,可能的一些解决方案如下:
• One Quick Fix for cursor: mutex S/X System is loaded with heavy mutex waits due to high number of cursors (=version count)
=> frequently flush this cursor with dbms_shared_pool.purge (look out for new parsing issues = CPU)
• One solution for cursor: mutex S/X Application uses jdbc setter methods now properly on INTEGER column (=2)
#•setNUMBER(2) => Bind Var. is NUMBER
#setNULL(2, java.sql.Types.INTEGER)
=> Bind Var. is NUMBER
•Check how the application does handle bind variables Avoid BIND MISMATCH at (nearly) any cost
•Reduce the number of cursor versions below 100 More will lead to overhead
•Look for matching Oracle bugs in your RDBMS release
•Upgrade to 11.2.0.3 or higher 11.2.0.2 is worst version for cursor issues IMHO
• Increase the SGA if the shared pool is too small and heavy hard parse occurs due to reloads.
• For 11g, make sure cursor_sharing is not similar, as it has been deprecated. This may also cause cursor: mutex S waits:Recommended to use FORCE.
• Increase SESSION_CACHED_CURSORS to higher value if there are too many parses and waits on cursor: pin S.
• Use CURSOR_SHARING to FORCE to avoid waits on cursor: mutex S & cursor: mutex X due to huge version counts.
• If possible run with _LIBRARY_CACHE_ADVICE = FALSE and see if that helps
• On HP platforms ensure HPUX_SCHED_NOAGE parameter is set to 178.
• Obsoleting parent cursors if VERSION_COUNT exceeds a threshold (cursor: mutex S).
• When the child cursors grow beyond certain count be it 20 or 100, it obsoletes the parent cursors. In order to activate this enhancement bug set following:
For 11.2.0.2, apply the 11.2.0.2.2 PSU 12431716 atleast and do the following.
• _CURSOR_FEATURES_ENABLED to 1026
• event 106001 with value 100 (as the parameter _CURSOR_OBSOLETE_THRESHOLD is not present)
• Set , event= 106001 trace name context forever, level 100 in init.ora
For 11.2.0.3 and above, do the following.
• _CURSOR_OBSOLETE_THRESHOLD to 100 (default. this is the number of child cursor after which we obsolete it)
在有些情况下是存在真的hot对象,可以上面的方法都无法解决,需要手动打散:
To identify if the issue is a hot mutex use the following SQL :
SQL> select KGLNAOBJ, KGLNAOWN, KGLHDNSP, KGLOBTYP, KGLNAHSV, KGLOBT23, KGLOBT24 from X$KGLOB where KGLOBT23 > 100000 or KGLOBT24 > 100000 order by KGLOBT24;
Note:
相隔几分钟运行上面的SQL,如果KGLOBT23 或KGLOBT24 短时间内暴涨就说明该对象是HOT OBJ.或者从ASH或v$session数据根据当时的p1值得到。
Mutex wait Parameter Description P1 Hash value of cursor P2 Mutex value (top 2 bytes contain SID holding mutex in exclusive mode, and bottom two bytes usually hold the value 0) P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps -- find hot obj select KGLNAOBJ, KGLNAOWN, KGLHDNSP, KGLOBTYP, KGLNAHSV from X$KGLOB where KGLNAHSH= {value of P1}; -- find blocker SELECT count (*), to_number(substr(to_char(rawtohex(p2raw)), 1, 8),'XXXXXXXX') blocking_sid FROM v$session WHERE event = 'library cache: mutex X‘ group by to_number(substr(to_char(rawtohex(p2raw)),1, 8), 'XXXXXXXX'); -- hot copy You can apply the hot copy fix from 11.2.0.2 onwards w/o anyinterim patches. • Set the following parameters in init.ora:- _KGL_HOT_OBJECT_COPIES={a} _KGL_DEBUG="name='{b}' schema='{c}' namespace={d} debug=33554432" or (where the object is a cursor) _KGL_DEBUG="hash='{e}' debug=33554432" NOTE: Only with Oracle Support Guidance Where: {a} - half the CPU count {b} - KGLNAOBJ from SQL {c} - KGLNAOWN from SQL {d} - KGLHDNSP from SQL {e} - KGLNAHSV from SQL. The same can be performed using dbms_shared_pool APIs in 11.2.0.2+. SQL> EXEC DBMS_SHARED_POOL.MARKHOT('username','object_name',KGLHDNSP); <or, for a cursor> SQL> EXEC DBMS_SHARED_POOL.MARKHOT('KGLNAHSV', 0);
NOTE: dbms_shared_pool.markhot只是实例周期的改变,重启后失效,需要每次重启后重复手动配置.
12.2 这个案例的临时解决方案
上面介绍了那么多,再回到这个12.2的案例, 这么高的mutex等待与跟high version脱离不了关系,简单咨询客户后确认了没有对12.2默认参数做调整,sql verison能达到7000多,我查询了12.2的隐藏参数_CURSOR_OBSOLETE_THRESHOLD原来现在默认已加大到了8192, 该参数在11G中也只是100,12C R1版本是也在1024. 并且high version的SQL都是递归内部SQL,无法调应用解决,于是让客户调整_CURSOR_OBSOLETE_THRESHOLD参数调到200,并重启实例生效。问题消失了,并且上面的那个high version的SQL从TOP中消失。
当然如果再深入查,至少要在当时做SSD或ash 中找到进程信息,或尝试禁用_optim_peek_user_binds, sql ACS\ECS这些特性,或者是个未知的BUG, 我相信未来肯定会有客户同样遇到,希望提供一个分析的方向。
SQL> select bugno,value,is_default,DESCRIPTION from V$SYSTEM_FIX_CONTROL where BUGNO=17443547; BUGNO VALUE IS_DEFAULT DESCRIPTION ---------- ---------- ---------- ---------------------------------------------------------------- 17443547 1 1 Adaptive Cursor Sharing for single bind constant expressions alter system set "_fix_control"='17443547:OFF'; DECLARE curstr VARCHAR2 (300); BEGIN FOR cur IN (SELECT address, hash_value FROM v$sqlarea WHERE version_count>500) LOOP curstr := 'begin sys.dbms_shared_pool.purge(:b1,''c'',65); end;'; -- DBMS_OUTPUT.put_line ('exec sys.dbms_shared_pool.purge('':b1,:b2'',''c'',65)'); EXECUTE IMMEDIATE (curstr) USING cur.address || ',' || cur.hash_value; END LOOP; END; /
— update 2018-5-2 —
相关bug
apply Patch 28889389 fixed in 20.1
bug 25054064 or 12.2 DBRU.
Bug 31211220 – High version count (cursor leaks) due to bind_equiv_failure (Doc ID 31211220.8)
Unpublished Bug 28889389 : HIGH WAITS ON CURSOR: MUTEX X AFTER UPGRADE TO 12.1.0.2
25054064TST&PERF:EBS: VERY HIGH VERSION COUNT AND CURSOR MUTEX WAITS AFTER PLUG INTO CDB
对不起,这篇文章暂时关闭评论。