Troubleshooting failed to generate AWR snapshots at WRH$_SERVICE_STAT(gv$service_stats many ‘–UNKNOWN–‘)
还是上次那个大量的expdp JOB数据库环境Oracle 11.2.0.4 4-nodes RAC, 查看库的AWR时发现已经很久没有生成snapshot, 分析发现AWR内部在收集gv$service_stats(X$KEWSSVCV)环节超时,因为X$KEWSSVCV 有百万行”–UNKNOWN–” service name的统计信息记录,虽然这个问题是因为datapump 的”BUG”((non-existent on MOS) )导致, 但这算是一个数据库长期不重启带来的“不好的影响”。
# db alert log
Thu May 07 22:09:20 2020
Suspending MMON slave action kewrmafsa_ for 82800 seconds
# MMON trace
*** KEWROCISTMTEXEC - encountered error: (ORA-12751: cpu time or run time policy violation) *** SQLSTR: total-len=295, dump-len=240, STR={insert into WRH$_SERVICE_STAT (snap_id, dbid, instance_number, service_name_hash, stat_id, value) select :snap_id, :dbid, :instance_number, stat.service_name_hash, stat.stat_id, stat.value from v$active_services asvc, v$service_st} DDE rules only execution for: ORA 12751 ----- START Event Driven Actions Dump ---- ---- END Event Driven Actions Dump ---- ----- START DDE Actions Dump ----- Executing SYNC actions Executing ASYNC actions ----- START DDE Action: 'ORA_12751_DUMP' (Sync) ----- CPU time exceeded 300 seconds Time limit violation detected at: ksedsts()+465<-kspol_12751_dump()+145<-dbgdaExecuteAction()+1065<-dbgerRunAction()+109<-dbgerRunActions()+4134<-dbgexPhaseII()+1873<-dbgexProcessError()+2680<-dbgeExecuteForError()+88<-dbgePostErrorKGE()+2136<-dbkePostKGE_kgsf()+71<-kgeade()+351<-kgerelv()+140 <-kgerev()+34<-kserec1()+170<-OCIKSEC()+189<-kewrose_oci_stmt_exec()+292<-kewrgwxf1_gwrsql_exft_1()+317<-kewrgwxf_gwrsql_exft()+496<-kewrews_execute_wr_sql()+52<-kewrftbs_flush_table_by_sql()+180<-kewrft_flush_table()+264<-kewrftec_flush_table_ehdlcx()+766 <-kewrfsvc_flush_svcstat()+45<-kewrft_flush_table()+397<-kewrftec_flush_table_ehdlcx()+766<-kewrfat_flush_all_tables()+898<-kewrfos_flush_onesnap()+170<-kewrfsc_flush_snapshot_c()+644<-kewrafs_auto_flush_slave()+769<-kebm_slave_main()+586<-ksvrdp()+1766<-opirip()+674 <-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+230 Current Wait Stack: Not in wait; last wait ended 5 min 5 sec ago There are 3 sessions blocked by this session.
Note:
Taking an AWR snapshot executes many SQL statements to generate a new snapshot and has a built-in timeout mechanism, which causes timeouts between 300 and 900 seconds by default. As a result, if snapshot generation stalls, “Time limit violation” or “Runtime exceeded” messages and associated ORA-12751 errors are raised.
如果注意手动执行dbms_workload_repository.create_snapshot,也会因为内部的超时限制而被kill,终止当前操作。
查看SQL的执行计划
SQL_ID 7g732rx16j8jc, child number 3
-------------------------------------
insert into WRH$_SERVICE_STAT (snap_id, dbid, instance_number,
service_name_hash, stat_id, value) select :snap_id, :dbid,
:instance_number, stat.service_name_hash, stat.stat_id, stat.value
from v$active_services asvc, v$service_stats stat where
asvc.name_hash = stat.service_name_hash
Plan hash value: 2550932738
---------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | OMem | 1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | | 0 |00:00:00.01 | | | |
| 1 | LOAD TABLE CONVENTIONAL | | 1 | | 0 |00:00:00.01 | | | |
| 2 | NESTED LOOPS | | 1 | 1 | 174 |00:07:25.64 | | | |
| 3 | MERGE JOIN CARTESIAN | | 1 | 1 | 175 |00:00:00.01 | | | |
|* 4 | FIXED TABLE FULL | X$KSWSASTAB | 1 | 1 | 7 |00:00:00.01 | | | |
| 5 | BUFFER SORT | | 7 | 1 | 175 |00:00:00.01 | 73728 | 73728 | |
|* 6 | FIXED TABLE FULL | X$KEWSSMAP | 1 | 1 | 28 |00:00:00.01 | | | |
|* 7 | FIXED TABLE FIXED INDEX| X$KEWSSVCV (ind:2) | 175 | 1 | 174 |00:07:42.11 | | | |
---------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter(("KSWSASTABACT"=1 AND "INST_ID"=USERENV('INSTANCE')))
6 - filter("M"."AGGID"=3)
7 - filter(("S"."INST_ID"=USERENV('INSTANCE') AND "KSWSASTABNMH"="S"."SVCHSH" AND "S"."KEWSOFF"="M"."OFFST"))
Note:
看的出执行计划的7# 很少的数据在数据在和X$KEWSSVCV 做NLOOP时花费了7分钟多的时间。且估算的3个X$ FIXED TABLE也都很少的数据。
查看统计信息
SQL> select * from DBA_TAB_STATISTICS where table_name in ('X$KSWSASTAB','X$KEWSSMAP','X$KEWSSVCV'); OWNER TABLE_NAME OBJECT_TYPE NUM_ROWS BLOCKS EMPTY_BLOCKS AVG_SPACE LAST_ANALYZED GLO USE STATT STA -------- -------------- ------------ ---------- ---------- ------------ ---------- ----------------- --- --- ----- --- SYS X$KSWSASTAB FIXED TABLE SYS X$KEWSSMAP FIXED TABLE SYS X$KEWSSVCV FIXED TABLE
收集统计信息
SQL> select count(*) from X$KSWSASTAB;
COUNT(*)
----------
8
SQL> select count(*) from X$KEWSSMAP;
COUNT(*)
----------
265
SQL> select count(*) from X$KEWSSVCV;
COUNT(*)
----------
2576336
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('SYS', 'X$KSWSASTAB', no_invalidate=>false);
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('SYS', 'X$KEWSSMAP', no_invalidate=>false);
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('SYS', 'X$KEWSSVCV', no_invalidate=>false);
SQL> @sqlid 7g732rx16j8jc %
Show SQL text, child cursors and execution stats for SQLID 7g732rx16j8jc child %
HASH_VALUE PLAN_HASH_VALUE CH# SQL_TEXT
---------- --------------- ---- ------------------------------------------------------------------------------------------------------------------------------------------------------
1114153516 4168499536 0 insert into WRH$_SERVICE_STAT (snap_id, dbid, instance_number, service_name_hash, stat_id, value) select :snap_id, :dbid, :instance_number,
stat.service_name_hash, stat.stat_id, stat.value from v$active_services asvc, v$service_stats stat where asvc.name_hash = stat.service_name_hash
1114153516 4168499536 2 insert into WRH$_SERVICE_STAT (snap_id, dbid, instance_number, service_name_hash, stat_id, value) select :snap_id, :dbid, :instance_number,
stat.service_name_hash, stat.stat_id, stat.value from v$active_services asvc, v$service_stats stat where asvc.name_hash = stat.service_name_hash
CH# PARENT_HANDLE OBJECT_HANDLE PLAN_HASH PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED ROWS_PER_FETCH CPU_SEC CPU_SEC_EXEC ELA_SEC LIOS PIOS SORTS USERS_EXECUTING
---- ---------------- ---------------- ---------- ---------- ---------- ---------- ---------- -------------- -------------- ---------- ------------ ---------- ---------- ---------- ---------- ---------------
0 00000000BE2A17D0 0000000101E08018 4168499536 22 8 22 0 4256 .272018 .012364455 .630342 1040 138 0 0
2 00000000BE2A17D0 00000000BD655CC0 4168499536 1 2 1 0 224 .016001 .016001 .056783 77 10 0 0
SQL_ID 7g732rx16j8jc, child number 2
-------------------------------------
insert into WRH$_SERVICE_STAT (snap_id, dbid, instance_number,
service_name_hash, stat_id, value) select :snap_id, :dbid,
:instance_number, stat.service_name_hash, stat.stat_id, stat.value
from v$active_services asvc, v$service_stats stat where
asvc.name_hash = stat.service_name_hash
Plan hash value: 4168499536
---------------------------------------------------------------------------------------------
| Id | Operation | Name | E-Rows | OMem | 1Mem | Used-Mem |
---------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | | | | |
| 1 | LOAD TABLE CONVENTIONAL | | | | | |
|* 2 | HASH JOIN | | 166 | 965K| 965K| 1081K (0)|
| 3 | NESTED LOOPS | | 212 | | | |
|* 4 | FIXED TABLE FULL | X$KSWSASTAB | 8 | | | |
|* 5 | FIXED TABLE FIXED INDEX| X$KEWSSVCV (ind:1) | 28 | | | |
|* 6 | FIXED TABLE FULL | X$KEWSSMAP | 28 | | | |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("S"."KEWSOFF"="M"."OFFST")
4 - filter(("KSWSASTABACT"=1 AND "INST_ID"=USERENV('INSTANCE')))
5 - filter(("S"."INST_ID"=USERENV('INSTANCE') AND "KSWSASTABNMH"="S"."SVCHSH"))
6 - filter("M"."AGGID"=3)
Note:
收集统计信息后,SQL执行非常快执行计划改变。
查看FIXED TABLE INDEX
SQL> select * from v$indexed_fixed_column where table_name='X$KEWSSVCV' ; TABLE_NAME INDEX_NUMBER COLUMN_NAME COLUMN_POSITION ------------------------------ ------------ ------------------------------ --------------- X$KEWSSVCV 2 KEWSOFF 0 X$KEWSSVCV 1 SVCHSH 0
Note:
Fixed index在dba_indexes中查询不到,它是一种内存中的C结构对象,索引名是以编号形式存在,如ind:2 , ind:1 .., 收集统计信息前访问X$KEWSSVCV使用的是2号索引,在KEWSOFF列上的索引,该列是大量重复的序列,选择率并不好, 而收集统计信息后使用了1号索引(svchsh列),该列像是service hash值,唯一性特别好。使用1号索引查看SQL的平均执行时间在毫秒级。
检查MMON 是否挂起
SQL> oradebug unit_test kebm_dmp_slv_attrs kewrmafsa_ Status: 3 Flags: 0 Runtime limit: 900 CPU time limit: 300 Violations: 28 Suspended until: 1589036765 <<<---- not zero indicates that MMON is still suspended. SQL> EXECUTE DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT(); PL/SQL procedure successfully completed. Note: 这时再次手动生成AWR已经可以,耗时可以接受,下面清除MMON的suspended状态 SQL> oradebug unit_test kebm_set_slv_attrs kewrmafsa_ retain retain retain retain 0 0 Modified attributes of kewrmafsa_ (slave id 13) SQL> oradebug unit_test kebm_dmp_slv_attrs kewrmafsa_ Status: 3 Flags: 0 Runtime limit: 900 CPU time limit: 300 Violations: 0 Suspended until: 0
至此问题已经解决,AWR已经可以自动生成,再看看gv$service_stats(X$KEWSSVCV)那么大是什么原因。 Automatic Workload Repository (kew) Statistics SerViCe V ? (I guess)
SQL> select inst_id,service_name,count(1) from gv$service_stats
2 group by inst_id,service_name
3 order by 1,2,count(1) desc;
INST_ID SERVICE_NAME COUNT(1)
---------- ---------------------------------------------------------------- ----------
1 --UNKNOWN-- 2576100
1 SYS$BACKGROUND 28
1 SYS$USERS 28
1 SYS.KUPC$C_1_20200413145208 27
1 SYS.KUPC$C_1_20200413152721 28
1 SYS.KUPC$C_1_20200509102215 7
1 SYS.KUPC$S_1_20200413145208 28
1 SYS.KUPC$S_1_20200413152721 27
1 SYS.KUPC$S_1_20200509102215 8
1 cdr.hebei.mobile.com 28
1 cdrXDB 27
2 --UNKNOWN-- 56
2 SYS$BACKGROUND 28
2 SYS$USERS 28
2 cdr.hebei.mobile.com 28
2 cdrXDB 28
3 --UNKNOWN-- 56
3 SYS$BACKGROUND 28
3 SYS$USERS 28
3 cdr.hebei.mobile.com 28
3 cdrXDB 28
4 --UNKNOWN-- 112
4 SYS$BACKGROUND 28
4 SYS$USERS 28
4 cdr.hebei.mobile.com 28
4 cdrXDB 28
这是个4节点的RAC, 因为EXPDP导出任务都在节点1, 所以X$KEWSSVCV 内存表记录在节点1非常明显, X$KEWSSVCV中的”–UNKNOWN–” 服务的问题(没有记录在MOS中的BUG)是由expdp每次执行结束后插入,测试该问题在19c中都一直存在, 并且X$ FIXED TABLE只允许查询不允许任何DML,即使是SYSDBA也不可以,所以如果上面收集了统计信息后问题依旧存在(即使使用了正常的索引)或SQL性能慢的不可接受,建议重启实例释放X$ fixed table.
对不起,这篇文章暂时关闭评论。