Tuning: latch: cache buffers chains 又一案例
过年刚回来就忙了起来,各种问题最近一直在处理优化几个应用的SQL, 其中开班第一天,就有套库的负载突然高的有个应用差点把库拖垮,下面简单的记录一下案例分析过程。
因为过年原因把好多月初该做的周期计算的任务推在开班第一天,应用的也着急,因为按以前的经验较快出的结果数据库一直没有跑完,他们就不停的加进程认为多那样总得时间更快,殊不知数据库系统资源是有限的,起再多算不过来也只会回到队列进程使系统更慢,这是一套32c的数据库主机当我统计是发现已起了约700个进程都是ACTIVE的。 这些活动的会话进程wait event基本都是“latch: cache buffers chains” 和” read by other session”, “latch: cache buffers chains” 以前也记录过几篇笔记,当一个进程试取得cache buffer hash chain时的一种latch, 查找一个数据块的DBA当前是不是在Buffer Cache中,CBC latch竞争的原因很多,通常也可以理解为一种热块, 对于CBC 通常都是从session wait中找到child latch address 然后再去x$bh的hladdr字段找到相应的obj,1个BH handle可以会关连多个obj, 再参考TCH 列确认比较hot的对象。大多数CBC的OLTP系统多数应该注意一下sql 的执行计划中使用了NL join的方式; 对于read by other session 也是一种hot block事件,当一个进程想把data block读进BH时,另一个进程快开始做这事情了,但还没读完。
SQL> @asess USERNAME SID EVENT MACHINE PROGRAM STATUS LAST_CALL_ET HASH_VALUE SQL_ID WAI_SECINW BS SQLTEXT CH# SEQ# ROW_WAIT_OBJ# ---------- ---------- -------------------- ---------- -------------- -------- ------------ ---------- --------------- ---------- ---------- ------------------------------ ---- ---------- ------------- ANBOB_P1 907 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3758 3133220035 ac2a4daxc2963 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 49467 3462360 ANBOB_P1 3426 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3758 658665007 f9bu1ywmn4vjg 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 57994 3462360 ANBOB_P1 7961 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3758 455527273 bwpfag8dkdkv9 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 42487 3462360 ANBOB_P1 7107 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3771 3924654184 grq56avnyuy38 0:4 : SELECT SUBSID,VALTYPE,TARGET_ 0 51638 3462360 ANBOB_P1 1717 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3748 1602056967 1ub6u7tgruws7 457:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 5290 3462360 ANBOB_P1 36 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3750 183002219 dg676ns5fht3b 0:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 52221 3462360 ANBOB_P1 2297 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3760 2376432804 3fmp9fq6uaz54 0:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 58741 3462360 ANBOB_P1 6821 latch: cache buffers kyyz1 weejar_pro@kyyz ACTIVE 3888 1323001963 a6pf3r57dqu3b 0:5 : SELECT SUBSID,VALTYPE,TARGET_ 0 6526 3462360 ... 681 rows selected.
Note:
看到SQL id虽不同但是文本貌似相同,随后我们确认了应用未使用绑定变量, 只是SQL中的几个字面量不一样, 找一条SQL我们手动执行一下,查看执行计划和效率(确保手动执行计划和应用的相同)。
SQL> SELECT SUBSID, 2 VALTYPE, 3 TARGET_ID, 4 TARGET_VAL 5 FROM ANBOB_P1.CH_NGSETTLE_SUBS_TARGET PARTITION (PART_310_201601) B 6 WHERE EXISTS 7 (SELECT N.OID 8 FROM ANBOB_P1.RECEPTION PARTITION (PART_310_201601) N 9 WHERE B.SUBSID = N.OID AND N.RECORGID = 'xxxdf.xxx.sfdsfsdf') 10 AND CARRYINGTYPE = 'ceReception' 11 ORDER BY SUBSID; 59 rows selected. Elapsed: 00:04:44.50 Execution Plan ---------------------------------------------------------- Plan hash value: 1746076586 -------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | -------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 112 | 1 (0)| 00:00:01 | | | | 1 | NESTED LOOPS SEMI | | 1 | 112 | 1 (0)| 00:00:01 | | | | 2 | PARTITION RANGE SINGLE | | 1 | 81 | 0 (0)| 00:00:01 | 12 | 12 | |* 3 | TABLE ACCESS BY LOCAL INDEX ROWID| CH_NGSETTLE_SUBS_TARGET | 1 | 81 | 0 (0)| 00:00:01 | 12 | 12 | | 4 | INDEX FULL SCAN | IDX_NGSUBS_TARTGE_SUBSID | 1 | | 0 (0)| 00:00:01 | 12 | 12 | | 5 | PARTITION RANGE SINGLE | | 1 | 31 | 1 (0)| 00:00:01 | 18 | 18 | |* 6 | TABLE ACCESS BY LOCAL INDEX ROWID| RECEPTION | 1 | 31 | 1 (0)| 00:00:01 | 18 | 18 | |* 7 | INDEX SKIP SCAN | IDX_RECEPTION_OID | 1 | | 1 (0)| 00:00:01 | 18 | 18 | -------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - filter("CARRYINGTYPE"='ceReception') 6 - filter("N"."RECORGID"='xxxdf.xxx.sfdsfsdf') 7 - access("B"."SUBSID"="N"."OID") filter("B"."SUBSID"="N"."OID") Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 44058611 consistent gets 1933 physical reads 0 redo size 1870 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 59 rows processed
Note:
这就是一个简单的查询, 两个表使用了分区名的方式限制了一个数据周期,1个表使用了IFS而另一个使用了ISS 这样的访问路径,和使用了NL JOIN 的连接方式,这就是我们上面提到过的CBC很多情况与NL join相关。但是看cost非常小,猜测是统计信息不准,用时00:04:44.50, 而且之返回59行的记录,产生了44058611 一致读, 这个SQL有很大的优化空间。
1,不建议使用分区名的方试限制分区范围
比如对分区名过分依赖而不是分区条件,就像我曾经写的那篇”数据去哪了”, 分区名只是指定了表的裁剪区,强烈建议使用分区列做为where查询的谓词条件去过滤数据,因为那样让CBO更好的估算分区表裁剪和对应的分区索引裁剪扫描范围。
SELECT SUBSID, VALTYPE, TARGET_ID, TARGET_VAL FROM ANBOB_P1.CH_NGSETTLE_SUBS_TARGET B WHERE b.region = 310 AND b.CYCLE = 201601 AND EXISTS (SELECT N.OID FROM ANBOB_P1.RECEPTION N WHERE n.region = 310 AND n.CYCLE = 201601 AND B.SUBSID = N.OID AND N.RECORGID = 'xxxdf.xxx.sfdsfsdf') AND CARRYINGTYPE = 'ceReception' ORDER BY SUBSID; Elapsed: 00:01:45.62 Execution Plan ---------------------------------------------------------- Plan hash value: 1743099239 ------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | ------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 164 | 0 (0)| 00:00:01 | | | | 1 | NESTED LOOPS SEMI | | 1 | 164 | 0 (0)| 00:00:01 | | | | 2 | PARTITION RANGE SINGLE | | 1 | 107 | 0 (0)| 00:00:01 | 12 | 12 | | 3 | TABLE ACCESS BY LOCAL INDEX ROWID| CH_NGSETTLE_SUBS_TARGET | 1 | 107 | 0 (0)| 00:00:01 | 12 | 12 | |* 4 | INDEX RANGE SCAN | IDX_SUBS_TARGET1 | 1 | | 0 (0)| 00:00:01 | 12 | 12 | | 5 | PARTITION RANGE SINGLE | | 1 | 57 | 0 (0)| 00:00:01 | 18 | 18 | |* 6 | TABLE ACCESS BY LOCAL INDEX ROWID| RECEPTION | 1 | 57 | 0 (0)| 00:00:01 | 18 | 18 | |* 7 | INDEX RANGE SCAN | IDX_RECEPTION_OID | 1 | | 0 (0)| 00:00:01 | 18 | 18 | ------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 4 - access("B"."REGION"=310 AND "B"."CYCLE"=201601 AND "CARRYINGTYPE"='ceReception') 6 - filter("N"."RECORGID"='xxxdf.xxx.sfdsfsdf') 7 - access("N"."REGION"=310 AND "N"."CYCLE"=201601 AND "B"."SUBSID"="N"."OID") Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 1255979 consistent gets 2895 physical reads 0 redo size 1870 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 59 rows processed
Note:
注意我们把上面的SQL 把分区表改成了where 条件,用时立即从4m44s 降到1m:45s, 一致读从44058611降到了1255979,索引访问方式都变成了IRS且有一个表使用了新的索引(不是新建的,只是列更合适,关于这个列字段略),并且扫描的分区范围也是意料内的单分区,当然到这还没结束,这两个表都是较大的表,因为统计信息的错误,低估了表上的数据,才使用了NESTED LOOPS SEMI 这种连接方式,我们再把两个表的统计信息收集一下(可以只收那两个表分区及索引分区)。
2,统计信息收集 略, 再来看收集以后的执行计划。
SQL> SELECT SUBSID, VALTYPE, TARGET_ID, TARGET_VAL FROM ANBOB_P1.CH_NGSETTLE_SUBS_TARGET B WHERE b.region = 310 AND b.CYCLE = 201601 and EXISTS (SELECT N.OID FROM ANBOB_P1.RECEPTION N WHERE B.SUBSID = N.OID AND N.RECORGID = 'xxxdf.xxx.sfdsfsdf' and n.region = 310 AND n.CYCLE = 201601 ) AND CARRYINGTYPE = 'ceReception' ORDER BY SUBSID ; 59 rows selected. Elapsed: 00:00:04.12 Execution Plan ---------------------------------------------------------- Plan hash value: 3997962624 -------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | -------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 12154 | 902K| 417 (1)| 00:00:02 | | | | 1 | MERGE JOIN SEMI | | 12154 | 902K| 417 (1)| 00:00:02 | | | | 2 | PARTITION RANGE SINGLE | | 39M| 1558M| 0 (0)| 00:00:01 | 12 | 12 | | 3 | TABLE ACCESS BY LOCAL INDEX ROWID | CH_NGSETTLE_SUBS_TARGET | 39M| 1558M| 0 (0)| 00:00:01 | 12 | 12 | |* 4 | INDEX RANGE SCAN | IDX_SUBS_TARGET1 | 867K| | 0 (0)| 00:00:01 | 12 | 12 | |* 5 | SORT UNIQUE | | 1633 | 57155 | 417 (1)| 00:00:02 | | | | 6 | TABLE ACCESS BY GLOBAL INDEX ROWID| RECEPTION | 1633 | 57155 | 416 (1)| 00:00:02 | 18 | 18 | |* 7 | INDEX RANGE SCAN | IDX_RECEPTION_RECORGID | 7421 | | 24 (0)| 00:00:01 | | | -------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 4 - access("B"."REGION"=310 AND "B"."CYCLE"=201601 AND "CARRYINGTYPE"='ceReception') 5 - access("B"."SUBSID"="N"."OID") filter("B"."SUBSID"="N"."OID") 7 - access("N"."RECORGID"='xxxdf.xxx.sfdsfsdf' AND "N"."CYCLE"=201601 AND "N"."REGION"=310) Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 650583 consistent gets 0 physical reads 0 redo size 1870 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 59 rows processed
Note:
用时已从最开始的4m44s到上面的1m45s再到现在的仅4秒,一致读也从最初的4000多万降到了65万,”NESTED LOOPS SEMI”换成了”MERGE JOIN SEMI”,数据库恢复了昔日的平静。
Summary:
优化的SQL前提是良好的SQL书写格式,不要用分区名,不要用分区名,不要用分区名
对于分区的统计信息比如以月份的的可以copy上月的统计到本月,也最好不要拆完分区立即收集,因为当时数据还没有,num_rows,blks 都是0,还不如动态采样呢
对不起,这篇文章暂时关闭评论。