Troubleshooting DB load high wait ‘ON CPU’ by New ASH in 12c R2
前在前面
2021年1月6日,石家庄被按下了暂停键,街道变得宁静,新冠肺炎疫情反扑,周围气氛变的紧张, “空城”、“管控”、“隔离”、“确诊”等词让人不安, 他日的新闻中的场景,今日我们正在经历, 我在国际庄,距离藁城最近,5号晚封城后开始多渠道筹备粮食,居家办公,等核算检测结果,每天刷看确诊人员活动轨迹是否有与自己重合,最近2天附近又有新增诊被接走,希望庄里的小伙伴做好防护,感谢医务人员和所有抗疫人员的付出,再坚持一下,寒冷的“冬天”总会过去,一起加油。
远程办公第二天就2起故障,有1个故障感觉比较有意思,特分享出来, 为了是表明猜测充分,原谅我这里附上较多的输出。
背景
6日下午17:20接到告警CPU使用率高, 环境 Oracle 12cR2 3-nodes RAC on Linux. 登录系统查看是实例1活动会话接近200. 这是一个<90 cpu的主机. Session status 全是on cpu. 当然从我们的监控上很直观定位2条select SQL 占用了50% 左右的db time. 首先可能怀疑是SQL 执行量增长或执行计划改变等效率变差了, 这是一个1类系统,需要快速恢复应用,客户要求先kill部分会话, 查看了sql执行计划并未改变, 可是在KILL 部分会话后还在负载逐渐增长,没有多长时间,17:29 主机ping不通了,并且没有重启。
应用还好有配置TAF自动failover到了其它节点,并且运行良好,就当时的负载能把主机压死确实不多见,难道X86就如此脆弱? 下一步如何分析? 我们的AWR 周期是30分钟,AWR中关于SQL stat的信息还没收集。 当然事后会说为什么不收systemdump , 为什么不收hanganalyze, 为什么不手动收AWR, 为什么不把v$ash 物化下来, 可往往故障就是这么猝不及防,如何根据当前的信息进行分析?
如果此时有套监控采集信息或许能够提供部分信息, 如分析sysstat中sql executes是不是增长了?奇怪这该时间段我们的监控sysstat数据也没有采集到.
当时KILL 部分会话后收集
sampling SID all with interval 5 seconds, taking 1 snapshots... -- Session Snapper v4.14 BETA - by Tanel Poder ( http://blog.tanelpoder.com ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :) ---------------------------------------------------------------------------------------------------- Active% | INST | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS ---------------------------------------------------------------------------------------------------- 3500% | 1 | 56twj7s93jaz4 | 0 | ON CPU | ON CPU 1700% | 1 | dp8fnkzqdt3km | 0 | ON CPU | ON CPU 1300% | 1 | 05qszn0ufs4ff | 0 | ON CPU | ON CPU 1200% | 1 | 1b9zsamawq6mh | 0 | ON CPU | ON CPU 1100% | 1 | a7kcm21nbngvx | 0 | ON CPU | ON CPU 1100% | 1 | 2v7njuhdw0pgm | 1 | ON CPU | ON CPU 1000% | 1 | 13nf2mwh3xmsh | 1 | ON CPU | ON CPU 900% | 1 | | 0 | ON CPU | ON CPU 700% | 1 | 60t0pum7f1pbm | 1 | ON CPU | ON CPU 600% | 1 | 60t0pum7f1pbm | 2 | ON CPU | ON CPU -- End of ASH snap 1, end=2021-01-06 17:24:43, seconds=9, samples_taken=1
可以确认当时全时ON CPU, 从上图看负载是从16:58左右开始, 分析这类问题DASH是常用的分析工具,之前总结过ASH相关的BLOG《Know more about Oracle ASH》,前不久还有同事问我ASH是何时从memory刷到disk中的, 1,默认AWR的快照间隔;2,如果ASH buffer 已满时MMNL进程都会负责这事。 如果17:00 到 17:30的AWR 没有形成,那17:00前那2分钟的DASH(dba_hist_active_sess_history)可以做为分析的入口。在统计top SQL 历史快照时发现实例1其实16:30 – 17:00 AWR snapshot也没形成。问题时间段跨越的2个AWR snapshot都没生成,那DASH 真没有数据了吗? 11G前是这样的,当12c不是的。
12c ASH 从memroy刷到disk形为改变
检查DASH数据是否存在?
SQL> select to_char(sample_time,'yyyymmdd hh24:mi'),count(*) --not * 10 from dba_hist_active_sess_history where sample_time >to_date('2021-01-06 17','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24') and instance_number=1 group by to_char(sample_time,'yyyymmdd hh24:mi') order by 1; TO_CHAR(SAMPLE COUNT(*) -------------- ---------- 20210106 17:00 503 20210106 17:01 273 20210106 17:02 572 20210106 17:03 601 20210106 17:04 607 20210106 17:05 293 20210106 17:06 571 20210106 17:07 559 20210106 17:08 553 20210106 17:09 286 20210106 17:10 510 20210106 17:11 458 20210106 17:12 443 20210106 17:13 434 20210106 17:14 455 20210106 17:15 710 20210106 17:16 457 20210106 17:17 433 20210106 17:18 481 20210106 17:19 503 20210106 17:20 259 20210106 17:21 527 20210106 17:22 530 20210106 17:23 529 20210106 17:24 209 25 rows selected.
Note:
实例1 crash(17:28)之前的ASH数据基本上都在,这也正是oracle宣传的,ASH及时在系统负载很快时也可以很好的工作。 检查了db alert log没有提示ASH buffer 不足的问题, 那ASH是什么刷新频率刷到DISK上的呢?
SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1; MAX(SAMPLE_TIME) SYSDATE --------------------------------------------------------------------------- ----------------- 14-JAN-21 10.54.32.234 PM 20210114 22:58:23 SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1; MAX(SAMPLE_TIME) SYSDATE --------------------------------------------------------------------------- ----------------- 14-JAN-21 10.54.32.234 PM 20210114 22:59:52 SQL> r 1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1 MAX(SAMPLE_TIME) SYSDATE --------------------------------------------------------------------------- ----------------- 14-JAN-21 10.59.54.686 PM 20210114 23:00:56 SQL> r 1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1 MAX(SAMPLE_TIME) SYSDATE --------------------------------------------------------------------------- ----------------- 14-JAN-21 11.04.26.622 PM 20210114 23:07:58
Note:
非AWR snapshot flush时间,DASH也在逐渐的更新最新数据,基本上可以判断,当前的ASH是大概5分钟更新到DASH(dba_hist_active_sess_history)中, 那继续查找一下相关的ASH参数。
SQL> select
2 n.indx
3 , to_char(n.indx, 'XXXX') i_hex
4 , n.ksppinm pd_name
5 , c.ksppstvl pd_value
6 , n.ksppdesc pd_descr
7 from sys.x$ksppi n, sys.x$ksppcv c
8 where n.indx=c.indx
9 and
10 lower(n.ksppinm) || ' ' || lower(n.ksppdesc) like lower('%\_ash%')
11 escape '\' ;
INDX I_HEX NAME VALUE DESCRIPTION
---------- ----- ------------------------------------ ----------- ----------------------------------------------------------------------
4546 11C2 _ash_sampling_interval 1000 Time interval between two successive Active Session samples in
millisecs
4547 11C3 _ash_size 1048618 To set the size of the in-memory Active Session History buffers
4548 11C4 _ash_enable TRUE To enable or disable Active Session sampling and flushing
4549 11C5 _ash_disk_write_enable TRUE To enable or disable Active Session History flushing
4550 11C6 _ash_disk_filter_ratio 10 Ratio of the number of in-memory samples to the number of samples
actually written to disk
4551 11C7 _ash_eflush_trigger 66 The percentage above which if the in-memory ASH is full the emergency
flusher will be triggered
4552 11C8 _ash_sample_all FALSE To enable or disable sampling every connected session including ones
waiting for idle waits
4553 11C9 _ash_dummy_test_param 0 Oracle internal dummy ASH parameter used ONLY for testing!
4554 11CA _ash_min_mmnl_dump 90 Minimum Time interval passed to consider MMNL Dump
4555 11CB _ash_compression_enable TRUE To enable or disable string compression in ASH
4556 11CC _ash_progressive_flush_interval 300 ASH Progressive Flush interval in secs
11 rows selected.
Note:
_ash_progressive_flush_interval 这个隐藏参数值300秒,描述也说明了它是控制ASH渐进式刷新频率秒数,该参数11G中并不存在,与之相关的TIPs几乎没有, 不过我们可以判断从12c除了上面提到的2种ASH flush形为,还有第3种,每300秒渐进式ASH也会flush到disk. 不得不感叹ORACLE一直在悄悄改善的更加完美, 这么实用的功能没有得到广泛的宣传。
从DASH 分析的SQL效率
先从现有的dba_hist_sqlstat分析TOP SQL的执行情况
脚本sql_hist.sql
Summary Execution Statistics Over Time of SQL_ID:dp8fnkzqdt3km Avg Avg Snapshot Avg LIO Avg PIO CPU (secs) Elapsed (secs) Beg Time INSTANCE_NUMBER Execs Per Exec Per Exec Per Exec Per Exec ------------ --------------- ------------ ------------------- --------------- --------------- ------------------- 06-JAN 14:00 1 824 362,726.48 3.82 1.56 1.57 06-JAN 14:00 2 841 0.00 6.04 1.58 1.59 06-JAN 14:30 1 1,084 358,591.06 0.80 1.52 1.53 06-JAN 14:30 2 725 239,373.03 10.00 1.35 1.36 06-JAN 15:00 1 1,303 335,387.17 0.36 1.56 1.56 06-JAN 15:00 2 903 254,055.09 2.25 1.37 1.38 06-JAN 15:30 2 850 315,445.81 18.58 1.63 1.66 06-JAN 15:30 1 1,238 335,970.25 1.11 1.50 1.50 06-JAN 16:00 1 1,048 339,324.59 1.46 1.54 1.55 06-JAN 16:00 2 765 310,440.36 1.66 1.43 1.44 06-JAN 16:30 2 703 273,389.33 21.77 1.48 1.50 06-JAN 17:00 2 595 238,034.54 67.25 1.33 1.46 06-JAN 17:30 2 450 313,110.14 215.90 1.55 1.72 06-JAN 18:00 2 408 312,342.91 59.53 1.50 1.58 06-JAN 18:30 1 64 305,746.14 3,045.39 2.78 4.15 06-JAN 18:30 2 152 395,698.88 224.32 1.94 2.06 06-JAN 19:00 1 75 378,465.49 249.27 2.08 2.16 06-JAN 19:00 2 122 346,031.39 33.33 1.88 1.90 06-JAN 19:30 2 31 237,640.52 30.10 1.59 1.61 06-JAN 19:30 1 77 193,193.10 37.27 1.49 1.51 06-JAN 20:00 1 88 317,937.30 347.97 1.67 1.76 06-JAN 20:00 2 26 253,282.27 6.77 1.41 1.42 06-JAN 20:30 1 20 430,854.10 1.65 2.42 2.43 06-JAN 20:30 2 18 344,005.61 75.72 2.42 2.47 06-JAN 21:00 1 5 306,552.00 25.60 1.53 1.56 06-JAN 21:00 2 13 342,130.31 2.77 2.31 2.32 06-JAN 21:30 1 35 6,480.89 2.60 0.06 0.06 07-JAN 07:30 1 12 396,023.42 2,950.17 2.50 3.96 07-JAN 07:30 2 17 2,123,654.59 3,804.82 6.46 8.35 07-JAN 08:00 1 144 225,576.83 489.54 1.35 1.76
Note:
当然没有问题时间段的实例1 sql的执行情况,但是可以判断基本上每次执行20-40万的逻辑读,耗费时间为1-2秒,另外每天7:00和实例1刚启动时因为第一次物理读,单次执行也是在4-8秒之间。
看一下具体的负载趋势
SQL> create table dbmt.dash0106 tablespace users
as select * from dba_hist_active_sess_history
where sample_time >to_date('2021-01-06 16','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24') ; Table created. SQL> select * from (
2 select etime,nvl(event,'on cpu') events,sql_id, dbtime, cnt,first_time,end_time,
3 round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime desc) rn
4 from (
5 select to_char(SAMPLE_TIME,'yyyymmdd hh24:mi') etime,event,sql_id,count(*)*10 dbtime,count(*) cnt,
6 to_char(min(SAMPLE_TIME),'hh24:mi:ss') first_time,to_char(max(SAMPLE_TIME),'hh24:mi:ss') end_time
7 from dbmt.dash0106
8 --where sample_time between to_date('2015-4-1 16:00','yyyy-mm-dd hh24:mi') and to_date('2015-4-1 17:00','yyyy-mm-dd hh24:mi')
9 where INSTANCE_NUMBER=1
10 group by to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),event,sql_id
11 )
12 ) where rn<=5;
ETIME EVENTS SQL_ID DBTIME CNT FIRST_TI END_TIME PCT RN
-------------- ---------- ------------- ---------- ---------- -------- -------- ---------- ----------
20210106 16:30 on cpu d506gkxjgw7xr 40 4 16:30:04 16:30:54 7.41 1
on cpu 4vk65sy477mxm 40 4 16:30:24 16:30:54 7.41 2
on cpu 1wcsvyq7xshqz 40 4 16:30:24 16:30:54 7.41 3
on cpu 40 4 16:30:04 16:30:44 7.41 4
on cpu 60dw4cw904b83 30 3 16:30:14 16:30:34 5.56 5
20210106 16:31 on cpu 56twj7s93jaz4 120 12 16:31:24 16:31:24 15.19 1
on cpu b15j01vphnb60 100 10 16:31:14 16:31:34 12.66 2
on cpu dp8fnkzqdt3km 100 10 16:31:04 16:31:54 12.66 3
db file se fmtaf5hf9tm7s 60 6 16:31:04 16:31:54 7.59 4
quential r
ead
on cpu gm9rz10b0kb50 60 6 16:31:14 16:31:44 7.59 5
20210106 16:32 on cpu dp8fnkzqdt3km 50 5 16:32:15 16:32:45 9.43 1
on cpu 4vk65sy477mxm 40 4 16:32:25 16:32:55 7.55 2
on cpu gvv9dcnsc1jf9 40 4 16:32:25 16:32:55 7.55 3
on cpu d506gkxjgw7xr 40 4 16:32:04 16:32:45 7.55 4
on cpu 0hpuufdajtzch 30 3 16:32:04 16:32:25 5.66 5
20210106 16:33 on cpu b15j01vphnb60 60 6 16:33:05 16:33:55 16.22 1
on cpu dp8fnkzqdt3km 40 4 16:33:15 16:33:45 10.81 2
on cpu 4vk65sy477mxm 40 4 16:33:25 16:33:55 10.81 3
db file se fmtaf5hf9tm7s 30 3 16:33:05 16:33:55 8.11 4
quential r
ead
gc cr gran fmtaf5hf9tm7s 20 2 16:33:15 16:33:25 5.41 5
t 2-way
...
...
...
20210106 16:54 on cpu d506gkxjgw7xr 70 7 16:54:05 16:54:55 15.22 1
on cpu 40 4 16:54:25 16:54:45 8.7 2
gc cr gran fmtaf5hf9tm7s 30 3 16:54:05 16:54:45 6.52 3
t 2-way
on cpu 1wcsvyq7xshqz 30 3 16:54:05 16:54:25 6.52 4
on cpu gvv9dcnsc1jf9 30 3 16:54:25 16:54:55 6.52 5
20210106 16:55 on cpu d506gkxjgw7xr 70 7 16:55:05 16:55:45 12.07 1
on cpu 50 5 16:55:05 16:55:45 8.62 2
db file se fmtaf5hf9tm7s 50 5 16:55:05 16:55:55 8.62 3
quential r
ead
on cpu gm9rz10b0kb50 40 4 16:55:05 16:55:55 6.9 4
on cpu gvv9dcnsc1jf9 40 4 16:55:05 16:55:55 6.9 5
20210106 16:56 on cpu dp8fnkzqdt3km 100 10 16:56:05 16:56:56 17.24 1
on cpu c8t8f3rps66d5 50 5 16:56:05 16:56:46 8.62 2
on cpu gvv9dcnsc1jf9 40 4 16:56:16 16:56:46 6.9 3
on cpu d506gkxjgw7xr 40 4 16:56:05 16:56:56 6.9 4
gc cr gran fmtaf5hf9tm7s 30 3 16:56:16 16:56:56 5.17 5
t 2-way
20210106 16:57 on cpu 56twj7s93jaz4 920 92 16:57:23 16:57:44 27.71 1
on cpu dp8fnkzqdt3km 350 35 16:57:08 16:57:44 10.54 2
on cpu 310 31 16:57:08 16:57:44 9.34 3
on cpu d506gkxjgw7xr 210 21 16:57:08 16:57:44 6.33 4
on cpu gvv9dcnsc1jf9 160 16 16:57:08 16:57:44 4.82 5
20210106 16:58 on cpu 56twj7s93jaz4 1970 197 16:58:08 16:58:57 36.41 1
on cpu dp8fnkzqdt3km 1100 110 16:58:08 16:58:57 20.33 2
on cpu 390 39 16:58:08 16:58:57 7.21 3
on cpu d506gkxjgw7xr 250 25 16:58:08 16:58:57 4.62 4
on cpu gvv9dcnsc1jf9 250 25 16:58:08 16:58:57 4.62 5
20210106 16:59 on cpu 56twj7s93jaz4 1370 137 16:59:25 16:59:53 32.39 1
on cpu dp8fnkzqdt3km 1220 122 16:59:25 16:59:53 28.84 2
on cpu 240 24 16:59:25 16:59:53 5.67 3
on cpu d506gkxjgw7xr 160 16 16:59:25 16:59:53 3.78 4
on cpu 05qszn0ufs4ff 140 14 16:59:25 16:59:53 3.31 5
20210106 17:00 on cpu dp8fnkzqdt3km 1650 165 17:00:25 17:00:56 32.8 1
on cpu 56twj7s93jaz4 1360 136 17:00:25 17:00:56 27.04 2
on cpu 270 27 17:00:25 17:00:56 5.37 3
on cpu d500bxxqf6dbs 160 16 17:00:25 17:00:56 3.18 4
on cpu 05qszn0ufs4ff 150 15 17:00:25 17:00:56 2.98 5
20210106 17:01 on cpu dp8fnkzqdt3km 900 90 17:01:30 17:01:30 32.97 1
on cpu 56twj7s93jaz4 620 62 17:01:30 17:01:30 22.71 2
on cpu 60t0pum7f1pbm 160 16 17:01:30 17:01:30 5.86 3
on cpu 2v7njuhdw0pgm 110 11 17:01:30 17:01:30 4.03 4
on cpu 100 10 17:01:30 17:01:30 3.66 5
20210106 17:02 on cpu dp8fnkzqdt3km 1890 189 17:02:03 17:02:37 33.04 1
on cpu 56twj7s93jaz4 970 97 17:02:03 17:02:37 16.96 2
on cpu 60t0pum7f1pbm 510 51 17:02:03 17:02:37 8.92 3
on cpu 2v7njuhdw0pgm 460 46 17:02:03 17:02:37 8.04 4
on cpu 05qszn0ufs4ff 190 19 17:02:03 17:02:37 3.32 5
20210106 17:03 on cpu dp8fnkzqdt3km 2090 209 17:03:12 17:03:47 34.78 1
on cpu 56twj7s93jaz4 730 73 17:03:12 17:03:47 12.15 2
on cpu 2v7njuhdw0pgm 660 66 17:03:12 17:03:47 10.98 3
on cpu 60t0pum7f1pbm 470 47 17:03:12 17:03:47 7.82 4
on cpu 320 32 17:03:12 17:03:47 5.32 5
20210106 17:04 on cpu dp8fnkzqdt3km 2100 210 17:04:19 17:04:58 34.6 1
on cpu 2v7njuhdw0pgm 550 55 17:04:19 17:04:58 9.06 2
on cpu 56twj7s93jaz4 550 55 17:04:19 17:04:58 9.06 3
on cpu 60t0pum7f1pbm 360 36 17:04:19 17:04:58 5.93 4
on cpu 260 26 17:04:19 17:04:58 4.28 5
20210106 17:05 on cpu dp8fnkzqdt3km 990 99 17:05:33 17:05:33 33.79 1
on cpu 2v7njuhdw0pgm 240 24 17:05:33 17:05:33 8.19 2
on cpu 56twj7s93jaz4 230 23 17:05:33 17:05:33 7.85 3
on cpu 60t0pum7f1pbm 190 19 17:05:33 17:05:33 6.48 4
on cpu a7kcm21nbngvx 180 18 17:05:33 17:05:33 6.14 5
20210106 17:06 on cpu dp8fnkzqdt3km 1980 198 17:06:05 17:06:40 34.68 1
on cpu 2v7njuhdw0pgm 380 38 17:06:05 17:06:40 6.65 2
on cpu 56twj7s93jaz4 380 38 17:06:05 17:06:40 6.65 3
on cpu a7kcm21nbngvx 380 38 17:06:05 17:06:40 6.65 4
on cpu 60t0pum7f1pbm 320 32 17:06:05 17:06:40 5.6 5
20210106 17:07 on cpu dp8fnkzqdt3km 1870 187 17:07:14 17:07:47 33.45 1
on cpu 2v7njuhdw0pgm 400 40 17:07:14 17:07:47 7.16 2
on cpu a7kcm21nbngvx 320 32 17:07:14 17:07:47 5.72 3
on cpu f018b0x00auxp 310 31 17:07:14 17:07:47 5.55 4
on cpu 60t0pum7f1pbm 300 30 17:07:14 17:07:47 5.37 5
20210106 17:08 on cpu dp8fnkzqdt3km 1820 182 17:08:18 17:08:51 32.91 1
on cpu 2v7njuhdw0pgm 420 42 17:08:18 17:08:51 7.59 2
on cpu f018b0x00auxp 400 40 17:08:18 17:08:51 7.23 3
on cpu 13nf2mwh3xmsh 390 39 17:08:18 17:08:51 7.05 4
on cpu 60t0pum7f1pbm 330 33 17:08:18 17:08:51 5.97 5
20210106 17:09 on cpu dp8fnkzqdt3km 910 91 17:09:26 17:09:26 31.82 1
on cpu 2v7njuhdw0pgm 310 31 17:09:26 17:09:26 10.84 2
on cpu f018b0x00auxp 220 22 17:09:26 17:09:26 7.69 3
on cpu 60t0pum7f1pbm 180 18 17:09:26 17:09:26 6.29 4
on cpu 13nf2mwh3xmsh 150 15 17:09:26 17:09:26 5.24 5
20210106 17:10 on cpu dp8fnkzqdt3km 1830 183 17:10:02 17:10:35 35.88 1
on cpu 2v7njuhdw0pgm 530 53 17:10:02 17:10:35 10.39 2
on cpu 13nf2mwh3xmsh 290 29 17:10:02 17:10:35 5.69 3
on cpu f018b0x00auxp 280 28 17:10:02 17:10:35 5.49 4
on cpu 270 27 17:10:02 17:10:35 5.29 5
...
20210106 17:23 on cpu dp8fnkzqdt3km 1390 139 17:23:09 17:23:40 26.28 1
on cpu 56twj7s93jaz4 770 77 17:23:09 17:23:40 14.56 2
on cpu 60t0pum7f1pbm 330 33 17:23:09 17:23:40 6.24 3
on cpu 05qszn0ufs4ff 260 26 17:23:09 17:23:40 4.91 4
on cpu 250 25 17:23:09 17:23:40 4.73 5
20210106 17:24 on cpu 56twj7s93jaz4 390 39 17:24:07 17:24:07 18.66 1
on cpu 160 16 17:24:07 17:24:07 7.66 2
on cpu 60t0pum7f1pbm 140 14 17:24:07 17:24:07 6.7 3
on cpu 05qszn0ufs4ff 130 13 17:24:07 17:24:07 6.22 4
on cpu a7kcm21nbngvx 110 11 17:24:07 17:24:07 5.26 5
425 rows selected.
Note:
从16:57开始的, DBtime 呈上升趋势,并且wait event: ON CPU, 也能看到TOP SQL。
正常时间段DASH中TOP sql
SQL> select session_id,to_char(sample_time,'yyyymmdd hh24:mi:ss') etime,event,sql_exec_id,SQL_EXEC_START,session_state,TIME_WAITED,IN_SQL_EXECUTION,TM_DELTA_CPU_TIME from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:45','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 16:54','yyyy-mm-dd hh24:mi') and sql_id='dp8fnkzqdt3km' order by 1,2; SESSION_ID ETIME EVENT SQL_EXEC_ID SQL_EXEC_START SESSION TIME_WAITED I TM_DELTA_CPU_TIME ---------- ----------------- ---------- ----------- ----------------- ------- ----------- - ----------------- 12 20210106 16:49:22 20337054 20210106 16:49:20 ON CPU 0 Y 5824114 12 20210106 16:49:53 20337078 20210106 16:49:51 ON CPU 0 Y 2731150 467 20210106 16:45:06 41876133 20210106 16:45:05 ON CPU 0 Y 12821930 921 20210106 16:52:14 20337167 20210106 16:52:13 ON CPU 0 Y 9083661 1229 20210106 16:48:48 41876201 20210106 16:48:45 ON CPU 0 Y 28963774 1532 20210106 16:45:10 20336897 20210106 16:45:06 ON CPU 0 Y 12105818 1532 20210106 16:51:43 20337139 20210106 16:51:43 ON CPU 0 Y 9087716 1535 20210106 16:47:52 20337014 20210106 16:47:51 ON CPU 0 Y 6957901 1535 20210106 16:48:02 20337023 20210106 16:48:01 ON CPU 0 Y 6100204 1537 20210106 16:47:11 20336989 20210106 16:47:11 ON CPU 0 Y 10426243 1559 20210106 16:48:12 20337027 20210106 16:48:11 ON CPU 0 Y 20086346 1559 20210106 16:48:22 20337032 20210106 16:48:17 ON CPU 0 Y 5404968 1559 20210106 16:50:13 20337087 20210106 16:50:12 ON CPU 0 Y 5066286 2292 20210106 16:49:02 20337039 20210106 16:49:00 ON CPU 0 Y 6094836 2435 20210106 16:46:16 41876161 20210106 16:46:15 ON CPU 0 Y 16064789 2741 20210106 16:49:02 20337040 20210106 16:49:01 ON CPU 0 Y 5135847 2746 20210106 16:45:31 20336918 20210106 16:45:30 ON CPU 0 Y 559010 2746 20210106 16:46:21 20336951 20210106 16:46:20 ON CPU 0 Y 442342
TIP:
主要是看同一session 同一SQL的执行情况,通过session_id(session_serial#)和sql_exec_id来确认,sql_exec_id在同一SESSION执行sql会累增,sql_exec_id不变再看SQL_EXEC_START SQL开始执行的时间。 可见正常时间SQL持续运行的时间都是<2秒的。
问题时间段的DASH中TOP sql
SQL> select session_id,--SESSION_SERIAL#, (used to uniquely identify a session's objects) to_char(sample_time,'yyyymmdd hh24:mi:ss') etime,event,sql_exec_id,SQL_EXEC_START,session_state,TIME_WAITED,IN_SQL_EXECUTION,TM_DELTA_CPU_TIME from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:58','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 17:10','yyyy-mm-dd hh24:mi') and sql_id='dp8fnkzqdt3km' order by 1,2 SESSION_ID ETIME EVENT SQL_EXEC_ID SQL_EXEC_START SESSION TIME_WAITED I TM_DELTA_CPU_TIME ---------- ----------------- ---------- ----------- ----------------- ------- ----------- - ----------------- 9 20210106 17:02:03 20337526 20210106 17:02:01 ON CPU 0 Y 3050637 9 20210106 17:02:37 20337526 20210106 17:02:01 ON CPU 0 Y 8411155 9 20210106 17:03:12 20337526 20210106 17:02:01 ON CPU 0 Y 7732513 9 20210106 17:03:47 20337526 20210106 17:02:01 ON CPU 0 Y 8855886 9 20210106 17:04:19 20337526 20210106 17:02:01 ON CPU 0 Y 6742950 9 20210106 17:04:58 20337526 20210106 17:02:01 ON CPU 0 Y 8861440 9 20210106 17:05:33 20337526 20210106 17:02:01 ON CPU 0 Y 7035218 9 20210106 17:06:05 20337526 20210106 17:02:01 ON CPU 0 Y 7703591 9 20210106 17:06:40 20337526 20210106 17:02:01 ON CPU 0 Y 8466669 12 20210106 16:58:08 20337396 20210106 16:57:38 ON CPU 0 Y 7636384 12 20210106 16:58:31 20337396 20210106 16:57:38 ON CPU 0 Y 6840769 12 20210106 16:58:57 20337396 20210106 16:57:38 ON CPU 0 Y 6953856 12 20210106 16:59:25 20337396 20210106 16:57:38 ON CPU 0 Y 6760293 12 20210106 16:59:53 20337396 20210106 16:57:38 ON CPU 0 Y 6476894 12 20210106 17:00:25 20337396 20210106 16:57:38 ON CPU 0 Y 6974848 12 20210106 17:00:56 20337396 20210106 16:57:38 ON CPU 0 Y 5540165 12 20210106 17:01:30 20337396 20210106 16:57:38 ON CPU 0 Y 6374697 12 20210106 17:02:03 20337396 20210106 16:57:38 ON CPU 0 Y 6177705 12 20210106 17:02:37 20337396 20210106 16:57:38 ON CPU 0 Y 5990692 12 20210106 17:03:12 20337396 20210106 16:57:38 ON CPU 0 Y 6111413 12 20210106 17:03:47 20337396 20210106 16:57:38 ON CPU 0 Y 6873905 12 20210106 17:04:19 20337396 20210106 16:57:38 ON CPU 0 Y 5680029 12 20210106 17:04:58 20337396 20210106 16:57:38 ON CPU 0 Y 6661646 12 20210106 17:05:33 20337396 20210106 16:57:38 ON CPU 0 Y 5920078 12 20210106 17:06:05 20337396 20210106 16:57:38 ON CPU 0 Y 6712145 12 20210106 17:06:40 20337396 20210106 16:57:38 ON CPU 0 Y 6765375 12 20210106 17:07:14 20337396 20210106 16:57:38 ON CPU 0 Y 5585508 12 20210106 17:07:47 20337396 20210106 16:57:38 ON CPU 0 Y 6002971 12 20210106 17:08:18 20337396 20210106 16:57:38 ON CPU 0 Y 6159024 12 20210106 17:08:51 20337396 20210106 16:57:38 ON CPU 0 Y 6594718 12 20210106 17:09:26 20337396 20210106 16:57:38 ON CPU 0 Y 6444770 40 20210106 17:03:47 20337544 20210106 17:03:13 ON CPU 0 Y 8750964 40 20210106 17:04:19 20337544 20210106 17:03:13 ON CPU 0 Y 5033203 40 20210106 17:04:58 20337544 20210106 17:03:13 ON CPU 0 Y 6892512 40 20210106 17:05:33 20337544 20210106 17:03:13 ON CPU 0 Y 6081303 40 20210106 17:06:05 20337544 20210106 17:03:13 ON CPU 0 Y 6165369 40 20210106 17:06:40 20337544 20210106 17:03:13 ON CPU 0 Y 6525653 40 20210106 17:07:14 20337544 20210106 17:03:13 ON CPU 0 Y 5790222 40 20210106 17:07:47 20337544 20210106 17:03:13 ON CPU 0 Y 6261831 40 20210106 17:08:18 20337544 20210106 17:03:13 ON CPU 0 Y 5841272 40 20210106 17:08:51 20337544 20210106 17:03:13 ON CPU 0 Y 6215711 40 20210106 17:09:26 20337544 20210106 17:03:13 ON CPU 0 Y 6987163 44 20210106 17:00:56 20337496 20210106 17:00:34 ON CPU 0 Y 7026321 44 20210106 17:01:30 20337496 20210106 17:00:34 ON CPU 0 Y 8437444 44 20210106 17:02:03 20337496 20210106 17:00:34 ON CPU 0 Y 8608695 44 20210106 17:02:37 20337496 20210106 17:00:34 ON CPU 0 Y 8561660 44 20210106 17:03:12 20337496 20210106 17:00:34 ON CPU 0 Y 8350169 181 20210106 16:58:08 20337392 20210106 16:57:35 ON CPU 0 Y 10671584 ...
Note:
这个时段的sql个别会话单次执行时间4分钟,10分钟以上。且这时间段没有I/O 类(物理读)wait event 全是ON CPU,业务也确认该时间没有太多数据变化, 从业务failover到其它节点后的执行效率也基本上是2秒内和之前一样,也能判断不是数据变化问题。
当然2个时间段的SQL 执行计划也一致
SQL> select to_char(sample_time,'yyyymmdd hh24:mi') etime,sql_id,SQL_PLAN_HASH_VALUE,count(*) from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:40','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 17:10','yyyy-mm-dd hh24:mi') and sql_id in('dp8fnkzqdt3km','56twj7s93jaz4') group by to_char(sample_time,'yyyymmdd hh24:mi'),sql_id,SQL_PLAN_HASH_VALUE order by 2,1; ETIME SQL_ID SQL_PLAN_HASH_VALUE COUNT(*) -------------- ------------- ------------------- ---------- 20210106 16:40 dp8fnkzqdt3km 2900077901 11 20210106 16:41 dp8fnkzqdt3km 2900077901 4 20210106 16:42 dp8fnkzqdt3km 2900077901 7 20210106 16:43 dp8fnkzqdt3km 2900077901 9 ... 20210106 16:51 dp8fnkzqdt3km 2900077901 3 20210106 16:52 dp8fnkzqdt3km 2900077901 6 20210106 16:53 dp8fnkzqdt3km 2900077901 9 20210106 16:54 dp8fnkzqdt3km 2900077901 6 20210106 16:55 dp8fnkzqdt3km 2900077901 6 20210106 16:56 dp8fnkzqdt3km 2900077901 11 20210106 16:57 dp8fnkzqdt3km 2900077901 47 20210106 16:58 dp8fnkzqdt3km 2900077901 111 20210106 16:59 dp8fnkzqdt3km 2900077901 123 20210106 17:00 dp8fnkzqdt3km 2900077901 168 20210106 17:01 dp8fnkzqdt3km 2900077901 95 20210106 17:02 dp8fnkzqdt3km 2900077901 193 20210106 17:03 dp8fnkzqdt3km 2900077901 213 ... 20210106 17:08 dp8fnkzqdt3km 2900077901 186 20210106 17:09 dp8fnkzqdt3km 2900077901 95
其它SQL也存在这种现象。执行计划一致,数据无变化,短时间内单次执行时间从秒级放大分钟级,并且session state 也是在ON CPU, 这段时间在buffer gets做运算,所以怀疑当时buffer gets出现系统调用问题。
主机什么原因CRASH
主机上的messages log服务问题没有生成日志,但是硬件工程师确认是因为物理内存损坏了,时间点也基本和数据库负载突增吻合, 所以有理由怀疑是在主机CRASH前,内存异常,导致数据库同一类SQL的内存内数据buffer get异常,单次SQL执行时间变长,最终导致负载的积压。
本次数据库负载异常或故障突然CRASH,而AWR snapshot没有形成时,在12c后中的ASH每5分钟逐渐式flush disk,已不会刷新太频繁而增加系统负载,也不会等到AWR SNAPSHOT时间大粒度间隔而突然重启而ASH数据缺失无法分析。本次就是利用DASH中SQL两个时间段的SQL执行持续时间判断SQL变慢而导致的业务积压,而非SQL执行量增加,或执行计划变化。
— ENJOY —
对不起,这篇文章暂时关闭评论。