MMNL生成非常大的trace文件,包含ASH dump, MMON进程不存在
前几日一套11.2.0.4.7 on suse 11 2nodes2 RAC中节点2的oracle软件目录使用率高,发现mmnl后台进程生成了64GB的trace文件, 包含大量的ash dump信息, 下面简单的记录一下该问题。
Mmnl trace file
=========================
*** 2018-08-22 20:33:13.512 kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 kewa_sampler: Re-signalling MMON(74) (!= 73) at 22-AUG-18 08.53.28.796 PM *** 2018-08-22 20:53:28.796 kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 kewa_sampler: Re-signalling MMON(75) (!= 74) at 22-AUG-18 09.03.37.184 PM *** 2018-08-22 21:03:37.184 kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 kewa_sampler: Re-signalling MMON(76) (!= 75) at 22-AUG-18 09.13.36.996 PM *** 2018-08-22 21:13:36.996 kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 *** 2018-08-22 21:23:37.692 kewa_sampler: Re-signalling MMON(77) (!= 76) at 22-AUG-18 09.23.37.692 PM kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 kgskmetricupd: could not get change-schema latch kgskmetricupd: could not get change-schema latch kewa_sampler: Re-signalling MMON(78) (!= 77) at 23-AUG-18 12.23.40.741 AM *** 2018-08-23 00:23:40.741 kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 kgskmetricupd: could not get change-schema latch kgskmetricupd: could not get change-schema latch *** 2018-08-23 01:02:31.559 kgskmetricupd: could not get change-schema latch *** 2018-08-23 01:04:30.886 kgskmetricupd: could not get change-schema latch kewa_sampler: Re-signalling MMON(79) (!= 78) at 23-AUG-18 01.23.57.348 AM *** 2018-08-23 01:23:57.349 kewa_sampler_cb: Exception handler 453 kewa_sampler_cb: Cleared error 453 ASH dump *** 2018-08-23 01:38:45.614 <<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER BEGIN>>> **************** SCRIPT TO IMPORT **************** ------------------------------------------ Step 1: Create destination table <ashdump> ------------------------------------------ CREATE TABLE ashdump AS SELECT * FROM SYS.WRH$_ACTIVE_SESSION_HISTORY WHERE rownum < 0 ---------------------------------------------------------------- Step 2: Create the SQL*Loader control file <ashldr.ctl> as below ---------------------------------------------------------------- load data infile * "str '\n####\n'" append into table ashdump fields terminated by ',' optionally enclosed by '"' ( SNAP_ID CONSTANT 0 , DBID , INSTANCE_NUMBER , ... ... ... <<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP END>>>
Note:
Trace文件中包含大旱的ASH DUMP信息,MMNL后台进程负责当前活动会话历史记录(ASH)进行采样,存储在循环SGA内存缓冲区中的信息将通过快照或紧急刷新刷新到AWR。
如果循环缓冲区已满三分之二,则ASH紧急冲洗将启动。如果由于某种原因无法将信息写入AWR,则ASH会将其内容(如果没有重新启动,则每12小时一次)转储到MMNL跟踪文件。
将信息写入跟踪而不是AWR的可能原因包括:
ASH缓冲区太小<===首先检查ASH缓冲区是否太小
MMON进程没有响应<====然后,检查MMON进程没有响应的原因
1, 首先确认当前的ASH 大小是否紧张?
如果因为ash buffer size小而紧急刷新,可以从db alert log确认,也可以使用下面的sql确认:
SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info; TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT ---------- --------------- ------------------------- 134217728 20529 6828
Note:
这里的紧急刷新(AWR_FLUSH_EMERGENCY_COUNT)次数不少,可以调大该参数(可在线调整),确认ash buffer从v$ash_info.total_size, 默认“_ash_size”并非当前值,但是修改该内存大小是修改”_ash_size”参数值,可以每次尝试增长50%观察使用情况。
SQL> alter system set "_ash_size"=100m; alter system set "_ash_size"=100m * ERROR at line 1: ORA-02097: parameter cannot be modified because specified value is invalid ORA-02097: parameter cannot be modified because specified value is invalid SQL> @pd ash_size Show all parameters and session values from x$ksppi/x$ksppcv... INDX I_HEX NAME VALUE DESCRIPTION ---------- ----- -------------------------------------------------- ------------------------------ ---------------------------------------------------------------------- 2647 A57 _olap_dimension_corehash_size 30 OLAP Dimension In-Core Hash Table Maximum Memory Use 2731 AAB _kffmap_hash_size 1024 size of kffmap_hash table 2732 AAC _kffmop_hash_size 2048 size of kffmop_hash table 2840 B18 _ash_size 1048618 To set the size of the in-memory Active Session History buffers SQL> alter system set "_ash_size"=200m; System altered. SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info; TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT ---------- --------------- ------------------------- 209715200 20531 6829
Note:
如果调整的值小于当前值会提示ora-2097错误,但是在12C之前的版本超过了“_ash_size”的最大上限254Mb,同样会提示ora-2097错误。 但是从12c以后版本_ash_size值可以调整到大于254MB.
2. 检查MMON
既然写了这么多的ASH dump说明很多次是flush AWR失败的, 负责AWR Snapshot收集的是MMON进程, 然后检查MMON trace无日志文件。
oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmon grid 8576 1 0 2016 ? 03:29:29 asm_mmon_+ASM2 oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmnl grid 8578 1 0 2016 ? 2-17:07:26 asm_mmnl_+ASM2 oracle 9176 1 1 2016 ? 11-20:41:46 ora_mmnl_ANBOB2 SQL> col systimestamp form a35 SQL> col most_recent_snap_time form a25 SQL> col snap_interval form a17 SQL> select systimestamp, most_recent_snap_time, snap_interval from wrm$_wr_control where dbid = (select dbid from v$database); SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL ----------------------------------- ------------------------- ----------------- 24-AUG-18 09.36.27.638363 AM +08:00 24-AUG-18 09.00.33.122 AM +00000 01:00:00.0 SQL> select thread#,open_time from v$thread; THREAD# OPEN_TIME ---------- ----------------- 1 20161225 17:46:00 2 20160226 14:31:03 SQL> select open_mode from v$database; OPEN_MODE -------------------- READ WRITE SQL> col instance_number form 999 head INST SQL> col begin_interval_time form a25 SQL> col flush_elapsed form a17 SQL> col status form 999 SQL> col error_count form 999 head ERR SQL> col snap_flag form 999 head SNAP SQL> select * from (select snap_id, instance_number, begin_interval_time, flush_elapsed, status, error_count, snap_flag from wrm$_snapshot where dbid = (select dbid from v$database) and instance_number=2 --# instance 2 order by snap_id desc) where rownum <= 10 order by snap_id ; -- none --
Note:
该DB实例的MMON进程都不存在,且近段时间均为未生成AWR snapshot.
AWR没有生成SNAPSHOT的原因:
1,数据库配置问题和当前状态
1.1 实例参数STATISTICS_LEVEL
1.2 数据库没打开
1.3 数据库打开没有在READ WRITE模式
1.4 配置的间隔时间问题
2,数据库MMON后台进程问题
2.1 MMON进程未启动
2.2 MMON进程挂起
2.3 MMON在收集某项时时间异常久,没有完成
2.4 SYSAUX表空间可用空间紧张
实例1的AWR正常,只有实例2的存在这现象,此时可以尝试手动生成AWR Snapshot,检查是否有报错或是某项收集时间长。
SQL> exec dbms_workload_repository.create_snapshot;
SQL> exec dbms_workload_repository.create_snapshot;
SQL>
SQL> select * from
2 (select snap_id,
3 instance_number,
4 begin_interval_time,
5 flush_elapsed,
6 status,
7 error_count,
8 snap_flag
9 from wrm$_snapshot
10 where dbid = (select dbid from v$database)
11 and instance_number=2 --# instance 2
12 order by snap_id desc)
13 where rownum <= 10
14 order by snap_id ;
Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1
22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1
Note:
手动是可以生成,说明在AWR实际收集是没有问题的,只是调度进程出了问题, 当然如果手动收集时间都异常久可以配置AWR debug trace查看具体日志
Gathering MMON action trace and snapshot flush trace: SQL> alter session set "_swrf_test_action" = 28; Session altered. SQL> alter session set "_swrf_test_action" = 10; Session altered.
等待AWR snapshot生成后,关闭TRACE对应_swrf_test_action值使用29和11, 同时还有awr_test dump event level(1,2,3,4)可以trace AWR的过程。该事件虽然是alter session, 但影响是实例级的。
对于本案例主要是因为MMON进程不存在,配置EVENT已没有意义, 下一步查看db alert log 查找MMON进程没有启动或停止的原因。
# alert log Sat Sep 16 05:56:38 2017 VKRM started with pid=87, OS id=44418 Sat Sep 16 07:24:10 2017 Thread 2 advanced to log sequence 3377 (LGWR switch) Current log# 3 seq# 3377 mem# 0: +DATADG/yygk/redo03.log Sat Sep 16 09:14:04 2017 ORA-25455 encountered when generating server alert SMG-3500 Sat Sep 16 09:15:04 2017 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1] Errors in file /oracle/app/oracle/diag/rdbms/yygk/yygk2/trace/yygk2_mmon_9174.trc (incident=288273): ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] [] Incident details in: /oracle/app/oracle/diag/rdbms/yygk/yygk2/incident/incdir_288273/yygk2_mmon_9174_i288273.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Sat Sep 16 09:15:06 2017 Dumping diagnostic data in directory=[cdmp_20170916091506], requested by (instance=2, osid=9174 (MMON)), summary=[incident=288273]. Sat Sep 16 14:58:12 2017 Thread 2 advanced to log sequence 3378 (LGWR switch) Current log# 4 seq# 3378 mem# 0: +DATADG/yygk/redo04.log Sun Sep 17 01:56:39 2017 Closing Resource Manager plan via scheduler window Clearing Resource Manager plan via parameter Sun Sep 17 05:56:39 2017 Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter Sun Sep 17 05:56:39 2017 Starting background process VKRM Sun Sep 17 05:56:39 2017 oracle@anbob-2:/home/oracle>vi /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc Dump file /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /oracle/app/oracle/product/11.2.0/db System name: Linux Node name: anbob-2 Release: 3.0.76-0.11-default Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) Machine: x86_64 Instance name: anbob2 Redo thread mounted by this instance: 2 Oracle process number: 34 Unix process pid: 9174, image: oracle@anbob-2 (MMON) *** 2017-09-16 09:15:05.098 *** SESSION ID:(4795.1) 2017-09-16 09:15:05.098 *** CLIENT ID:() 2017-09-16 09:15:05.098 *** SERVICE NAME:(SYS$BACKGROUND) 2017-09-16 09:15:05.098 *** MODULE NAME:() 2017-09-16 09:15:05.098 *** ACTION NAME:() 2017-09-16 09:15:05.098 Dump continued from file: /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_mmon_9174.trc ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] [] ========= Dump for incident 288273 (ORA 7445 [kocdsgt()+368]) ======== ----- Beginning of Customized Incident Dump(s) ----- Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1] Registers: %rax: 0x00007f14f45e3640 %rbx: 0x0000000000000000 %rcx: 0x0000000000000008 %rdx: 0x0140000700000100 %rdi: 0x00007f14f4325568 %rsi: 0x0000000000000000 %rsp: 0x00007fffbec8d8c0 %rbp: 0x00007fffbec8db60 %r8: 0x0000000000000000 %r9: 0x00007f14f4325d70 %r10: 0x00007f14f4326570 %r11: 0x0000000000000000 %r12: 0x00007f14f4344e20 %r13: 0x0000000000002010 %r14: 0x0000000000000000 %r15: 0x00007fffbec8dca8 %rip: 0x0000000009929640 %efl: 0x0000000000010203 kocdsgt()+352 (0x9929630) add $-48,%rdx kocdsgt()+356 (0x9929634) mov %rdx,-0x40(%rbp) kocdsgt()+360 (0x9929638) mov -0x40(%rbp),%rax kocdsgt()+364 (0x992963c) mov 0x8(%rax),%rdx > kocdsgt()+368 (0x9929640) movzwl (%rdx),%ebx kocdsgt()+371 (0x9929643) mov -0x38(%rbp),%rcx kocdsgt()+375 (0x9929647) movzwl 0xc(%rcx),%esi kocdsgt()+379 (0x992964b) cmp %esi,%ebx kocdsgt()+381 (0x992964d) jnz 0x992982b ----- SQL Statement (None) ----- Current SQL information unavailable - no cursor. ----- Call Stack Trace ----- location type -------------------- ------ skdstdst()+41 < ksedst1()+103 < ksedst()+39 < dbkedDefDump()+2746 < ksedmp()+41 < ssexhd()+2462 < __sighandler() < kocdsgt()+368 signal kocioc()+648 < kokdn2p()+520 < kodin2p()+176 < kocapm()+1051 < koioapm()+525 < OCITypeByName()+816 < keltgtyp()+75 < kpotcuexec()+254 < rpiswu2()+1776 < OCIKSwitchUserWithPriv()+200 < keltenq()+583 < keltpost()+1812 < kelr_issue_alert()309 kelr_monitor_metrics()+3631 < ksb_run_managed_action()+384 < ksbcti()+2490 < ksbabs()+1735 < kebm_mmon_main()+209 < ksbrdp()+1045 < opirip()+623 < opidrv()+603 < sou2o()+103 < opimai_real()+250 < ssthrdmain()+265 < main()+201
Note:
根据DB ALERT LOG可以发现该实例启动后,过了一段时间MMON进程因ORA-7445内部错误挂掉后就一直没有再启动,AWR SNAPSHOT也一直没有再生成。 MOS中有个相似的CASE。 Bug 22494766 : ORA-7445[KOCCNGT+74] ON MMON PROCESS AFTER ORA-25455
解决方法
1,重启实例
或者
2. 临时启动并禁用限制模式
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;
因为这是一个新交维的边缘库,平时没有什么新连接和活动,但是重启实例目前是不可以的。 启用限制模式只影响新连接请求,已有连接无影响,我们采用方案2.
— db alert log
2018-08-24 09:01:23.236000 +08:00 Thread 2 advanced to log sequence 6820 (LGWR switch) Current log# 4 seq# 6820 mem# 0: +DATADG/anbob/redo04.log 2018-08-24 10:25:59.102000 +08:00 Stopping background process MMNL 2018-08-24 10:26:00.101000 +08:00 Starting background process MMON MMON started with pid=35, OS id=76908 Starting background process MMNL MMNL started with pid=120, OS id=76910 ALTER SYSTEM enable restricted session; ALTER SYSTEM disable restricted session; 2018-08-24 10:26:01.811000 +08:00 minact-scn: Inst 2 is a slave inc#:40 mmon proc-id:76908 status:0x2 minact-scn status: grec-scn:0x0ccf.d5bb5cdf gmin-scn:0x0ccf.d5b7d696 gcalc-scn:0x0ccf.d5bb3a01 Some DDE async actions failed or were cancelled oracle@anbob-2:/home/oracle>ps -ef|grep ora_m|grep -v grep oracle 9144 1 0 2016 ? 02:07:06 ora_mman_anbob2 oracle 9184 1 0 2016 ? 01:55:18 ora_mark_anbob2 oracle 76908 1 0 10:25 ? 00:01:35 ora_mmon_anbob2 oracle 76910 1 1 10:25 ? 00:05:52 ora_mmnl_anbob2 SQL> select * from 2 (select snap_id, 3 instance_number, 4 begin_interval_time, 5 flush_elapsed, 6 status, 7 error_count, 8 snap_flag 9 from wrm$_snapshot 10 where dbid = (select dbid from v$database) 11 and instance_number=2 --# instance 2 12 order by snap_id desc) 13 where rownum <= 10 14 order by snap_id ; Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP --------- ---- ------------------------- ----------------- ------ ---- ---- 22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1 22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1 SQL> / Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP --------- ---- ------------------------- ----------------- ------ ---- ---- 22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1 22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1 22767 2 24-AUG-18 09.52.24.661 AM +00000 00:00:05.1 0 0 0 22768 2 24-AUG-18 11.00.09.946 AM +00000 00:00:04.0 0 0 0 22769 2 24-AUG-18 12.00.22.008 PM +00000 00:00:03.2 0 0 0 22770 2 24-AUG-18 01.00.33.819 PM +00000 00:00:05.1 0 0 0 22771 2 24-AUG-18 02.00.45.857 PM +00000 00:00:04.0 0 0 0 22772 2 24-AUG-18 03.00.57.960 PM +00000 00:00:04.0 0 0 0 22773 2 24-AUG-18 04.00.09.952 PM +00000 00:00:03.9 0 0 0 22774 2 24-AUG-18 05.00.22.065 PM +00000 00:00:04.1 0 0 0
后期确认数据库实例已经可以正常收集AWR,MMON记启动,同时MMNL进程已重启,(KILL两进程对实例可用性无影响,该进程可以自动重启)。MMNL TRACE没有再生成那么大量的日志。
— enjoy it —
对不起,这篇文章暂时关闭评论。