Diagnosing a database performance problem caused by a storage link issue: ORA-32701 and "krsv_proc_kill: Killing 1 processes (Process by index)" in alert
A few days ago one of our databases ran into a performance problem. The root cause turned out not to be in the database itself, but I am writing it up in the hope that it saves you some time if you hit the same symptoms. The environment is an 11.2.0.3 two-node RAC on HP-UX with EMC storage. Starting at around 16:40 the middleware intermittently reported short bursts of business backlog; during those bursts most database sessions were doing simple inserts into the same table. The batch job was later stopped and the database showed no further obvious backlog. Since this was reported to me on a weekend evening and the batch was not restarted, the only remaining symptom was that the middleware log showed sessions being killed, and I was asked to find out why. A colleague sent me the following excerpt from the DB alert log:
# DB ALERT
Sun Jul 19 16:32:11 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc (incident=512083):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_dia0_2471_i512083.trc
Sun Jul 19 16:32:13 2015
Sweep [inc][512083]: completed
Sun Jul 19 16:32:13 2015
Sweep [inc2][512083]: completed
System State dumped to trace file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_m000_29275_i512083_a.trc
DIA0 terminating blocker (ospid: 4073 sid: 7829 ser#: 4119) of hang with ID = 223 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
     by terminating the process
DIA0 successfully terminated process ospid:4073.
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc (incident=512084):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512084/weejar2_dia0_2471_i512084.trc
Sun Jul 19 16:32:34 2015
Sweep [inc][512084]: completed
Sun Jul 19 16:32:35 2015
DIA0 terminating blocker (ospid: 16104 sid: 2287 ser#: 58237) of hang with ID = 225 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
     by terminating session sid: 2287 ospid: 16104
DIA0 successfully terminated session sid:2287 ospid:16104 with status 31.
Sun Jul 19 16:33:13 2015
Sweep [inc2][512084]: completed
Sun Jul 19 16:33:50 2015
WARN: ARC2: Terminating pid 4793 hung on an I/O operation
Sun Jul 19 16:34:13 2015
krsv_proc_kill: Killing 1 processes (Process by index)
ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_1    #<<<<<<<<<<<<<<
ARC2: Detected ARCH process failure
ARC2: STARTING ARCH PROCESSES
Sun Jul 19 16:34:15 2015
ARC0 started with pid=34, OS id=3953
Sun Jul 19 16:34:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_2471.trc (incident=512085):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512085/weejar2_dia0_2471_i512085.trc
ARC0: Archival started
ARC2: STARTING ARCH PROCESSES COMPLETE
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance weejar2 - Archival Error
ORA-16014: log 11 sequence# 12803 not archived, no available destinations
ORA-00312: online log 11 thread 2: '/dev/anbob_oravg02/ranbob_redo11'
Sun Jul 19 16:34:17 2015
Sweep [inc][512085]: completed
...
... here had truncated ...
DISTRIB TRAN 00000028.716D7979313000000000000000000000000000000000000000000000000000000006D2A955A7E4B002ECC089
  is local tran 414.10.43815 (hex=19e.0a.ab27))
  delete pending collecting tran, scn=14464580264119 (hex=d27.cc2b20b7)
DISTRIB TRAN 00000028.716D7979313000000000000000000000000000000000000000000000000000000006D2A955A7E4B002ECC08E
  is local tran 125.19.296884 (hex=7d.13.487b4))
  delete pending collecting tran, scn=14464580264148 (hex=d27.cc2b20d4)
Sun Jul 19 17:32:17 2015
Thread 2 advanced to log sequence 12808 (LGWR switch)
  Current log# 10 seq# 12808 mem# 0: /dev/anbob_oravg02/ranbob_redo10
Sun Jul 19 17:32:25 2015
Archived Log entry 22618 added for thread 2 sequence 12807 ID 0x17418ab3 dest 1:
Sun Jul 19 17:37:00 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_dia0_16512.trc (incident=512088):
ORA-32701: Possible hangs up to hang ID=0 detected
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512088/weejar2_dia0_16512_i512088.trc
Sun Jul 19 17:37:02 2015
Sweep [inc][512088]: completed
Sun Jul 19 17:37:02 2015
Sweep [inc2][512088]: completed
System State dumped to trace file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512088/weejar2_m000_10298_i512088_a.trc
DIA0 terminating blocker (ospid: 29778 sid: 9592 ser#: 20163) of hang with ID = 374 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
     by terminating session sid: 9592 ospid: 29778
DIA0 successfully terminated session sid:9592 ospid:29778 with status 31.
Sun Jul 19 17:48:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_ora_7290.trc (incident=516793):
ORA-00600: internal error code, arguments: [qerltcUserIterGet_1], [56], [56], [], [], [], [], [], [], [], [], []
ORA-24761: transaction rolled back
Incident details in: /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_516793/weejar2_ora_7290_i516793.trc
Sun Jul 19 17:48:15 2015
Errors in file /oracle/app/oracle/diag/rdbms/weejar/weejar2/trace/weejar2_ora_5983.trc (incident=525073):
ORA-00600: internal error code, arguments: [qerltcUserIterGet_1], [56], [56], [], [], [], [], [], [], [], [], []
ORA-24761: transaction rolled back
NOTE:
Note that from that point in time the alert log does show ORA-32701 hang incidents and hung processes, and from the excerpt below we can infer that there was a heavy I/O load or an I/O anomaly at the time. Because it appeared only once in the log, I did not dig any deeper at first:
“WARN: ARC2: Terminating pid 4793 hung on an I/O operation
Sun Jul 19 16:34:13 2015
krsv_proc_kill: Killing 1 processes (Process by index)
ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_1
ARC2: Detected ARCH process failure”
We can also see the DIA0 process resolving hangs by terminating processes. DIA0 is the background process of the Hang Manager (HM), a new feature in 11g: in a RAC environment it periodically collects information about hung processes and can resolve some hangs on its own, controlled by several hidden parameters (see the official documentation for details on HM). The later distributed transaction rollbacks and the ORA-600 [qerltcUserIterGet_1] / ORA-24761 errors were also side effects of those hung sessions being killed by HM.
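For reference, the hang-related hidden parameters can be listed directly from the x$ksppi / x$ksppcv fixed tables (run as SYS). This is only a minimal sketch of what the "@p" script used further down presumably does; the script itself is not shown in this post:

-- Minimal sketch (assumption): list hidden Hang Manager related parameters.
select i.ksppinm  name,
       v.ksppstvl value
  from x$ksppi  i,
       x$ksppcv v
 where i.indx = v.indx
   and lower(i.ksppinm) like '%hang%'
 order by 1;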
# hang trace file
Dump file /oracle/app/oracle/diag/rdbms/weejar/weejar2/incident/incdir_512083/weejar2_dia0_2471_i512083.trc
..
*** 2015-07-19 16:32:11.494
Resolvable Hangs in the System
                     Root       Chain Total  Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess  Sess  Sess   Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
   223 HANG RSLNPEND    2  7829     2    19   HIGH  LOCAL Terminate Process
  Hang Resolution Reason: Automatic hang resolution was performed to free a
    significant number of affected sessions.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      2    300  2745      5568    FG read by other session
      2   7829  4119      4073    FG db file sequential read

Dumping process info of pid[2767.4073] (sid:7829, ser#:4119) requested by master DIA0 process on instance 1.
*** 2015-07-19 16:32:11.494
Process diagnostic dump for oracle@qdanbob2, OS id=4073,
pid: 2767, proc_ser: 21, sid: 7829, sess_ser: 4119
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 16:32:11 ]
    NOTE: scheduling delay has not been sampled for 0.224630 secs
  0.000657 secs from [ 16:32:07 - 16:32:12 ], 5 sec avg
  0.000455 secs from [ 16:31:12 - 16:32:12 ], 1 min avg
  0.000284 secs from [ 16:27:11 - 16:32:12 ], 5 min avg

*** 2015-07-19 16:32:11.907
loadavg : 0.17 0.16 0.17
Swapinfo :
   Avail = 504818.16Mb Used = 170655.53Mb
   Swap free = 334112.25Mb Kernel rsvd = 17269.27Mb
   Free Mem  = 241739.08Mb
  F S UID  PID  PPID  C PRI NI             ADDR    SZ            WCHAN    STIME TTY   TIME COMD
  3401 S grid 4073    1  0 178 20 e000000feb8d5680 97832 e00000103c925097  Jul 10  ?  13:13 oracleweejar2 (LOCAL=NO)
Short stack dump:
ksedsts()+544<-ksdxfstk()+48<-ksdxcb()+3216<-sspuser()+688<-<-_read_sys()+48<-_read()+224<-$cold_skgfqio()+864<-ksfd_skgfqio()+400<-ksfd_io()+1168<-ksfdread()+336<-kcfrbd1()+1328<-kcbzib()+3040<-kcbgcur()+10400<-ktbgcur()+192<-ktspfpblk()+720<-ktspfsrch()+944<-ktspscan_bmb()+608<-ktspgsp_main()+1520<-kdtgsp()+1248<-kdtgsph()+1440<-kdtgrs()+560<-kdtInsRow()+1584

show parameter LOG_ARCHIVE_DEST_1

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_1                   string      location=/anbob2_arch

oracle@qdanbob2:/home/oracle> mount
...
/anbob2_arch on /dev/anbob_arch/fslvanbob_arch2 ioerror=mwdisable,largefiles,delaylog,dev=40120002 on Fri Jun 19 00:48:11 2015
/anbob1_arch on qdanbob1:/anbob1_arch noac,forcedirectio,rsize=32768,wsize=32768,NFSv3,dev=4000003 on Fri Jun 19 00:58:59 2015

SQL>@p hang

NAME                                      VALUE
----------------------------------------  ----------------------------------------
_ksv_pool_hang_kill_to                    0
_hang_analysis_num_call_stacks            3
_local_hang_analysis_interval_secs        3
_hang_msg_checksum_enabled                TRUE
_hang_detection_enabled                   TRUE      #<<<<<<<<<<< HM enabled
_hang_detection_interval                  32
_hang_resolution_scope                    PROCESS   #<<<<<<<<<<<< HM may terminate hung (non-background) processes on its own
_hang_resolution_policy                   HIGH
_hang_resolution_confidence_promotion     FALSE
_hang_resolution_global_hang_confidence_  FALSE
...
37 rows selected.
Note:
HM is enabled on this RAC, and with _hang_resolution_scope=PROCESS it is allowed to kill non-core (non-background) processes on its own. The hang trace shows that the root blocker was itself waiting on I/O (db file sequential read), and the call stack shows it was a user process executing the insert. So we have found what killed the sessions. Also note that the archive destination that hit the error in the DB alert above is a local storage path (not NFS).
The next morning, at around 09:00, the backlog started again. It was still the same insert as the day before, so I checked the execution volume of that SQL over the last few days:
Per-Plan Execution Statistics Over Time

                                                                              Avg                 Avg
Plan       Snapshot                    Avg LIO             Avg PIO          CPU (secs)      Elapsed (secs)
Hash Value Time            Execs      Per Exec            Per Exec            Per Exec            Per Exec
---------- ------------ -------- ------------------- ------------------- ------------------- -------------------
           18-JUL 07:00   81,954               27.60                0.68                0.00                0.01
           18-JUL 08:00  164,767               27.44                0.52                0.00                0.00
           18-JUL 09:00  148,988               27.89                0.57                0.00                0.00
           18-JUL 10:00  152,462               27.89                0.55                0.00                0.00
           18-JUL 11:00  133,345               27.51                0.57                0.00                0.00
           18-JUL 12:00  144,135               27.71                0.57                0.00                0.00
           18-JUL 13:00  126,773               27.66                0.64                0.00                0.00
           18-JUL 14:00  134,799               27.53                0.53                0.00                0.00
           18-JUL 15:00  125,364               27.89                0.57                0.00                0.00
           18-JUL 16:00  129,116               27.79                0.56                0.00                0.00
           18-JUL 17:00  131,873               27.86                0.58                0.00                0.00
           18-JUL 18:00  125,614               27.64                0.55                0.00                0.00
           18-JUL 19:00  126,923               27.77                0.56                0.00                0.00
           18-JUL 20:00  123,668               27.86                0.60                0.00                0.00
           18-JUL 21:00  112,380               27.58                0.52                0.00                0.00
           18-JUL 22:00   92,602               27.59                0.43                0.00                0.00
           18-JUL 23:00   53,483               27.53                0.43                0.00                0.00
           19-JUL 06:00   39,651               27.83                0.69                0.00                0.01
           19-JUL 07:00   75,884               27.77                0.69                0.00                0.01
           19-JUL 08:00  172,000               27.51                0.54                0.00                0.00
           19-JUL 09:00  148,931               27.67                0.60                0.00                0.00
           19-JUL 10:00  153,295               27.64                0.54                0.00                0.00
           19-JUL 11:00  125,487               27.73                0.58                0.00                0.00
           19-JUL 12:00  136,311               27.84                0.59                0.00                0.00
           19-JUL 13:00  114,741               27.52                0.63                0.00                0.00
           19-JUL 14:00  104,638               27.64                0.55                0.00                0.00
           19-JUL 15:00  134,834               27.65                0.57                0.00                0.00
           19-JUL 16:00  124,629               27.72                0.57                0.00                0.39
           19-JUL 17:01  134,321               28.35                0.62                0.00                0.39
           19-JUL 18:00  125,256               27.75                0.56                0.00                0.14
           19-JUL 19:00  126,852               28.61                0.61                0.00                0.32
           19-JUL 20:00  133,179               28.30                0.62                0.00                0.13
           19-JUL 21:00  109,673               28.19                0.59                0.00                0.21
           19-JUL 22:00   84,624               27.73                0.49                0.00                0.07
           19-JUL 23:00   47,534               27.51                0.48                0.00                0.05
           20-JUL 00:00   20,786               28.88                0.60                0.00                1.87
           20-JUL 01:00   20,900               27.69                0.72                0.00                0.50
           20-JUL 02:00    8,399               28.67                0.77                0.00                1.34
           20-JUL 03:00    7,174               28.26                0.75                0.00                0.35
           20-JUL 04:00    7,798               38.87                0.72                0.00                6.86
           20-JUL 05:00   16,600               30.29                0.74                0.00                1.59
           20-JUL 06:00   41,540               28.78                0.72                0.00                1.46
           20-JUL 07:00   77,297               28.04                0.72                0.00                0.72
           20-JUL 08:00  495,309               30.34                0.53                0.00                0.85   #<<<<<<<<<<<<<<<<<<
********** ------------ -------- ------------------- ------------------- ------------------- -------------------
avg                                      26.42                0.50                0.00                0.06
sum                     ########
Note:
From AWR (DBA_HIST_SQLSTAT) we can confirm that between 08:00 and 09:00 the execution count was about 300,000 higher than before, and that starting from the 16:00 snapshot of the previous day the per-execution elapsed time had grown (although still under one second). I asked the storage engineers to check whether the storage was healthy, and the application team confirmed they were running the batch again. At the time I judged it to be a case of quantity turning into quality and asked them to stop the batch first. As it turned out later, the batch only masked the real cause; it was precisely this large volume of inserts and per-row commits that made the underlying problem visible.
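For reference, per-snapshot execution statistics like the report above can be pulled from AWR. A minimal sketch against the standard DBA_HIST_SQLSTAT / DBA_HIST_SNAPSHOT views ('&sql_id' is a placeholder for the insert statement's SQL_ID; the report above was produced by a different, pre-existing script):

-- Minimal sketch (assumption): per-snapshot executions and per-exec averages from AWR.
select to_char(s.begin_interval_time, 'DD-MON HH24:MI')                               snap_time,
       sum(q.executions_delta)                                                        execs,
       round(sum(q.buffer_gets_delta)  / nullif(sum(q.executions_delta), 0), 2)       lio_per_exec,
       round(sum(q.disk_reads_delta)   / nullif(sum(q.executions_delta), 0), 2)       pio_per_exec,
       round(sum(q.elapsed_time_delta) / nullif(sum(q.executions_delta), 0) / 1e6, 2) ela_sec_per_exec
  from dba_hist_sqlstat  q,
       dba_hist_snapshot s
 where q.snap_id         = s.snap_id
   and q.instance_number = s.instance_number
   and q.sql_id          = '&sql_id'
 group by to_char(s.begin_interval_time, 'DD-MON HH24:MI')
 order by 1;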
However, even after the batch was stopped shortly after 09:00, another short burst of backlog appeared at around 10:00.
SQL> create table ash0720 tablespace users as select * from v$active_session_history;

Table created.

SQL> select * from (
  2  select etime,nvl(event,'on cpu') events, dbtime, round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime desc) rn
  3  from (
  4  select substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13)||'0' etime,event,count(*) dbtime
  5  from ash0720
  6  --where sample_time between to_date('2015-3-18 10:00','yyyy-mm-dd hh24:mi') and to_date('2015-3-18 11:00','yyyy-mm-dd hh24:mi')
  7  group by substr(to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),1,13),event
  8  )
  9  ) where rn<=10;

ETIME          EVENTS                                             DBTIME        PCT         RN
-------------- ---------------------------------------------- ---------- ---------- ----------
20150720 10:20 log file sync                                       19782         33          1
20150720 10:20 gc buffer busy release                              12219      20.39          2
20150720 10:20 buffer busy waits                                    9621      16.05          3
20150720 10:20 enq: TX - index contention                           7034      11.74          4
20150720 10:20 enq: SQ - contention                                 6303      10.52          5
20150720 10:20 read by other session                                1153       1.92          6
20150720 10:20 db file sequential read                              1061       1.77          7
20150720 10:20 inactive transaction branch                           971       1.62          8
20150720 10:20 enq: TX - row lock contention                         832       1.39          9
20150720 10:20 on cpu                                                219        .37         10
20150720 10:30 enq: TX - index contention                          93536      43.91          1   #<<<<<<<<<<<<<<<
20150720 10:30 db file sequential read                             47243      22.18          2
20150720 10:30 read by other session                               21236       9.97          3
20150720 10:30 log file sync                                       12610       5.92          4
20150720 10:30 gc buffer busy release                               7793       3.66          5
20150720 10:30 buffer busy waits                                    6061       2.85          6
20150720 10:30 enq: SQ - contention                                 5360       2.52          7
20150720 10:30 on cpu                                               5332        2.5          8
20150720 10:30 enq: TX - row lock contention                        4271          2          9
20150720 10:30 inactive transaction branch                          2698       1.27         10
20150720 10:40 read by other session                               26429      40.58          1
20150720 10:40 db file sequential read                             24469      37.57          2
20150720 10:40 on cpu                                               5291       8.12          3
20150720 10:40 db file parallel write                               2136       3.28          4
20150720 10:40 SQL*Net message from dblink                          1162       1.78          5
20150720 10:40 inactive transaction branch                          1097       1.68          6
20150720 10:40 SQL*Net more data from dblink                        1054       1.62          7
20150720 10:40 inactive session                                      698       1.07          8
20150720 10:40 enq: TX - row lock contention                         540        .83          9
20150720 10:40 db file scattered read                                472        .72         10
20150720 10:50 db file sequential read                             24371       50.1          1
20150720 10:50 read by other session                               11212      23.05          2
20150720 10:50 on cpu                                               5499       11.3          3
20150720 10:50 db file parallel write                               1817       3.74          4
20150720 10:50 enq: TX - index contention                           1596       3.28          5
20150720 10:50 SQL*Net more data from dblink                        1191       2.45          6
20150720 10:50 SQL*Net message from dblink                           976       2.01          7
20150720 10:50 inactive transaction branch                           276        .57          8
20150720 10:50 enq: TX - row lock contention                         246        .51          9
20150720 10:50 inactive session                                      196         .4         10
...
Note:
The load pressure peaked between 10:30 and 10:40 on 2015-07-20.
SQL> select * from (
  2  SELECT
  3    h.event "Wait Event",
  4    SUM(h.wait_time + h.time_waited)/1000000 "Total Wait Time"
  5  FROM
  6    ash0720 h,
  7    v$event_name e
  8  WHERE
  9    sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
 10  AND h.event_id = e.event_id
 11  AND e.wait_class <>'Idle'
 12  GROUP BY h.event
 13  ORDER BY 2 DESC)
 14  where rownum <10;

Wait Event                                                       Total Wait Time
---------------------------------------------------------------- ---------------
enq: TX - index contention                                            78846.1543
log file sync                                                         56877.8177
db file sequential read                                               43827.3883
read by other session                                                 23515.1048
enq: SQ - contention                                                  10085.8243
gc buffer busy release                                                9988.80736
buffer busy waits                                                     6469.06895
enq: TX - row lock contention                                         4129.69438
inactive transaction branch                                           2714.48481

9 rows selected.
Note:
The top events still look related to the insert workload, e.g. enq: SQ - contention, enq: TX - index contention, log file sync and buffer busy waits, plus two clearly I/O-related events. gc buffer busy release is usually associated with slow log file writes or interconnect problems. Of course these are subjective judgments based on experience; we need data to prove the cause.
Next, let's trace the wait chain starting from 'enq: TX - index contention'.
SQL> select sql_id,count(*)
  2  from ash0720
  3  where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
  4  and event='enq: TX - index contention' group by sql_id order by 2 desc;

SQL_ID                 COUNT(*)
-------------------- ----------
b1kh15rh5d2h9             33113
d71tkyh0b2q1y             21026
8nhv049t5zzh6             12854
2p8uq2t4b5v6r              8554
29rwby2p0367t              2786
...

SQL> select sql_id,blocking_session,count(*)
  2  from ash0720
  3  where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
  4  and event='enq: TX - index contention'
  5  and sql_id in ('b1kh15rh5d2h9','d71tkyh0b2q1y','8nhv049t5zzh6')
  6  group by sql_id,blocking_session
  7  order by 1,3 desc;

SQL_ID               BLOCKING_SESSION   COUNT(*)
-------------------- ---------------- ----------
8nhv049t5zzh6                    7881      12736
8nhv049t5zzh6                   10634         62
8nhv049t5zzh6                                  53
8nhv049t5zzh6                    3026          3
b1kh15rh5d2h9                     532      18733
b1kh15rh5d2h9                    8317       8996
b1kh15rh5d2h9                    8801       5148
b1kh15rh5d2h9                    9340        224
b1kh15rh5d2h9                                   7
b1kh15rh5d2h9                    8621          3
b1kh15rh5d2h9                    6628          1
b1kh15rh5d2h9                    6817          1
d71tkyh0b2q1y                    6538      15560
d71tkyh0b2q1y                   11553       5372
d71tkyh0b2q1y                    1019         85
d71tkyh0b2q1y                                   9

16 rows selected.

select session_id,blocking_session,count(*) cnt
from ash0720
where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
and event='enq: TX - index contention'
and sql_id in ('b1kh15rh5d2h9','d71tkyh0b2q1y','8nhv049t5zzh6')
and session_id in (7881,532,8317,6538,11553)
group by session_id,blocking_session;

SESSION_ID BLOCKING_SESSION        CNT
---------- ---------------- ----------
       532             8801         32
      8317              532          5
     11553             6538         53

select session_id,blocking_session,count(*) cnt
from ash0720
where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
and session_id in (8801,6538)
group by session_id,blocking_session;

SESSION_ID BLOCKING_SESSION        CNT
---------- ---------------- ----------
      6538                         233
      6538             6276          32
      8801             6276          32

select session_id,blocking_session,count(*) cnt
from ash0720
where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
and session_id in (6276,6538)
group by session_id,blocking_session;

SESSION_ID BLOCKING_SESSION        CNT
---------- ---------------- ----------
      6276                         234
      6538                         233
      6538             6276          32
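As an aside, instead of walking the chain query by query, the blocker chain in the ASH snapshot can also be followed in a single pass with a hierarchical query. A minimal sketch against the ash0720 table created above (this CONNECT BY is an illustration, not the query used during the incident):

-- Minimal sketch (assumption): walk blocking_session chains within each ASH sample;
-- the rows with the largest DEPTH per sample point at the ultimate blockers.
select sample_id,
       level            depth,
       session_id,
       blocking_session,
       event
  from ash0720
 start with event = 'enq: TX - index contention'
connect by nocycle prior blocking_session = session_id
       and prior sample_id = sample_id
 order by sample_id, depth desc;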
This finally confirms that the root blocking session is SID 6276. So what was session 6276 doing at the time?
SQL> l
  1  select machine,program,event,wait_time ,time_waited,sql_id,SESSION_STATE,CURRENT_OBJ#
  2  from ash0720
  3  where sample_time between to_date('2015-07-20 10:30','yyyy-mm-dd hh24:mi') and to_date('2015-07-20 10:40','yyyy-mm-dd hh24:mi')
  4* and session_id in (6276)
SQL> /

MACHINE    PROGRAM                          EVENT                      WAIT_TIME TIME_WAITED SQL_ID        SESSION CURRENT_OBJ#
---------- -------------------------------- ------------------------- ---------- ----------- ------------- ------- ------------
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        1487               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        1367               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        4800               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        2006               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        1198               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        2812               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        3939               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        5072               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        6973               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        6783               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        1225               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        3826               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        3154               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)                                           5070           0               ON CPU            -1
qdanbob2   oracle@qdanbob2 (LGWR)                                          21949           0               ON CPU            -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        9892               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0       28703               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0      436821               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        3153               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        1487               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0      205371               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        5429               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0        5180               WAITING           -1
qdanbob2   oracle@qdanbob2 (LGWR)           log file parallel write            0   134327524               WAITING           -1
...

234 rows selected.
Note:
The output above is truncated for length. The root blocker is the LGWR background process waiting on 'log file parallel write', and some of the wait times are very large. 'log file parallel write' is what LGWR waits on while writing redo to the online logs and waiting for the disk writes to be acknowledged; committing foreground sessions see this as 'log file sync'. Remembering the archive-log I/O error from the previous evening, I suspected that slow I/O was involved. I had notified the storage team in the morning; since I did not have the root password until noon, it was only after logging in as root at midday that the OS log confirmed storage-layer errors, as shown below.
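To quantify how slow the redo and data-file reads/writes actually were, the wait-time distribution of these events can also be checked. A minimal sketch against the standard v$event_histogram view (complementary evidence, not part of the original diagnosis):

-- Minimal sketch: latency distribution of the key I/O waits on this instance.
-- WAIT_TIME_MILLI is the upper bound of each latency bucket.
select event,
       wait_time_milli,
       wait_count
  from v$event_histogram
 where event in ('log file parallel write', 'db file sequential read')
 order by event, wait_time_milli;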
#OS log /var/adm/message
Jul 19 01:05:01 qdanbob2 vmunix:
Jul 19 01:05:03 qdanbob2  above message repeats 6 times
Jul 19 01:05:03 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064320150719010503, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064420150719010503, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064520150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064620150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 01:05:04 qdanbob2 CIM Indication[3683]: Indication (default format):IndicationIdentifier = 2064720150719010504, ProviderName = HPUXESCSIIndicationProvider, PerceivedSeverity = 6, EventID = 206
Jul 19 05:28:35 qdanbob2 sshd[29499]: error: PAM: No account present for user for illegal user aqzhdg from 133.96.102.206
Jul 19 05:28:37 qdanbob2 sshd[29499]: error: PAM: No account present for user for illegal user aqzhdg from 133.96.102.206
Jul 19 16:25:15 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:15 qdanbob2 vmunix:    The diagnostic logging facility has started receiving excessive
Jul 19 16:25:15 qdanbob2 vmunix:    errors. The error entries will be lost until the cause of
Jul 19 16:25:15 qdanbob2 vmunix:    the excessive error logging is corrected.
Jul 19 16:25:15 qdanbob2 vmunix:    Use oserrlogd(1M) man page for further details.
Jul 19 16:25:20 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:20 qdanbob2 vmunix:    The diagnostic logging facility is no longer receiving excessive
Jul 19 16:25:20 qdanbob2 vmunix:    errors . 9 error entries were lost.
Jul 19 16:25:23 qdanbob2 vmunix:    The diagnostic logging facility has started receiving excessive
Jul 19 16:25:23 qdanbob2 vmunix:    errors. The error entries will be lost until the cause of
Jul 19 16:25:23 qdanbob2 vmunix:    the excessive error logging is corrected.
Jul 19 16:25:23 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 19 16:25:23 qdanbob2 vmunix:    Use oserrlogd(1M) man page for further details.
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: Path lunpath413 to 000290102087 is dead.Killing bus 6 to Symmetrix 000290102087 port 8aA.Path lunpath320 to 000290102087 is dead.Path lunpath327 to 000290102087 is dead.Path lunpath325 to 000290102087 is dead.Path lunpath326 to 000290102087 is dead.Pat
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: h lunpath338 to 000290102087 is dead.Path lunpath337 to 000290102087 is dead.Path lunpath330 to 000290102087 is dead.Path lunpath345 to 000290102087 is dead.Path lunpath332 to 000290102087 is dead.Path lunpath351 to 000290102087 is dead.Path lunpath339 to
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: 000290102087 is dead.Path lunpath329 to 000290102087 is dead.Path lunpath323 to 000290102087 is dead.Path lunpath356 to 000290102087 is dead.Path lunpath328 to 000290102087 is dead.Path lunpath331 to 000290102087 is dead.Path lunpath321 to 000290102087 is
...
407 to 000290102087 is dead.Path lunpath402 to 000290102087 is dead.Path lunpath311 to 000290102087 is
Jul 19 16:25:48 qdanbob2 vmunix: emcp:Mpx:Error: dead.Path lunpath410 to 000290102087 is dead.Path lunpath408 to 000290102087 is dead.Path lunpath400 to 000290102087 is dead.Path lunpath409 to 000290102087 is dead.Path lunpath310 to 000290102087 is dead.Path lunpath419 to 000290102087 is dead.Path lunpat
...
Jul 20 14:24:50 qdanbob2 vmunix:        class : disk, instance 2004
Jul 20 14:24:50 qdanbob2 vmunix:        All available lunpaths of a LUN (dev=0xd000120)
Jul 20 14:24:50 qdanbob2 vmunix:        have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix:        condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix:        1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:50 qdanbob2 vmunix:        class : disk, instance 2005
Jul 20 14:24:50 qdanbob2 vmunix:        All available lunpaths of a LUN (dev=0xcb250200)
Jul 20 14:24:50 qdanbob2 vmunix:        have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix:        condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix:        1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:50 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 20 14:24:50 qdanbob2 vmunix:    The diagnostic logging facility has started receiving excessive
Jul 20 14:24:50 qdanbob2 vmunix:    errors. The error entries will be lost until the cause of
Jul 20 14:24:50 qdanbob2 vmunix:    the excessive error logging is corrected.
Jul 20 14:24:50 qdanbob2 vmunix:    Use oserrlogd(1M) man page for further details.
Jul 20 14:24:50 qdanbob2 vmunix:        class : disk, instance 2006
Jul 20 14:24:50 qdanbob2 vmunix:        All available lunpaths of a LUN (dev=0xcb250300)
Jul 20 14:24:50 qdanbob2 vmunix:        have gone offline. The LUN has entered a transient
Jul 20 14:24:50 qdanbob2 vmunix:        condition. The transient time threshold is 120 seconds.
Jul 20 14:24:50 qdanbob2 vmunix:        1 lunpaths are currently in a failed state.
Jul 20 14:24:50 qdanbob2 vmunix:
Jul 20 14:24:51 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 20 14:25:01 qdanbob2 vmunix:        class : tgtpath, instance 6
Jul 20 14:25:01 qdanbob2 vmunix:        Target path (class=tgtpath, instance=6) has gone online. The target path h/w path is 41/0/2/2/0/0/0.0x5006048452a6d1c7
Jul 20 14:25:01 qdanbob2 vmunix:
Jul 20 14:25:01 qdanbob2 vmunix: DIAGNOSTIC SYSTEM WARNING:
Jul 20 14:25:01 qdanbob2 vmunix:    The diagnostic logging facility is no longer receiving excessive
Jul 20 14:25:01 qdanbob2 vmunix:    errors . 2 error entries were lost.
Jul 20 14:28:42 qdanbob2 vmunix: emcp:Mpx:Info:
Jul 20 14:30:37 qdanbob2 vmunix: emcp:Mpx:Info: h374 to 000290102087 is dead.Path lunpath248 to 000290102087 is dead.Path lunpath369 to 000290102087 is dead.Path lunpath279 to 000290102087 is alive.Path lunpath279 to 000290102087 is dead.Path lunpath409 to 000290102087 is alive.Path lunpath369 to 000290
Jul 20 14:34:06 qdanbob2 vmunix: emcp:Mpx:Info:
Jul 20 14:35:11 qdanbob2 vmunix:        class : lunpath, instance 374
Jul 20 14:35:11 qdanbob2 vmunix:        lun path (class = lunpath, instance = 374) belonging to LUN (default minor = 0xec) has gone offline.
                                        The lunpath hwpath is 41/0/2/2/0/0/0.0x5006048452a6d1c7.0x4112000000000000
Jul 20 14:35:11 qdanbob2 vmunix:
Jul 20 14:35:50 qdanbob2 vmunix:        class : lunpath, instance 374
Jul 20 14:35:50 qdanbob2 vmunix:        lun path (class = lunpath, instance = 374) belonging to LUN (default minor = 0xec) has come online
Jul 20 14:35:50 qdanbob2 vmunix:
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error:
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error: 102087 is alive.Path lunpath243 to 000290102087 is alive.Path lunpath243 to 000290102087 is dead.Killing bus 6 to Symmetrix 000290102087 port 8aA.Path lunpath409 to 000290102087 is dead.Path lunpath369 to 000290102087 is dead.Path lunpath288 to 0002901
Jul 20 14:36:48 qdanbob2 vmunix: emcp:Mpx:Error:
This confirms that the storage had been having intermittent, unstable problems since about 16:20 the previous day. At this point the storage team also confirmed that one of the storage links was faulty.
qdanbob2[/var/adm]#powermt display
Symmetrix logical device count=202
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters ---------  ------ I/O Paths -----  ------ Stats ------
###  HW Path                       Summary   Total   Dead  IO/Sec Q-IOs Errors
==============================================================================
   0 41/0/0/2/0/0/0                optimal     194      0       -     2      0
   6 41/0/2/2/0/0/0                degraded     16     12       -     0     10   <<<<<
   8 44/0/0/2/0/0/0                optimal     194      0       -     3      0
  42 44/0/2/2/0/0/0                optimal     194      0       -     1      0
After HBA path #6 was disabled, the database returned to normal and I/O latency dropped noticeably, so the problem was finally attributed to the storage link. Because the symptoms were intermittent and AWR snapshots have one-hour granularity, I had not looked at AWR first; in hindsight, the average wait times of the I/O events in the AWR top events were also quite high, and the physical write + physical read IOPS can serve as a reference as well.
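For reference, those average I/O wait times can also be tracked across AWR snapshots. A minimal sketch against the standard DBA_HIST_SYSTEM_EVENT / DBA_HIST_SNAPSHOT views (the LAG-based delta calculation is an illustration, not a script used during this incident):

-- Minimal sketch (assumption): average wait in ms per snapshot for the key I/O events,
-- computed as deltas of the cumulative counters in DBA_HIST_SYSTEM_EVENT.
with ev as (
  select s.begin_interval_time,
         e.instance_number,
         e.event_name,
         e.total_waits
           - lag(e.total_waits) over (partition by e.dbid, e.instance_number, e.event_name
                                      order by e.snap_id) waits,
         e.time_waited_micro
           - lag(e.time_waited_micro) over (partition by e.dbid, e.instance_number, e.event_name
                                            order by e.snap_id) waited_us
    from dba_hist_system_event e,
         dba_hist_snapshot     s
   where e.snap_id         = s.snap_id
     and e.dbid            = s.dbid
     and e.instance_number = s.instance_number
     and e.event_name in ('log file parallel write', 'db file sequential read')
)
select to_char(begin_interval_time, 'DD-MON HH24:MI')  snap_time,
       instance_number                                 inst,
       event_name,
       waits,
       round(waited_us / nullif(waits, 0) / 1000, 2)   avg_ms
  from ev
 order by begin_interval_time, instance_number, event_name;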
qdanbob2[/var/adm]#powerrmt display
sh: powerrmt:  not found.
qdanbob2[/var/adm]#
qdanbob2[/var/adm]#powermt display
Symmetrix logical device count=202
CLARiiON logical device count=0
Hitachi logical device count=0
Invista logical device count=0
HP xp logical device count=0
Ess logical device count=0
HP HSx logical device count=0
==============================================================================
----- Host Bus Adapters ---------  ------ I/O Paths -----  ------ Stats ------
###  HW Path                       Summary   Total   Dead  IO/Sec Q-IOs Errors
==============================================================================
   0 41/0/0/2/0/0/0                optimal     194      0       -     2      0
   6 41/0/2/2/0/0/0                failed       16     16       -     0      0   <<<<<<<<
   8 44/0/0/2/0/0/0                optimal     194      0       -     2      0
  42 44/0/2/2/0/0/0                optimal     194      0       -     1      0
Summary:
The whole picture is now clear. Starting at about 16:30 the previous day, one of the storage links had been failing intermittently. When the batch insert poured in data with a commit per row, the storage problem made the I/O-related wait events in the database stand out, e.g. log file sync, index block contention/splits, gc buffer busy, read by other session, and so on. After the one bad link out of the four storage links was disabled, everything returned to normal. Going forward we need to strengthen monitoring of the storage layer.