首页 » ORACLE 9i-23ai » Troubleshooting Oracle LGWR wait Event ‘reliable message’ and %sys CPU Usage High, instance crash during DSG running

Troubleshooting Oracle LGWR wait Event ‘reliable message’ and %sys CPU Usage High, instance crash during DSG running

最近一客银行客户的Oracle环境,在部署了DSG做数据抽取后,数据库频繁的重启,希望分析一下原因, 环境oracle 12c 2nodes- RAC on RHEL x86-64 7.3 , 数据库实例为Datatguard Pyhical Standby端,使用多租户。开始LGwr等待 ‘reliable message’,后出现IPC Send timeout detected, 过几分钟后实例2驱逐,不久后实例1 crash 。Oracle home和/ 使用XFS 文件系统。 问题期间大量进程活动,从ps查看处于D状态,并且WCHAN等待为xlog_G开头的函数调用,这里记录一下该事件。

DB inst 1 alert log

2024-11-07T23:13:05.770687+08:00
TT08: Standby redo logfile selected for thread 1 sequence 247283 for destination LOG_ARCHIVE_DEST_4
2024-11-07T23:13:06.008202+08:00
Archived Log entry 1370449 added for T-1.S-247282 ID 0x512a7116 LAD:1
2024-11-07T23:13:06.260776+08:00
TT07: Standby redo logfile selected for thread 1 sequence 247283 for destination LOG_ARCHIVE_DEST_3
2024-11-07T23:22:30.367476+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob_com/anbob1/trace/anbob1_ora_29964.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-12805: parallel query server died unexpectedly
2024-11-07T23:22:30.367607+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob_com/anbob1/trace/anbob1_ora_96147.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-12805: parallel query server died unexpectedly
2024-11-07T23:22:32.615509+08:00
LGWR (ospid: 64433) waits for event 'enq: IV -  contention' for 0 secs.
2024-11-07T23:24:30.771542+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 59 secs.
2024-11-07T23:25:26.189320+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 114 secs.
2024-11-07T23:26:23.282447+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 171 secs.
2024-11-07T23:26:27.174052+08:00
IPC Send timeout detected. Sender: ospid 90234 [oracle@CLIENT1]
Receiver: inst 2 binc 10 ospid 1148
2024-11-07T23:27:34.577764+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 242 secs.   
2024-11-07T23:27:43.449045+08:00
ESCMONODB(14):IPC Send timeout detected. Sender: ospid 31576 [oracle@CLIENT1]
ESCMONODB(14):Receiver: inst 2 binc 10 ospid 79077
2024-11-07T23:28:27.518025+08:00
IPC Send timeout detected. Sender: ospid 96147 [oracle@CLIENT1]
Receiver: inst 2 binc 10 ospid 78901
2024-11-07T23:28:29.374610+08:00
RUOYIDB(4):IPC Send timeout detected. Sender: ospid 31740 [oracle@CLIENT1]
RUOYIDB(4):Receiver: inst 2 binc 10 ospid 79012
2024-11-07T23:28:30.950458+08:00
CNAPSDB(11):IPC Send timeout detected. Sender: ospid 31465 [oracle@CLIENT1]
CNAPSDB(11):Receiver: inst 2 binc 10 ospid 79022
2024-11-07T23:28:34.611903+08:00
ESCCENODB(13):IPC Send timeout detected. Sender: ospid 31447 [oracle@CLIENT1]
ESCCENODB(13):Receiver: inst 2 binc 10 ospid 78909
2024-11-07T23:28:35.168674+08:00
RUOYIDB(4):IPC Send timeout detected. Sender: ospid 31599 [oracle@CLIENT1]
...
2024-11-07T23:28:35.766709+08:00
MIVSDB(16):IPC Send timeout detected. Sender: ospid 31537 [oracle@CLIENT1]
MIVSDB(16):Receiver: inst 2 binc 10 ospid 78912
2024-11-07T23:28:37.913906+08:00
ESCCENODB(13):IPC Send timeout detected. Sender: ospid 31588 [oracle@CLIENT1]
ESCCENODB(13):Receiver: inst 2 binc 10 ospid 78974
2024-11-07T23:28:39.814580+08:00
IPC Send timeout detected. Sender: ospid 32440 [oracle@CLIENT1]
Receiver: inst 2 binc 10 ospid 78977
2024-11-07T23:28:52.174104+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 320 secs.
2024-11-07T23:29:56.692630+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 385 secs.
2024-11-07T23:30:18.801219+08:00
IPC Send timeout detected. Sender: ospid 64627 [oracle@CLIENT1 (LCK0)]
Receiver: inst 2 binc 10 ospid 130754
2024-11-07T23:30:24.862600+08:00
Communications reconfiguration: instance_number 2 by ospid 64627
2024-11-07T23:30:35.260242+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 423 secs.
2024-11-07T23:31:14.041423+08:00
LGWR (ospid: 64433) waits for event 'reliable message' for 462 secs.
2024-11-07T23:31:14.534753+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster     <<<<<<<<<<<<<<<
2024-11-07T23:31:17.318800+08:00
Waiting for instances to leave: 2
2024-11-07T23:31:25.821187+08:00
...
2024-11-07T23:36:16.810355+08:00
WARNING: inbound connection timed out (ORA-3136)
2024-11-07T23:36:32.508374+08:00
IPC Send timeout detected. Sender: ospid 64296 [oracle@CLIENT1 (PING)]
Receiver: inst 2 binc 10 ospid 130561
2024-11-07T23:45:30.631116+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
2024-11-07T23:45:30.683940+08:00
System state dump requested by (instance=1, osid=64256 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob_com/anbob1/trace/anbob1_diag_64277_20241107234530.trc
2024-11-07T23:45:30.723837+08:00
GEN0 (ospid: 64256): terminating the instance due to error 15082
2024-11-07T23:45:37.372583+08:00                                               《《《《《《《《《《《《《《《《《《《《 instance crash    ******************************************* 23:45:37
Instance terminated by GEN0, pid = 64256

23:23:32 inst1 lgwr hang
23:25:14 node1 system cpu high 75% begein
23:31:36 inst2 reconfig not start ORA-29740: evicted by instance number 1,
23:37:32 asm2 lmon recfg LMS0 (ospid: 24853) has detected no messaging activity from instance 1
23:37:50 node1 system cpu high 75% end
23:45:37 inst1 crash

NOTE:
先是lgwr等待’reliable message’ 一直没有反馈,日志显示持续了有近10分钟,同时还IPC send timeout,说明当时该节点在和其他进程通信时已经无法完成,在中间过程中等待,可能是本地资源,可能是网络。但是lgwr是不跨网络的,本地资源的概率大。

DB inst 2 alert log

2024-11-07T23:12:37.372124+08:00
(3):Resize operation completed for file# 58, old size 1249280K, new size 1259520K
2024-11-07T23:30:13.856726+08:00
IPC Receiver dump detected. Sender instance 1 Receiver pnum 81 ospid 27602 [oracle@anbob2 (LCK0)], pser 1
2024-11-07T23:30:13.857702+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob_com/anbob2/trace/anbob2_lck0_27602.trc:
2024-11-07T23:31:00.206118+08:00
Detected an inconsistent instance membership by instance 1
Errors in file /u01/app/oracle/diag/rdbms/anbob_com/anbob2/trace/anbob2_lmon_27210.trc  (incident=321465) (PDBNAME=CDB$ROOT):
ORA-29740: evicted by instance number 1, group incarnation 16
Incident details in: /u01/app/oracle/diag/rdbms/anbob_com/anbob2/incident/incdir_321465/anbob2_lmon_27210_i321465.trc
2024-11-07T23:31:01.449171+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2024-11-07T23:31:01.449409+08:00
Errors in file /u01/app/oracle/diag/rdbms/anbob_com/anbob2/trace/anbob2_lmon_27210.trc:
ORA-29740: evicted by instance number 1, group incarnation 16
Errors in file /u01/app/oracle/diag/rdbms/anbob_com/anbob2/trace/anbob2_lmon_27210.trc  (incident=321466) (PDBNAME=CDB$ROOT):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/anbob_com/anbob2/incident/incdir_321466/anbob2_lmon_27210_i321466.trc
2024-11-07T23:31:01.511743+08:00
USER (ospid: 80956): terminating the instance due to error 481
2024-11-07T23:31:03.199858+08:00
License high water mark = 173
2024-11-07T23:31:12.566133+08:00
Instance terminated by USER, pid = 80956

Note:
节点2是被实例1驱逐。

Event ‘reliable message’

在 Oracle 数据库 中,event ‘reliable message’ 是与其内部消息传递和通信机制相关的事件,等待“reliable message”是通信机制的一部分,当另一个进程向该资源发布消息时,该资源(称为通道)的多个订阅者将收到消息通知。进程向其他进程订阅的通道发送消息。当发送reliable message时,消息发布者等待“reliable message”并阻塞,直到所有订阅者都使用了该消息,首先要确定的是,‘reliable message’可靠消息意味着消息的发布者确实会等待其发布消息的通道的所有订阅者完成消息的使用。这就是可靠消息的“可靠”含义:

My Oracle Support 上的可靠消息说明提供了很好的描述:
https://support.oracle.com/epmos/faces/DocContentDisplay ?_afrLoop=34595019638128&id=69088.1&_afrWindowMode=0&_adf.ctrl-state=15ief27plw_97

说明了什么是可靠消息:
– 使用 KSR 实例内广播服务发送的消息。
– 消息发布者等待此等待事件,直到所有订阅者都已使用该消息。

– 发布者等待最多一秒钟,然后重新测试所有订阅者是否都已使用该消息。

下面是MOS中的已知问题.

Troubleshooting High Waits for ‘Reliable Message’ (Doc ID 2017390.1)

Waits for ‘reliable message’ are part of a communication mechanism where multiple subscribers to a resource (called a Channel) are notified of a message receipt when another process publishes to the resource. A process sends a message to a channel subscribed to by other processes. When a Reliable message is sent, then the message publisher waits on ‘reliable message’ and block until all subscribers have consumed the message.

This means that if these channels are being used by applications in your database, waits for ‘reliable message’ are expected. Although these waits may not be causing a perfomance issue, their presence may be noticed when they appear towards the top of AWR reports.

If there is no performance issue, these waits can be ignored.

Very High Waits for ‘reliable message’ After Upgrade to 11.2.0.4 When Using Result Cache (Doc ID 1951729.1).

After upgrade to 11.2.0.4, high waits for Reliable message is seen in AWR reports:
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                               Total        Wait         % DB
Event                          Waits        Time Avg(ms) time   Wait Class
------------------------------ ------------ ---- ------- ------ ----------
reliable message                 34,293,326 1430      42   64.1 Other
DB CPU                                       265           11.9
enq: TX - row lock contention     2,556,859 114K      45    5.1 Application
PX Deq: Signal ACK RSG              102,595 101K     985    4.5 Other
log file sync                     6,458,803 86.5      13    3.9 Commit
cursor: pin S wait on X             184,770 34.8     189    1.6 Concurrent


SELECT CHANNEL,
SUM(wait_count) sum_wait_count
FROM GV$CHANNEL_WAITS
GROUP BY CHANNEL
ORDER BY SUM(wait_count) DESC;

CHANNEL Wait Count
------------------------------------------------------ -----------
Result Cache: Channel 307718423

To workaround the issue, disable result cache

Bug 27834569 Hang In RAC CDB Showing ‘row cache lock’ And ‘reliable message’ Wait Events

If you see hang in RAC + multitenant database with following wait event, you may be seeing this problem.
- GEN0 waiting for 'row cache lock'
- Someone waiting for 'reliable message' while executing
  DBMS_AUDIT_MGMT.CLEAN_AUDIT_TRAIL(USE_LAST_ARCH_TIMESTAMP => FALSE)

Workaround
Use DBMS_AUDIT_MGMT.CLEAN_AUDIT_TRAIL(USE_LAST_ARCH_TIMESTAMP => TRUE) instead.

该问题进入AWR Top event时,需要关注是因为多次还是单次长时间的等待,要检查等待时间,请查看平均等待时间和等待直方图,确定与reliable message相关联的通道,也可以使用类似于下面的查询来查找通道中最长的等待时间:

SELECT    inst_id,    channel,    wait_time_usec
FROM    GV$CHANNEL_WAITS
WHERE    wait_time_usec = (
        SELECT            MAX ( wait_time_usec )
        FROM            GV$CHANNEL_WAITS
);

SELECT CHANNEL, SUM(wait_count) sum_wait_count
FROM GV$CHANNEL_WAITS
GROUP BY CHANNEL
ORDER BY SUM(wait_count) DESC;

可以通过ASH 报告 ,确定与 ‘reliable message’等待相关联的SQL,也可以从hanganalyze 中查找p1 p2 p3 value.

-- replace  with the p1 value of the reliable message wait event
select des.id_ksrcdes, des.name_ksrcdes
from x$ksrcctx ctx, x$ksrcdes des 
where to_number(ctx.addr,'XXXXXXXXXXXXXXXX') =  
and ctx.name_ksrcctx = des.indx;

您可以使用 x$ksrchdl 视图并将其与频道上下文的发布者和订阅者链接起来:

-- inspired by and largely taken from Tanel Poder's channels2.sql script (https://github.com/tanelpoder/tpt-oracle/blob/master/channel2.sql)
-- replace  with the p1 value of the reliable message wait event
select case when bitand(c.flags_ksrchdl,1)=1 then 'PUB ' end || 
       case when bitand(c.flags_ksrchdl,2)=2 then 'SUB ' end || 
       case when bitand(c.flags_ksrchdl,16)=16 then 'INA ' end flags,
       s.sid,
       p.program,
       cd.name_ksrcdes channel_name, 
       cd.id_ksrcdes, 
       c.ctxp_ksrchdl
from   x$ksrchdl c , 
       v$process p, 
       v$session s, 
       x$ksrcctx ctx, 
       x$ksrcdes cd
where 1=1
and to_number(c.ctxp_ksrchdl,'XXXXXXXXXXXXXXXX') = 
and s.paddr(+)=c.owner_ksrchdl 
and p.addr(+)=c.owner_ksrchdl
and c.ctxp_ksrchdl=ctx.addr
and cd.indx=ctx.name_ksrcctx
;
FLAGS          SID PROGRAM                                                      CHANNEL_NAME                             ID_KSRCDES CTXP_KSRCHDL
------------ ----- ------------------------------------------------------------ ---------------------------------------- ---------- ----------------
SUB            258 oracle@oracle-database.local (CKPT)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            141 oracle@oracle-database.local (M001)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            142 oracle@oracle-database.local (W005)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            145 oracle@oracle-database.local (W007)                          obj broadcast channel                    kcbo       000000007AAB3020
PUB            396 oracle@oracle-database.local (TNS V1-V3)                     obj broadcast channel                    kcbo       000000007AAB30

Frits Hoogland在他的博客的例子中如上,这个通道“kcbo”有 4 个发布者(“my” session 396 ,以及后台进程 M001、W005 和 W007)和一个订阅者:CKPT,即检查点进程。在这种情况下,我们正在等待的进程的信息是可以猜出来的,这不仅是因为我自己造成了等待的情况,还因为 v$session 中的“final_blocking_session”字段会将您指向它。但在有多个订阅者的情况下,您可以使用此查询查找当前参与该通道的所有进程。

订阅会话本质上会执行 3 件对可靠消息等待至关重要的事情:
1. 它在订阅的频道中找到已发布的消息,并开始对该消息采取行动。
2. 它执行作为该频道订阅者应该完成的任务。
3. 它“消费”该消息,从而结束发布者的可靠消息等待事件。

但是以上已知问题和分析都不匹配也不具备,因为数据库是dataguard standby端(只读),数据库都已重启没有ASH可以使用,无法获取当时数据库内部的信息,crash 时的systemstate信息也没发现有价值信息,操作系统无错误日志,还好有osw。

OSW分析系统资源

# cat   vmstat.dat
zzz ***Thu Nov 7 23:22:36 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
11  0 7541836 18806696  12844 321531200    0    0  3309   168    0    0  5  2 92  1  0
20  0 7541836 18809128  12844 321552640    0    0 204766 15941 137700 246695  6  4 91  0  0
17  0 7541836 18778348  12844 321614112    0    0 134766  3866 128380 231076  7  5 89  0  0
zzz ***Thu Nov 7 23:23:51 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0 7541828 18278892  12844 323527072    0    0  3309   168    0    0  5  2 92  1  0
 6  0 7541828 18271016  12844 323527584    0    0 181703  8434 113684 232293  3  2 95  0  0
 6  0 7541828 18271408  12844 323527200    0    0 103946  5767 112520 231018  3  2 95  0  0
zzz ***Thu Nov 7 23:25:14 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
227  0 7538552 13170584  12844 324721888    0    0  3309   168    0    0  5  2 92  1  0
182  0 7538552 13158816  12844 324724096    0    0 87137  6541 142565 167124  4 75 21  0  0       <<<<<<<<<<<<<<<<< sys cpu high 75%
196  0 7538552 13154240  12844 324726624    0    0 90009  2213 143354 169086  4 75 21  0  0
zzz ***Thu Nov 7 23:26:29 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
212  1 7538428 8480808  12844 325425024    0    0  3309   168    0    0  5  2 92  1  0
188  1 7538420 8322508  12844 325450400   64    0 158427  8875 157294 194546  9 71 20  0  0
203  1 7538420 8245808  12844 325458848    0    0 107657 16625 160095 196190  7 72 20  0  0
zzz ***Thu Nov 7 23:27:44 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
199  0 7538400 7740080  12844 325719008    0    0  3309   168    0    0  5  2 92  1  0
239  0 7538400 7740020  12844 325721280    0    0 54608  8751 140473 161132  3 77 20  0  0
186  0 7538400 7738368  12844 325722560    0    0 107210  8775 142060 165850  3 77 20  0  0
zzz ***Thu Nov 7 23:29:00 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
221  0 7537328 2416808  12844 326702560    0    0  3309   168    0    0  5  2 92  1  0
172  0 7537328 2404624  12844 326707744    0    0 67249  4329 142713 170111  4 75 20  0  0
175  0 7537328 2359524  12844 326716960    0    0 59329  4402 144166 170062  5 76 20  0  0
zzz ***Thu Nov 7 23:30:16 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
187  0 6261420 111976976  12844 258522064    0    0  3309   168    0    0  5  2 92  1  0
195  0 6261396 111918856  12844 258523280   32    0 24282 30398 145962 154892 11 70 19  0  0
308  1 6261332 111832128  12844 258524784  204    0 28854 21659 149888 159073 11 70 20  0  0
...
zzz ***Thu Nov 7 23:36:35 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
141  0 2758228 240106928  12844 183423680    0    0  3309   168    0    0  5  2 92  1  0
142  0 2758212 240095808  12844 183426976   72    0   968 19370 130196 122438  3 76 20  0  0
167  0 2758212 240091776  12844 183429136    0    0 22528 11624 126890 118465  3 77 21  0  0
zzz ***Thu Nov 7 23:37:50 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
166  1 2758196 241069904  12844 183555920    0    0  3309   168    0    0  5  2 92  1  0
158  0 2758196 241061728  12844 183560352    0    0  1008 10368 131145 113137  5 80 15  0  0
132  0 2758192 241057504  12844 183563136   20    0  1028 19594 132034 118878  5 78 17  0  0
zzz ***Thu Nov 7 23:39:04 CST 2024
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0 2758188 242105056  12844 184251232    0    0  3309   168    0    0  5  2 92  1  0
12  0 2758188 242106288  12844 184249712    0    0 24080 284882 66376 123636  4  3 94  0  0

# egrep '^zzz|64433' CLIENT1_ps_24.11.07.2300.dat
 
 USER        PID   PPID PRI %CPU %MEM    VSZ   RSS WCHAN  S  STARTED     TIME COMMAND
zzz ***Thu Nov 7 23:01:13 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:35 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:02:29 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:37 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:03:42 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:40 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:04:57 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:41 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:06:14 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:44 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:07:29 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:48 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:08:46 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 -   S   Jun 28 3-14:14:49 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:10:03 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:52 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:11:20 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:54 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:12:37 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:56 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:13:53 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:14:58 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:15:09 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:15:00 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:16:24 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:15:04 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:17:40 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:15:05 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:18:56 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:15:08 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:20:11 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:15:10 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:21:27 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351956 poll_s S   Jun 28 3-14:15:10 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:22:42 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:12 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:23:57 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:13 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:25:20 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:14 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:26:35 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:16 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:27:51 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:17 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:29:08 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:18 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:30:24 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 351992 poll_s S   Jun 28 3-14:15:20 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:31:39 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 352012 poll_s S   Jun 28 3-14:15:23 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:32:56 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 353964 poll_s S   Jun 28 3-14:15:28 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:34:12 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354084 poll_s S   Jun 28 3-14:15:28 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:35:27 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:36:42 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:37:56 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:39:10 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:40:26 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:41:42 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:42:57 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:44:11 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:45:26 CST 2024
oracle    64433      1  19  2.7  0.0 77793632 354188 poll_s S   Jun 28 3-14:15:33 ora_lgwr_anbob1
zzz ***Thu Nov 7 23:46:39 CST 2024
-- lgwr是S[leep]状态等待回复,也匹配上面的event.

# awk '$10=="D"  ||  /zzz/' CLIENT1_ps_24.11.07.2300.dat
zzz ***Thu Nov 7 23:01:13 CST 2024
...
zzz ***Thu Nov 7 23:16:24 CST 2024
root      82938      2  19  0.0  0.0      0     0 msleep D 22:40:57 00:00:01 [kworker/u256:3]
oracle    21134      1  19 17.0  0.0 42393808 114312 blockd D 23:16:23 00:00:00 oraclebs34cdb41 (LOCAL=NO)
oracle    10814      1  19 14.9  0.0 42505936 240960 blockd D 23:10:18 00:00:54 oraclebs34cdb11 (LOCAL=NO)
zzz ***Thu Nov 7 23:17:40 CST 2024
root      82938      2  19  0.0  0.0      0     0 msleep D 22:40:57 00:00:01 [kworker/u256:3]
zzz ***Thu Nov 7 23:18:56 CST 2024
oracle    74554      1  19  4.2  0.3 43866440 1706596 -  D   Jun 28 5-14:00:41 ora_pr00_bs34cdb11
oracle    25868      1  19  0.0  0.0 73848768 37088 SyS_io D 23:18:55 00:00:00 oraclebsrac1 (LOCAL=NO)
zzz ***Thu Nov 7 23:20:11 CST 2024
oracle    51157      1  19  1.6  0.2 73851848 1331188 blockd D   Nov 01 02:31:36 ora_ppa7_bsrac1
root      26937   1571  19  0.7  0.0  44684  2176 -      D 23:19:37 00:00:00 /usr/lib/systemd/systemd-udevd
oracle    27654      1  19  8.5  0.0 73853324 185156 blockd D 23:20:09 00:00:00 oraclebsrac1 (LOCAL=NO)
oracle    27591      1  19 16.0  0.0 73848764 198036 SyS_io D 23:20:08 00:00:00 oraclebsrac1 (LOCAL=NO)
oracle    27581      1  19 22.0  0.0 73848760 153680 SyS_io D 23:20:08 00:00:00 oraclebsrac1 (LOCAL=NO)
oracle    27501      1  19 12.4  0.0 42396304 182932 blockd D 23:20:06 00:00:00 oraclebs34cdb21 (LOCAL=NO)
zzz ***Thu Nov 7 23:21:27 CST 2024
zzz ***Thu Nov 7 23:22:42 CST 2024
zzz ***Thu Nov 7 23:23:57 CST 2024
oracle    61631      1  19  3.8  0.1 42948160 767588 -   D   Sep 21 1-19:39:22 ora_pr00_bs34cdb21
zzz ***Thu Nov 7 23:25:20 CST 2024
oracle    35695      1  19 26.0  0.5 73855724 2891224 xlog_g D 23:24:11 00:00:17 oraclebsrac1 (LOCAL=NO)
oracle    36113      1  19 22.7  0.2 73855520 1526468 xlog_g D 23:24:18 00:00:14 oraclebsrac1 (LOCAL=NO)
oracle    35687      1  19 28.4  0.2 73855520 1506220 xlog_g D 23:24:10 00:00:19 oraclebsrac1 (LOCAL=NO)
oracle    35493      1  19 23.8  0.2 73853276 1337496 xlog_g D 23:24:06 00:00:17 oraclebsrac1 (LOCAL=NO)
oracle    35832      1  19 29.3  0.1 73848632 541832 -   D 23:24:14 00:00:19 oraclebsrac1 (LOCAL=NO)
oracle    30405      1  19 23.5  0.1 42396052 750148 xlog_g D 23:22:04 00:00:46 oraclebs34cdb11 (LOCAL=NO)
oracle    36658      1  19 19.6  0.0 42392980 166732 xlog_g D 23:24:30 00:00:09 oraclebs34cdb11 (LOCAL=NO)
oracle    36564      1  19 23.6  0.0 77781196 110520 xlog_g D 23:24:28 00:00:12 oracleanbob1 (LOCAL=NO)
oracle    36551      1  19 22.4  0.0 26665176 122568 xlog_g D 23:24:27 00:00:11 oracleanbob21 (LOCAL=NO)
oracle    36523      1  19 23.4  0.0 26663128 92188 xlog_g D 23:24:27 00:00:12 oracleanbob21 (LOCAL=NO)
oracle    36497      1  19 23.8  0.0 42390336 114384 xlog_g D 23:24:26 00:00:12 oraclebs34cdb41 (LOCAL=NO)
oracle    36495      1  19 23.3  0.0 42393812 94620 xlog_g D 23:24:26 00:00:12 oraclebs34cdb41 (LOCAL=NO)
oracle    36483      1  19 23.9  0.0 77780792 103672 xlog_g D 23:24:26 00:00:12 oracleanbob1 (LOCAL=NO)
oracle    36475      1  19 23.3  0.0 77781196 40008 xlog_g D 23:24:25 00:00:12 oracleanbob1 (LOCAL=NO)
oracle    36473      1  19 22.8  0.0 26667820 98200 -    D 23:24:25 00:00:12 oracleanbob21 (LOCAL=NO)
oracle    36458      1  19 23.2  0.0 26665176 116920 xlog_g D 23:24:25 00:00:12 oracleanbob21 (LOCAL=NO)
oracle    36452      1  19 24.0  0.0 77779772 37568 -    D 23:24:25 00:00:13 oracleanbob1 (LOCAL=NO)
oracle    36444      1  19 18.8  0.0 42391764 116024 xlog_g D 23:24:25 00:00:10 oraclebs34cdb41 (LOCAL=NO)
oracle    36432      1  19 24.5  0.0 26663128 40608 xlog_g D 23:24:25 00:00:13 oracleanbob21 (LOCAL=NO)
oracle    36430      1  19 24.1  0.0 77783244 217664 xlog_g D 23:24:25 00:00:13 oracleanbob1 (LOCAL=NO)
oracle    36409      1  19 24.6  0.0 77783248 113732 xlog_g D 23:24:24 00:00:13 oracleanbob1 (LOCAL=NO)
oracle    36392      1  19 23.7  0.0 42392156 111316 xlog_g D 23:24:23 00:00:13 oraclebs34cdb41 (LOCAL=NO)
oracle    36356      1  19 25.0  0.0 42391764 44132 xlog_g D 23:24:22 00:00:14 oraclebs34cdb41 (LOCAL=NO)
oracle    36344      1  19 25.1  0.0 42395860 156076 xlog_g D 23:24:22 00:00:14 oraclebs34cdb41 (LOCAL=NO)
oracle    36324      1  19 25.2  0.0 42390332 38860 xlog_g D 23:24:21 00:00:14 oraclebs34cdb41 (LOCAL=NO)
oracle    36320      1  19 23.8  0.0 42392976 115868 xlog_g D 23:24:21 00:00:14 oraclebs34cdb41 (LOCAL=NO)
oracle    36317      1  19 17.9  0.0 42392212 121884 xlog_g D 23:24:21 00:00:10 oraclebs34cdb41 (LOCAL=NO)
oracle    36306      1  19 25.1  0.0 42393804 70948 xlog_g D 23:24:21 00:00:14 oraclebs34cdb41 (LOCAL=NO)
oracle    36296      1  19 25.0  0.0 42390732 39948 xlog_g D 23:24:20 00:00:15 oraclebs34cdb41 (LOCAL=NO)
oracle    36293      1  19 24.8  0.0 42391760 56228 xlog_g D 23:24:20 00:00:14 oraclebs34cdb41 (LOCAL=NO)
oracle    36288      1  19 22.4  0.0 42391952 69296 xlog_g D 23:24:20 00:00:13 oraclebs34cdb41 (LOCAL=NO)
oracle    36267      1  19 24.9  0.0 42393804 144544 xlog_g D 23:24:19 00:00:15 oraclebs34cdb41 (LOCAL=NO)
oracle    36265      1  19 22.0  0.0 42396048 46052 xlog_g D 23:24:19 00:00:13 oraclebs34cdb41 (LOCAL=NO)
oracle    36261      1  19 25.9  0.0 42390332 38044 xlog_g D 23:24:19 00:00:15 oraclebs34cdb41 (LOCAL=NO)
oracle    36259      1  19 17.7  0.0 42394000 125212 xlog_g D 23:24:19 00:00:10 oraclebs34cdb41 (LOCAL=NO)
oracle    36257      1  19 25.8  0.0 42390336 38344 xlog_g D 23:24:18 00:00:16 oraclebs34cdb41 (LOCAL=NO)
oracle    36168      1  19 19.5  0.0 42394256 187616 xlog_g D 23:24:18 00:00:12 oraclebs34cdb41 (LOCAL=NO)
oracle    36073      1  19 26.3  0.0 73847604 42152 xlog_g D 23:24:18 00:00:16 oraclebsrac1 (LOCAL=NO)
oracle    36067      1  19 25.9  0.0 42391760 135416 xlog_g D 23:24:18 00:00:16 oraclebs34cdb41 (LOCAL=NO)
oracle    36008      1  19 25.7  0.0 42391764 42000 xlog_g D 23:24:17 00:00:16 oraclebs34cdb41 (LOCAL=NO)
oracle    36003      1  19 21.6  0.0 73849224 200144 xlog_g D 23:24:17 00:00:13 oraclebsrac1 (LOCAL=NO)
oracle    35997      1  19 26.6  0.0 42390332 78068 xlog_g D 23:24:17 00:00:16 oraclebs34cdb41 (LOCAL=NO)
oracle    35995      1  19 25.3  0.0 42394004 88516 xlog_g D 23:24:17 00:00:15 oraclebs34cdb41 (LOCAL=NO)
oracle    35991      1  19 24.2  0.0 42392212 173728 xlog_g D 23:24:16 00:00:15 oraclebs34cdb21 (LOCAL=NO)
oracle    35983      1  19 26.5  0.0 73847604 41788 xlog_g D 23:24:16 00:00:17 oraclebsrac1 (LOCAL=NO)
oracle    35976      1  19 25.9  0.0 42391760 94300 xlog_g D 23:24:16 00:00:16 oraclebs34cdb41 (LOCAL=NO)
oracle    35968      1  19 24.0  0.0 42394256 165476 xlog_g D 23:24:16 00:00:15 oraclebs34cdb41 (LOCAL=NO)
oracle    35950      1  19 17.8  0.0 42394004 114664 xlog_g D 23:24:16 00:00:11 oraclebs34cdb41 (LOCAL=NO)
oracle    35940      1  19 27.4  0.0 42393812 68828 xlog_g D 23:24:16 00:00:17 oraclebs34cdb41 (LOCAL=NO)
oracle    35930      1  19 27.0  0.0 42391764 188032 xlog_g D 23:24:15 00:00:17 oraclebs34cdb21 (LOCAL=NO)
oracle    35912      1  19 27.0  0.0 73851532 142920 xlog_g D 23:24:15 00:00:17 oraclebsrac1 (LOCAL=NO)
oracle    35902      1  19 27.4  0.0 73849036 44424 xlog_g D 23:24:15 00:00:17 oraclebsrac1 (LOCAL=NO)
oracle    35868      1  19 26.8  0.0 42391760 83480 xlog_g D 23:24:14 00:00:17 oraclebs34cdb21 (LOCAL=NO)
oracle    35848      1  19 27.2  0.0 42390336 100772 xlog_g D 23:24:14 00:00:17 oraclebs34cdb21 (LOCAL=NO)
oracle    35844      1  19 20.0  0.0 42394004 198560 xlog_g D 23:24:14 00:00:13 oraclebs34cdb21 (LOCAL=NO)
...

# sed -n '/23:2/,$p'  CLIENT1_iostat_24.11.07.2300.dat |grep -C 1 avg-cpu
zzz ***Thu Nov 7 23:20:08 CST 2024
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.73    0.00    4.73    3.08    0.00   81.46
--
zzz ***Thu Nov 7 23:21:24 CST 2024
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.56    0.00    2.16    0.04    0.00   94.24
--
zzz ***Thu Nov 7 23:22:39 CST 2024
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.40    0.00    4.53    0.08    0.00   91.00
--
zzz ***Thu Nov 7 23:23:54 CST 2024
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.45    0.00    1.64    0.08    0.00   94.83
--
zzz ***Thu Nov 7 23:25:17 CST 2024
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.89    0.00   75.40    0.02    0.00   20.69              <<<<<<<<<<<<<<<<<< system cpu -- zzz ***Thu Nov 7 23:26:32 CST 2024 avg-cpu: %user %nice %system %iowait %steal %idle 4.42 0.00 75.62 0.12 0.00 19.84 -- zzz ***Thu Nov 7 23:27:47 CST 2024 avg-cpu: %user %nice %system %iowait %steal %idle 3.36 0.00 76.59 0.05 0.00 20.00 -- zzz ***Thu Nov 7 23:29:04 CST 2024 avg-cpu: %user %nice %system %iowait %steal %idle 5.77 0.00 74.86 0.02 0.00 19.35 ... # head -n 7 CLIENT1_iostat_24.11.07.2300.dat ;awk '/^zzz/ || $14>50 ' CLIENT1_iostat_24.11.07.2300.dat
...
zzz ***Thu Nov 7 23:17:37 CST 2024
zzz ***Thu Nov 7 23:18:52 CST 2024
zzz ***Thu Nov 7 23:20:08 CST 2024
zzz ***Thu Nov 7 23:21:24 CST 2024
zzz ***Thu Nov 7 23:22:39 CST 2024
emcpowerd         0.00     0.00  786.00    2.00 284872.00   201.50   723.54     0.81    1.03    1.03    0.50   0.69  54.60
zzz ***Thu Nov 7 23:23:54 CST 2024
emcpowerd         0.00     0.00  822.00    1.00 293992.00    10.00   714.46     0.81    0.99    0.99    0.00   0.66  54.50
zzz ***Thu Nov 7 23:25:17 CST 2024
zzz ***Thu Nov 7 23:26:32 CST 2024
zzz ***Thu Nov 7 23:27:47 CST 2024
zzz ***Thu Nov 7 23:29:04 CST 2024
zzz ***Thu Nov 7 23:30:20 CST 2024

-- disk 并不繁忙

zzz ***Thu Nov 7 23:25:19 CST 2024
top - 23:25:21 up 1132 days, 19:47,  1 user,  load average: 228.71, 88.17, 44.82
Tasks: 5975 total, 187 running, 5787 sleeping,   0 stopped,   1 zombie
%Cpu(s):  4.1 us, 73.9 sy,  0.0 ni, 20.8 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
KiB Mem : 52806067+total, 13078708 free, 19022579+used, 32475619+buff/cache
KiB Swap: 16777212 total,  9238660 free,  7538552 used. 49943820 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 18126 oracle    20   0 40.424g  85984  81096 D 100.0  0.0 261:04.92 ora_lmhb_+
  3791 root      20   0  559488 538456   1256 R  98.4  0.1 200466:01 OSWatcher+
 50494 oracle    20   0 25.423g  20092  16208 S  78.3  0.0  15:04.19 ora_gcr0_+
 78063 root      20   0 9148872 980796 204400 S  59.7  0.2   1793:26 cvd
 10632 oracle    20   0 75.109g 226068 106936 R  51.2  0.0  69:49.58 oracle_10+
 30237 grid      20   0 1496616 248456 239596 R  38.8  0.0   1:31.48 mdb_m001_+
 30343 oracle    20   0 40.427g  58836  53212 D  34.1  0.0   0:45.05 oracle_30+
 30409 oracle    20   0 40.432g  93028  85200 D  34.1  0.0   0:45.08 oracle_30+
 30770 oracle    20   0 40.432g  54204  46388 D  34.1  0.0   0:41.85 oracle_30+
 30954 oracle    20   0 40.430g 127500 120588 R  34.1  0.0   0:39.95 oracle_30+
 31597 oracle    20   0 40.430g  92628  85284 D  34.1  0.0   0:44.41 oracle_31+
 31611 oracle    20   0 40.427g  83404  78556 D  34.1  0.0   0:43.91 oracle_31+
 35344 oracle    20   0 40.430g 194636 186380 D  34.1  0.0   0:19.36 oracle_35+
 35451 oracle    20   0 40.429g 120144 113716 R  34.1  0.0   0:22.43 oracle_35+
 35481 oracle    20   0 70.428g 188832 182148 D  34.1  0.0   0:22.09 oracle_35+
 35647 oracle    20   0 40.430g  85140  76884 D  34.1  0.0   0:20.79 oracle_35+
 35659 oracle    20   0 40.428g  50320  43720 D  34.1  0.0   0:18.25 oracle_35+
 35673 oracle    20   0 70.430g 159488 152744 R  34.1  0.0   0:20.44 oracle_35+
 35705 oracle    20   0 70.429g 179944 173044 R  34.1  0.0   0:20.34 oracle_35+
 35733 oracle    20   0 70.430g 247116 240332 D  34.1  0.0   0:19.20 oracle_35+
 35741 oracle    20   0 40.430g 175840 168356 D  34.1  0.0   0:14.28 oracle_35+
 35788 oracle    20   0 40.427g  38584  33764 D  34.1  0.0   0:18.99 oracle_35+
 35905 oracle    20   0 70.428g 206088 199340 D  34.1  0.0   0:17.22 oracle_35+
 35912 oracle    20   0 70.430g 142920 135696 R  34.1  0.0   0:17.74 oracle_35+
 35995 oracle    20   0 40.430g  88516  80928 R  34.1  0.0   0:16.12 oracle_35+
 36296 oracle    20   0 40.427g  39948  34844 D  34.1  0.0   0:15.16 oracle_36+
 36344 oracle    20   0 40.432g 156076 146832 D  34.1  0.0   0:14.70 oracle_36+
 30415 oracle    20   0 40.428g 113356 106828 D  33.3  0.0   0:44.83 oracle_30+
 30418 oracle    20   0 40.429g 140608 134384 R  33.3  0.0   0:43.92 oracle_30+
 30422 oracle    20   0 40.430g 123492 115764 R  33.3  0.0   0:45.53 oracle_30+
 30533 oracle    20   0 40.427g  46708  41888 R  33.3  0.0   0:44.93 oracle_30+
 30552 oracle    20   0 40.428g  80352  73660 D  33.3  0.0   0:43.09 oracle_30+
 30563 oracle    20   0 40.432g 188572 179856 R  33.3  0.0   0:42.17 oracle_30+
 30612 oracle    20   0 40.428g  87052  80980 D  33.3  0.0   0:42.81 oracle_30+
 30635 oracle    20   0 40.430g 252528 244268 R  33.3  0.0   0:42.07 oracle_30+
 30827 oracle    20   0 40.430g 234660 226300 D  33.3  0.0   0:41.51 oracle_30+
 30887 oracle    20   0 40.427g  80448  74752 D  33.3  0.0   0:41.19 oracle_30+
 31624 oracle    20   0 40.430g 213588 205844 R  33.3  0.0   0:45.05 oracle_31+


# grep reassembles CLIENT1_netstat_24.11.07.2300.dat
    1776713 packet reassembles failed
    1776713 packet reassembles failed
...
    1776713 packet reassembles failed
    1776713 packet reassembles failed

Note:
网络无重启失败.

系统调用函数xlog_gxxx

因为osw里显示的ps 的WCHAN只有6个字符,我们不确认完整函数。在/boot/System.map中没有找到函数,不过操作系统上与xlog相关的当然不是postgresql的WAL日志,而应该是与XFS文件系统相关. 首先确认了现场的环境确实文件系统XFS。 查找xfs中以xlog_g开头的函数 https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c

XFS是基于日志的文件系统,和数据库有点像,也是异步提交事务的,之前会写Log buffer。XFS 为日志维护一个环形缓冲区,在已知的案例中发生过当 XFS 日志缓冲区空间不足时,XFS 文件系统中就会出现死锁。在 while 循环中等待检查​​日志中的可用空间,xlog_grant_ head_check( )、xlog_grant_ head_wait( )。
Another feature of the XFS transaction subsystem is that most transactions are asynchronous. That is, they don’t commit to disk until either a log buffer is filled (a log buffer can hold multiple transactions) or a synchronous operation forces the log buffers holding the transactions to disk. This means that XFS is doing aggregation of transactions in memory – batching them, if you like – to minimise the impact of the log IO on transaction throughput.

如xlog_grant_head_wait() does not return and system hangs (BZ#2007413)和RHEL官方的《System hangs due to xfs lockups 》

Resolution
RHEL7.4: This issue has been resolved via errata RHSA-2017:1842 : Upgrade to kernel-3.10.0-693.el7. The fix is part of RHEL7.4GA and later.
RHEL7.3.z: Update kernel to 3.10.0-514.48.1.el7 (released with RHSA-2018:1129) or later, which contains fixes for this issue.
RHEL7.2.z: Update kernel to 3.10.0-327.66.1.el7 (released with RHSA-2018:1216) or later, which contains fixes for this issue.

Root Cause
Vmcore analysis reveals that the system experienced a deadlock scenario within xfs, due to the use of xfs workqueues.
Basically, the tasks for 2 xfs workqueues end up waiting on each other, deadlocking much of an xfs filesystem and causing file operations to become stuck.

而客户的环境是RHEL 7.3, 在bug的范围.

客户反馈问题频繁重启,环境的变化就是部署了DSG数据同步工具, DSG和XFS有什么关系?本地也没安装dsg客户端程序。 结合前两次的故障,出现过文件系统耗尽,当时产品较大的是ASM Audit trace文件。

ASM AUDIT trace file内容

action 'begin dbms_diskgroup.read(:1, :2, :3, :4); end;'
privilege 'sysasm'
database user 'DSG'

Note:
该环境ASM 审计日志除了,ORACLE软件自有的连接,如 grid用户使用sysasm 有asm agent执行ASM DISKGROUP 状态的启动时间的检查。另外在12C后的FLEX ASM 还会有client user: root 使用CRSD创建的使用CRSUSER__ASM_001 (database user  sysasm role) ACTION “CONNECT“ or “dbms_diskgroup.open/ read” 或asmcmd命令  . client user: oracle使用database instance创建的 ORACLE_001(sysdba role)的查询。 除了这些外最多的就是DSG外部连接从ASM 实例读取归档的操作。 该机制不同于OGG的外部抽取。

为什么产生ASM Audit Files操作系统文件 .aud

无论是否启用数据库审计,Oracle总是审计某些高级用相关的操作,并将其写入操作系统审计文件。这是Oracle预期行为。

AUDIT_SYS_OPERATIONS Set To FALSE Yet Audit Files Are Generated (Doc ID 308066.1)

Users find audit file are generated at Audit_File_Dest location. They can see database instance has the following setting .

1. audit_sys_operations=FALSE
2. audit_file_dest=/oracle10g/10.1.0/rdbms/audit
3. audit_trail=NONE

Although the audit_sys_operations is set to FALSE, the audit files were still been generated audit_file_dest.

出于安全追责考虑,在禁用数据库审计的情况,下面这些操作默认总是将会记录到审计日志(到操作系统上):

  • 1, 使用管理员权限的用户连接实例,如sysoper, sysdba, sysasm
  • 2,   实例启动与关闭

sys audit审计日志的写入路径

-- 11g and before
select name, value , isdefault from v$parameter where name like 'audit%'
NAME                                     VALUE                                    ISDEFAULT
---------------------------------------- ---------------------------------------- ------------------
audit_sys_operations                     FALSE                                    TRUE
audit_file_dest                          /u01/app/11.2.0.4/grid/rdbms/audit       TRUE
audit_syslog_level                                                                TRUE

[grid@11g-node2 ~]$ cd /u01/app/11.2.0.4/grid/rdbms/audit
[grid@11g-node2 audit]$ ls
+ASM2_ora_10000_20240715151144939212143795.aud +ASM2_ora_17311_20240528095529202416143795.aud +ASM2_ora_24696_20240522024529233412143795.aud +ASM2_ora_32194_20240826165648662968143795.aud
+ASM2_ora_10001_20240522200529197414143795.aud +ASM2_ora_1731_20240707231648543005143795.aud +ASM2_ora_24696_20240612033529785180143795.aud +ASM2_ora_32194_20241011015648565985143795.aud
+ASM2_ora_10001_20240707121648576518143795.aud +ASM2_ora_1731_20240828074648631371143795.aud +ASM2_ora_24696_20240710123737456912143795.aud +ASM2_ora_32195_20240401173048276589143795.aud
...
-- 12c
select name, value , isdefault from v$parameter where name like 'audit%'
NAME                                     VALUE                                              ISDEFAULT
---------------------------------------- -------------------------------------------------- ------------------
audit_sys_operations                     TRUE                                               TRUE
audit_file_dest                          /u01/app/12.2.0/grid/rdbms/audit                   TRUE
audit_syslog_level                                                                          TRUE
audit_trail                              NONE                                               TRUE

-- 19c and above
select name, value , isdefault from v$parameter where name like 'audit%'

NAME                                     VALUE                                              ISDEFAULT
---------------------------------------- -------------------------------------------------- ------------------
audit_sys_operations                     TRUE                                               TRUE
audit_file_dest                          /u01/app/19.0.0/grid/rdbms/audit                   TRUE
audit_syslog_level                       LOCAL0.INFO                                        TRUE
audit_trail                              NONE                                               TRUE

[root@db1 ~]# cat /etc/rsyslog.conf
# rsyslog configuration file
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none;local0.none;local1.none;local2.none            /var/log/messages
local0.info                     /var/log/oraasmaudit.log
local1.info                     /var/log/oraiosaudit.log
local2.info                     /var/log/oraapxaudit.log
...

[root@db1 ~]# cd /var/log/
[root@db1 log]# tail oraasmaudit.log
...
Nov 13 20:02:02 db1 journal[2428558]: Oracle Audit[2428558]: LENGTH : '234' ACTION :[7] 'CONNECT' DATABASE USER:[1] '/' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[4]ID:[0] '' SESSIONID:[10] '4294967295' USERHOST:[3] 'db1' CLIENT ADDRESS:[0] '' ACTION NUMBER:[3] '100'
Nov 13 20:02:12 db1 journal[2428769]: Oracle Audit[2428769]: LENGTH : '236' ACTION :[7] 'CONNECT' DATABASE USER:[1] '/' PRIVILEGE :[6] 'SYSDBA' CLIENT USER:[6]DBID:[0] '' SESSIONID:[10] '4294967295' USERHOST:[3] 'db1' CLIENT ADDRESS:[0] '' ACTION NUMBER:[3] '100'
Nov 13 20:03:25 db1 journal[2430576]: Oracle Audit[2430576]: LENGTH : '239' ACTION :[7] 'CONNECT' DATABASE USER:[1] '/' PRIVILEGE :[6] 'SYSASM' CLIENT USER:[4]0' DBID:[0] '' SESSIONID:[10] '4294967295' USERHOST:[3] 'db1' CLIENT ADDRESS:[0] '' ACTION NUMBER:[3] '100'

[root@db1 log]# date
Wed Nov 13 20:03:52 CST 2024

Note:
audit_trail是审计开启后的配置,但是对于sysdaba等管理用户无论是否开启审计,都会写操作系统日志。如果我们无法彻底禁用这类audit日志,像本案例怀疑与XFS文件系统bug相关,又无计划升级操作系统内核时,可以考虑创建一个EXT4文件系统,修改audit_file_dest到EXT4文件系统观察。

注意在19c后如果发现audit_file_dest没有像之前的版本一样创建日志,是因为默认audit_syslog_level中的参数指向了LOCAL0, rsyslog.conf中可以找到文件的指向/var/log/oraasmaudit.log。 从该文件中可以找到审计信息。也算是一种三权分立吧。如果是19c及以后的配置修改audit trace目录可以修改rsyslog.conf配置文件。

小结:

目前从安装和停止DSG进程 与故障出现的频率较为匹配,怀疑是DSG运行过程中增加的ASM instance使用sysasm访问要写audit日志,如果高并发的访问,可能会触发XFS log buffer耗尽或其它相关的与写XFS LOG相关的调用时死锁。system cpu升高,部分数据库进程进入”D”状态, 从而导致数据库后台进程间通信挂起或TIMEOUT,后实例重启。





        
打赏

, , , , ,

对不起,这篇文章暂时关闭评论。