首页 » ORACLE 9i-23ai » Troubleshooting Oracle RAC hang due to DBreplay capture write wcr file on NFS

Troubleshooting Oracle RAC hang due to DBreplay capture write wcr file on NFS

我之前写过一篇关于<Oracle到国产数据库的数据库负载重放(RAT or DBreply)>的文章。在将 Oracle 迁移到国产数据库的过程中,为了进行负载重演,一些国产数据库需要在 Oracle 源库上进行 DBReplay 的 capture。然而,这种操作对源生产环境会带来一定风险,不仅有 license 商务风险,还存在可用性风险,比如空间或性能问题。在之前的文档中,我建议将 capture 产生的 WCR rec 文件存储在高吞吐量且稳定的文件系统设备上。然而还是因为没有足够的重视wcr文件存储,最近还是因为 WCR 文件存放在NFS写入问题,导致了源数据库的全库 hang 死等重大故障。这里简单记录一下这个事件。

数据库负载高告警, 查看活动会话。

SQL>@ase

USERNAME        SID EVENT                MACHINE              MODULE               STATUS                  et SQL_ID          SQL_EXEC_ID WAI_SECINW ROW_WAIT_OBJ# SQLTEXT                        BS          CH# OSUSER        SEQ# HEX
----------- ------- -------------------- -------------------- -------------------- ---------------- --------- --------------- ----------- ---------- ------------- ------------------------------ ---------- ---- ---------- ------- ------------------
anbob          5094 WCR: capture file IO xxxxx-2              JDBC Thin Client     KILLED                3882                             0:3878                -1                                :               sam          12675
anbob         12401 WCR: capture file IO xxxxx-3              JDBC Thin Client     KILLED                3883                             0:3876                -1                                :               sam            912
anbob         21895 WCR: capture file IO xxxxxam-176-57       JDBC Thin Client     ACTIVE                3883                             0:3881           6155337                                :               sam            899
anbob         12606 WCR: capture file IO xxxxx-2              JDBC Thin Client     KILLED                3886                             0:3876           6153649                                :               sam            346
anbob          1804 WCR: capture file IO xxxxxam-176-59       JDBC Thin Client     ACTIVE                3889                             0:3879           6155337                                :               sam            585
anbob         26084 WCR: capture file IO xxxxx-1              JDBC Thin Client     ACTIVE                3889                             0:3881           6155337                                :               sam            559
anbob         27205 WCR: capture file IO xxxxxam-39-92        JDBC Thin Client     ACTIVE                3891                             0:3891           6155337                                :               sam           2906
anbob          6652 WCR: capture file IO xxxxx-2              JDBC Thin Client     KILLED                3898                             0:3877                -1                                :               sam           2787
anbob         12412 WCR: capture file IO xxxxx1               oracle               KILLED                3899                             0:3877                -1                                :                             9212
anbob          1551 WCR: capture file IO xxxxxam-39-92        JDBC Thin Client     ACTIVE                3899                             0:3882           6155337                                :  
...
1598 rows selected.

Note:
超过1000多个活动会话,等待WCR: capture file IO write事件,并没有blocking session,而是在等待IO。 为了尽快恢复业务,第一时间采取kill 会话,停止dbreplay capture。但是OS kill 进程没有释放和stop capture挂起。

检查操作进程状态

$ ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,state,wchan:40,comm,cmd,start_time|awk '{if($10=="D") print $0}'
   561    561 TS       -   0  19 128  0.0 Ds   D wait_on_page_bit                         oracle_561_anbt oracleanbtb2 (LOCAL=NO)     Aug28
   995    995 TS       -   0  19  41  0.4 Ds   D nfs_wait_on_request                      oracle_995_anbt oracleanbtb2 (LOCAL=NO)     Aug28
  1021   1021 TS       -   0  19 127  3.0 Ds   D nfs_wait_on_request                      oracle_1021_anb oracleanbtb2 (LOCAL=NO)     Aug28
  1118   1118 TS       -   0  19 135  0.2 Ds   D wait_on_page_bit                         oracle_1118_anb oracleanbtb2 (LOCAL=NO)     Aug28
  1125   1125 TS       -   0  19 136  1.7 Ds   D nfs_wait_on_request                      oracle_1125_anb oracleanbtb2 (LOCAL=NO)     Aug28
  1166   1166 TS       -   0  19  13  0.0 Ds   D nfs_wait_on_request                      oracle_1166_anb oracleanbtb2 (LOCAL=NO)     17:35
  1169   1169 TS       -   0  19  74  0.0 Ds   D nfs_wait_on_request                      oracle_1169_anb oracleanbtb2 (LOCAL=NO)     18:50
  1349   1349 TS       -   0  19  26  0.1 Ds   D nfs_wait_on_request                      oracle_1349_anb oracleanbtb2 (LOCAL=NO)     18:38
  1446   1446 TS       -   0  19 134  0.0 Ds   D nfs_wait_on_request                      oracle_1446_anb oracleanbtb2 (LOCAL=NO)     Aug28
  1499   1499 TS       -   0  19 133  0.0 Ds   D nfs_wait_on_request                      oracle_1499_anb oracleanbtb2 (LOCAL=NO)     Aug28
  1628   1628 TS       -   0  19 124  0.0 Ds   D wait_on_page_bit                         oracle_1628_anb oracleanbtb2 (LOCAL=NO)     Aug28
  1631   1631 TS       -   0  19  13  0.0 Ds   D nfs_wait_on_request                      oracle_1631_anb oracleanbtb2 (LOCAL=NO)     18:23
  2106   2106 TS       -   0  19 130  0.4 Ds   D nfs_wait_on_request                      oracle_2106_anb oracleanbtb2 (LOCAL=NO)     Aug28
  2165   2165 TS       -   0  19 137  0.0 Ds   D nfs_wait_on_request                      oracle_2165_anb oracleanbtb2 (LOCAL=NO)     Aug28
  2170   2170 TS       -   0  19   9  0.0 Ds   D nfs_wait_on_request                      oracle_2170_anb oracleanbtb2 (LOCAL=NO)     18:41
  2311   2311 TS       -   0  19  91  0.1 Ds   D nfs_wait_on_request                      oracle_2311_anb oracleanbtb2 (LOCAL=NO)     Aug28
  2422   2422 TS       -   0  19  73  0.0 Ds   D nfs_wait_on_request                      oracle_2422_anb oracleanbtb2 (LOCAL=NO)     18:35
  2479   2479 TS       -   0  19 151  0.3 Ds   D nfs_wait_on_request                      oracle_2479_anb oracleanbtb2 (LOCAL=NO)     Aug28
  2598   2598 TS       -   0  19  85  0.1 Ds   D nfs_wait_on_request                      oracle_2598_anb oracleanbtb2 (LOCAL=NO)     Aug28
  2654   2654 TS       -   0  19 129  0.0 Ds   D wait_on_page_bit                         oracle_2654_anb oracleanbtb2 (LOCAL=NO)     Aug28
  2728   2728 TS       -   0  19  68  0.0 Ds   D nfs_wait_on_request                      oracle_2728_anb oracleanbtb2 (LOCAL=NO)     12:52
  3221   3221 TS       -   0  19  78  0.0 Ds   D nfs_wait_on_request                      oracle_3221_anb oracleanbtb2 (LOCAL=NO)     Aug28
  3260   3260 TS       -   0  19 127  0.0 Ds   D nfs_wait_on_request                      oracle_3260_anb oracleanbtb2 (LOCAL=NO)     07:51
  3366   3366 TS       -   0  19  71  0.0 Ds   D nfs_wait_on_request                      oracle_3366_anb oracleanbtb2 (LOCAL=NO)     18:43
  3439   3439 TS       -   0  19  61  0.2 Ds   D nfs_wait_on_request                      oracle_3439_anb oracleanbtb2 (LOCAL=NO)     Aug28
...

Note:
注意当前大量进程是D状态,不可中断状态,当前的call stack是nfs_wait_on_request。 说明是在等NFS I/O确认, 该信息从osw的ps中同样记录。仅wchan可能长度不够无法显示完整函数,当时也可以做perf 记录OS信息。


nfs_wait_on_request & wait_on_page_bit

fs/nfs/pagelist.c
460 /**
461  * nfs_wait_on_request - Wait for a request to complete.
462  * @req: request to wait upon.
463  *
464  * Interruptible by fatal signals only.
465  * The user is responsible for holding a count on the request.
466  */
467 int
468 nfs_wait_on_request(struct nfs_page *req)
469 {
470-->  return wait_on_bit(&req->wb_flags, PG_BUSY,
471             nfs_wait_bit_uninterruptible,
472             TASK_UNINTERRUPTIBLE);
473 }

Note:
nfs_wait_on_request 是在等待NFS I/O 完成确认, 是NFS io slow还是hang 呢?

问题时间点

Note:
问题是在18:51左右开始,显然业务负载是降低了,并非业务量增加,操作系统层可以使用iostat, nfsiostat-sysstat确认。

nfsiostat

在osw 5.1就引入了nfsiostat(nfsiostat-sysstat),因为环境有完善的监控,OSW可以查看历史的i/o情况.

zzz ***Fri Aug 30 18:51:35 CST 2024

xxxx.com:/tmp_xxx  mounted on /wcr:

   op/s         rpc bklog
  67.00    0.00
read:            ops/s             kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000   0.000   0.000       0 (0.0%)    0.000   0.000
write:           ops/s             kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 67.000 20800.832       310.460       0 (0.0%)   17.493  50.194

zzz ***Fri Aug 30 18:51:50 CST 2024

xxxx.com:/tmp_xxx  mounted on /wcr:

   op/s         rpc bklog
   0.00    0.00
read:            ops/s             kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000   0.000   0.000       0 (0.0%)    0.000   0.000
write:           ops/s             kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000   0.000   0.000       0 (0.0%)    0.000   0.000

Note:
NFS write接近为0,说明并非slow,而是彻底不可用。

没有找到相关bug, 仅下面一个已知问题.

RHEL7: One program writing to NFS hangs under nfs_wait_bit_uninterruptible while the rest of NFS remains responsive.
kernel earlier than kernel-3.10.0-327.42.1.el7
在执行 NFS 写入时,在某些情况下,从用户空间到内核的数据复制可能会复制 0 个字节。如果发生这种情况,NFS 客户端层可能会陷入无限期等待 0 长度写入完成的困境,而这种情况永远不会发生。这个问题已经在 NUMA 系统上出现并重现,因此在 NUMA 系统上遇到此问题可能更为常见。此问题的修复方法是确保 NFS 客户端有效跳过 0 长度写入,而不等待其完成。已知以下补丁可以修复此问题

WCR: capture file IO write 等待事件
当启用了Database Replay功能后,Server process将其相关的信息如LOGON、LOGOFF、SQL执行等信息记录存放在PGA中 ,主要是WCR Capture PG、WCR Capture PGA, 当PGA中的历史记录达到一定cache size时,Server Process负责写出这些数据到进程对应的WCR文件中,在写出这些WCR文件时Server Process会进入’WCR: capture file IO write’等待事件。

DBMS_WORKLOAD_CAPTURE.START_CAPTURE(): Starting database capture 
DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE(): Stopped database capture

Oracle Database 23c Enterprise Edition Release 23.0.0.0.0 - Beta
Version 23.1.0.0.0

USERNAME             INST_NAME            HOST_NAME                  I# SID   SERIAL#  VERSION    STARTED  SPID       OPID  CPID            SADDR            PADDR
-------------------- -------------------- ------------------------- --- ----- -------- ---------- -------- ---------- ----- --------------- ---------------- ----------------
SYS                  CDB$ROOT-anbob1       db1                         1 12    13645    23.0.0.0.0 20240815 2312929    144   2312928         00000000C4795900 00000000C43ED168

--相关wait event
SQL> @sed wcr
Show wait event descriptions matching %wcr%..

EVENT# EVENT_NAME                                              WAIT_CLASS           PARAMETER1                PARAMETER2                PARAMETER3                ENQUEUE_NAME                   REQ_REASON                       REQ_DESCRIPTION
------ ------------------------------------------------------- -------------------- ------------------------- ------------------------- ------------------------- ------------------------------ -------------------------------- ----------------------------------------------------------------------------------------------------
  2114 WCR: RAC message context busy                           Other
  2116 WCR: Sync context busy                                  Other
  2115 WCR: capture file IO write                              Other
   556 WCR: replay client notify                               Idle                 who am I
   557 WCR: replay clock                                       Idle                 wait for scn's hi 4 bytes wait for scn's lo 4 bytes schedule capture ID
   558 WCR: replay lock order                                  Application          wait for scn's hi 4 bytes wait for scn's lo 4 bytes
   559 WCR: replay paused                                      Idle
  2118 latch: WCR: processes HT                                Other                address                   number                    why
  2117 latch: WCR: sync                                        Other                address                   number                    why

9 rows selected.

--相关参数
SQL> @pd wcr
Show all parameters and session values from x$ksppi/x$ksppcv...

       NUM N_HEX NAME                                                     VALUE                          DESCRIPTION
---------- ----- -------------------------------------------------------- ------------------------------ ---------------------------------------------------------------------------------------------------
      5982  175E _wcr_control                                             0                              Oracle internal test WCR parameter used ONLY for testing!
      5987  1763 _wcr_seq_cache_size                                      65535                          Oracle internal: Set the replay cache size for WRR$_REPLAY_SEQ_DATA.
      5988  1764 _wcr_grv_cache_size                                      65535                          Oracle internal: Set the replay cache size for WRR$_REPLAY_DATA.
      5989  1765 _wcr_test_action                                         0                              Oracle internal test WCR parameter used for test actions

SQL> @pd capture
Show all parameters and session values from x$ksppi/x$ksppcv...

       NUM N_HEX NAME                                                     VALUE                          DESCRIPTION
---------- ----- -------------------------------------------------------- ------------------------------ ---------------------------------------------------------------------------------------------------
      1076   434 _diag_proc_stack_capture_type                            1                              hung process diagnostic API stack capture type
      3908   F44 _kgl_kqr_cap_so_stacks                                   0                              capture stacks for library and row cache state objects
      4212  1074 _kxscio_cap_stacks                                       FALSE                          capture location when kxscio is set to null
      4409  1139 _use_shared_capture                                      FALSE                          Enable GoldenGate Extracts to use Shared Capture Service
      4424  1148 _pgactx_cap_stacks                                       FALSE                          capture stacks for setting pgactx
      4685  124D _rowsource_profiling_statistics                          TRUE                           if TRUE, Oracle will capture active row sources in v$active_session_history
      4722  1272 _cursor_bind_capture_area_size                           400                            maximum size of the cursor bind capture area
      4723  1273 cursor_bind_capture_destination                          memory+disk                    Allowed destination for captured bind variables
      4725  1275 _cursor_bind_capture_interval                            900                            interval (in seconds) between two bind capture for a cursor
      4891  131B optimizer_capture_sql_plan_baselines                     FALSE                          automatic capture of SQL plan baselines for repeatable statements
      4892  131C optimizer_use_sql_plan_baselines                         TRUE                           use of SQL plan baselines for captured sql statements
      5008  1390 _cdc_subscription_owner                                                                 Change Data Capture subscription_owner
      5021  139D _zonemap_usage_tracking                                  TRUE                           Control the capture of zonemap usage statistics
      5218  1462 _optimizer_inmemory_capture_stored_stats                 TRUE                           optimizer capture and store statistics for in-memory tables
      5274  149A optimizer_capture_sql_quarantine                         FALSE                          enable automatic creation/update of sql quarantine configuration
      5291  14AB _autosts_capture_control                                 3                              internal controls for auto STS capture
      5312  14C0 _kksfra_cap_stacks                                       FALSE                          capture stacks for setting kksfra
      5404  151C _sscr_dir                                                                               Session State Capture and Restore DIRectory object
      5405  151D _sscr_osdir                                                                             Session State Capture and Restore OS DIRectory
      5462  1556 _capture_publisher_flow_control_threshold                0                              Flow control threshold for capture publishers
      5463  1557 _buffered_publisher_flow_control_threshold               0                              Flow control threshold for buffered publishers except capture
      5986  1762 _capture_buffer_size                                     65536                          To set the size of the PGA I/O recording buffers
      5993  1769 _report_capture_cycle_time                               60                             Time (in sec) between two cycles of report capture daemon
      5994  176A _report_capture_dbtime_percent_cutoff                    50                             100X Percent of system db time daemon is allowed over 10 cycles
      5995  176B _report_capture_timeband_length                          1                              Length of time band (in hours) in the reports time bands table
      5996  176C _report_capture_recharge_window                          10                             No of report capture cycles after which db time is recharged
      6041  1799 _diag_conf_cap_enabled                                   TRUE                           Parameter to enable/disable Diag Configuration Capture
      6042  179A _diag_patch_cap_enabled                                  TRUE                           Parameter to enable/disable Diag Patch Configuration Capture
      6333  18BD _upgrade_capture_noops                                   TRUE                           do not skip capturing noops during upgrade
      6335  18BF _capture_pgadep                                          0                              capture statements at this pgadep
      6342  18C6 _enable_cdb_upgrade_capture                              FALSE                          Enable capture of CDB upgrade
      6358  18D6 _enable_ignorable_error_capture                          FALSE                          Capture statement if error is ignorable
      6359  18D7 _disable_marking_noops                                   FALSE                          Don't mark any captured statements as no-ops.
      6367  18DF _enable_noncdb_capture                                   FALSE                          Parameter to enable capture in non-CDB
      6368  18E0 _enable_capture_select                                   FALSE                          Parameter to enable capture of SELECT statements

35 rows selected.


推荐用什么存储作为 capture directory?

A very broad and general recommendation would be that the storage must be decent enough to handle the write throughput and not cause a slow down. Generally it should be of same specification as that used to store the datafiles. With insufficient storage throughput you are likely to see many waits for events such as: ‘WCR: capture file IO write’ when the capture starts to queue up waiting to write to the files.

1) You can’t use ASM for storing the DB capture files.ASM was the only shared directory I had at that point.
2) If you run capture from one instance it will capture the workload of all the instances.

推荐使用和oracle 数据文件相同性能的存储,如果性能不足会出现 ‘WCR: capture file IO write’ 等待,对于RAC集群存在workload capture  file的目录必须位于共享文件系统上,否则start_capture时会报错(我测试直接创建在home下,相同的目录名,但是并未报错)。可以从共享存储划一块ASM disk作为filesystem, NFS给另一节点。

相关跟踪event

-- 打开
SQL> alter session set events 'trace[rat_wcr.*] disk=high';

EXEC DBMS_WORKLOAD_CAPTURE.ADD_FILTER ('filter user1', 'USER', 'anbob');

BEGIN 
dbms_workload_capture.start_capture( 'CAPTURE-anbob-2024' ,DIR=>'[empty  directory]',DURATION=>60); 
END;
/

-- 关闭
SQL> alter session set events 'trace[rat_wcr.*] off';

有IDLE SESSION, FINISH_CAPTURE停不下来?
ONLINE doc

By default, FINISH_CAPTURE waits for 30 seconds to receive a successful acknowledgement from all sessions in the database cluster before timing out.

All sessions that either were in the middle of executing a user request or received a new user request, while FINISH_CAPTURE was waiting for acknowledgements, flush their buffers and send back their acknowledgement to FINISH_CAPTURE.

If a database session remains idle (waiting for the next user request) throughout the duration of FINISH_CAPTURE, the session might have unflushed capture buffers and does not send it’s acknowledgement to FINISH_CAPTURE.

To avoid this, do not have sessions that remain idle (waiting for the next user request) while invoking FINISH_CAPTURE. Either close the database session(s) before running FINISH_CAPTURE or send new database requests to those sessions during FINISH_CAPTURE.

测试

RAC 所有节点创建相同目录  /home/oracle/wcr, --注意并不是共享目录. 但是目录其他节点不存在或目录不为空是不允许的。

-- env Oracle 23c 2-nodes RAC

SQL> create directory wcrpath as '/home/oracle/wcr';
Directory created.

SQL> GRANT WRITE ON DIRECTORY  WCRPATH TO PUBLIC;
Grant succeeded.

SQL> alter session set events 'trace[rat_wcr.*] disk=high';
Session altered.

SQL> EXEC DBMS_WORKLOAD_CAPTURE.ADD_FILTER ('filter user1', 'USER', 'anbob');
PL/SQL procedure successfully completed.

SQL> alter session set events 'trace[rat_wcr.*] off';
Session altered.

SQL> @i

USERNAME             INST_NAME            HOST_NAME                  I# SID   SERIAL#  VERSION    STARTED  SPID       OPID  CPID            SADDR            PADDR
-------------------- -------------------- ------------------------- --- ----- -------- ---------- -------- ---------- ----- --------------- ---------------- ----------------
SYS                  TEST-anbob1           db1                         1 1595  9139     23.0.0.0.0 20240907 4161328    141   4161327         000000010B90B198 00000001085107A8


SQL> @t
TRACEFILE
------------------------------------------------------------------------------------------------------------- 
/u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_4161328.trc

SQL> select sid,username,status,last_call_et from v$session where type='USER';

       SID USERNAME                                                                STATUS   LAST_CALL_ET
---------- ----------------------------------------------------------------------- -------- ------------
      1108 ANBOB                                                                   INACTIVE         1287
      1595 SYS                                                                     ACTIVE              0
      1831 ANBOB                                                                   INACTIVE           43

SQL> select id,name,status from DBA_WORKLOAD_CAPTURES;

        ID NAME                                                                    STATUS
---------- ----------------------------------------------------------------------- ----------------------------------------
         5 CAPTURE2024                                                             IN PROGRESS

SQL> exec DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE();

PL/SQL procedure successfully completed.

SQL> select id,name,status from DBA_WORKLOAD_CAPTURES;

        ID NAME                                                                    STATUS
---------- ----------------------------------------------------------------------- ----------------------------------------
         5 CAPTURE2024                                                             COMPLETED

SQL> select sid,username,status,last_call_et from v$session where type='USER';

       SID USERNAME                                                                STATUS   LAST_CALL_ET
---------- ----------------------------------------------------------------------- -------- ------------
      1108 ANBOB                                                                   INACTIVE         1411 --keep idle
      1595 SYS                                                                     ACTIVE              0
      1831 ANBOB                                                                   INACTIVE          167

SQL>select id,name,status,start_time,end_time,duration_secs  from DBA_WORKLOAD_CAPTURES

        ID NAME                           STATUS                                   START_TIME          END_TIME            DURATION_SECS
---------- ------------------------------ ---------------------------------------- ------------------- ------------------- -------------
         5 CAPTURE2024                    COMPLETED                                2024-09-07 11:10:09 2024-09-07 11:12:06           117

[oracle@db2 wcr]$ tree
.
├── cap
└── capfiles
    └── inst2
        ├── aa
        ├── ab
        ├── ac
        ├── ad
        ├── ae
        ├── af
        ├── ag
        ├── ah
        ├── ai
        └── aj

13 directories, 0 files

[oracle@db1 wcr]$ tree
.
├── cap
│   ├── wcr_cr.html
│   ├── wcr_fcapture.wmd
│   └── wcr_scapture.wmd
└── capfiles
    └── inst1
        ├── aa
        │   ├── wcr_dsd4xh0000000.rec
        │   ├── wcr_dsd4yh0000001.rec
        │   └── wcr_dsd62h0000002.rec
        ├── ab
        ├── ac
        ├── ad
        ├── ae
        ├── af
        ├── ag
        ├── ah
        ├── ai
        └── aj

13 directories, 6 files


[oracle@db1 wcr]$ find . -type f -exec  ls -l {} \;
-rw-rw-rw- 1 oracle asmadmin 171 Sep  7 19:10 ./cap/wcr_scapture.wmd
-rw-rw-rw- 1 oracle asmadmin 279 Sep  7 19:13 ./cap/wcr_fcapture.wmd
-rw-rw-rw- 1 oracle asmadmin 33304 Sep  7 19:13 ./cap/wcr_cr.html
-rw-rw-rw- 1 oracle asmadmin 1970 Sep  7 19:10 ./capfiles/inst1/aa/wcr_dsd4xh0000000.rec
-rw-rw-rw- 1 oracle asmadmin 5848 Sep  7 19:12 ./capfiles/inst1/aa/wcr_dsd62h0000002.rec
-rw-rw-rw- 1 oracle asmadmin 6755 Sep  7 19:12 ./capfiles/inst1/aa/wcr_dsd4yh0000001.rec

Note:
测试RAC 环境 wcr目录不是共享存储,而且一直有session idle状态,FINISH_CAPTURE一样可以停止, 所以本次故障FINISH_CAPTURE中无法中止和server process无法kill -9 一样,原因是在操作系统进程是”D”状态,在系统态,无法干预导致。

小结:

因为NFS 文件系统挂载的远程存储不可写,导致写wcr rec文件hang, 源数据库的server进程hang并且等待WCR: capture file IO write, 断开NFS后,中断capture恢复正常。

打赏

, , ,

对不起,这篇文章暂时关闭评论。