Troubleshooting Oracle RAC hang due to DBreplay capture write wcr file on NFS
我之前写过一篇关于<Oracle到国产数据库的数据库负载重放(RAT or DBreply)>的文章。在将 Oracle 迁移到国产数据库的过程中,为了进行负载重演,一些国产数据库需要在 Oracle 源库上进行 DBReplay 的 capture。然而,这种操作对源生产环境会带来一定风险,不仅有 license 商务风险,还存在可用性风险,比如空间或性能问题。在之前的文档中,我建议将 capture 产生的 WCR rec 文件存储在高吞吐量且稳定的文件系统设备上。然而还是因为没有足够的重视wcr文件存储,最近还是因为 WCR 文件存放在NFS写入问题,导致了源数据库的全库 hang 死等重大故障。这里简单记录一下这个事件。
数据库负载高告警, 查看活动会话。
SQL>@ase USERNAME SID EVENT MACHINE MODULE STATUS et SQL_ID SQL_EXEC_ID WAI_SECINW ROW_WAIT_OBJ# SQLTEXT BS CH# OSUSER SEQ# HEX ----------- ------- -------------------- -------------------- -------------------- ---------------- --------- --------------- ----------- ---------- ------------- ------------------------------ ---------- ---- ---------- ------- ------------------ anbob 5094 WCR: capture file IO xxxxx-2 JDBC Thin Client KILLED 3882 0:3878 -1 : sam 12675 anbob 12401 WCR: capture file IO xxxxx-3 JDBC Thin Client KILLED 3883 0:3876 -1 : sam 912 anbob 21895 WCR: capture file IO xxxxxam-176-57 JDBC Thin Client ACTIVE 3883 0:3881 6155337 : sam 899 anbob 12606 WCR: capture file IO xxxxx-2 JDBC Thin Client KILLED 3886 0:3876 6153649 : sam 346 anbob 1804 WCR: capture file IO xxxxxam-176-59 JDBC Thin Client ACTIVE 3889 0:3879 6155337 : sam 585 anbob 26084 WCR: capture file IO xxxxx-1 JDBC Thin Client ACTIVE 3889 0:3881 6155337 : sam 559 anbob 27205 WCR: capture file IO xxxxxam-39-92 JDBC Thin Client ACTIVE 3891 0:3891 6155337 : sam 2906 anbob 6652 WCR: capture file IO xxxxx-2 JDBC Thin Client KILLED 3898 0:3877 -1 : sam 2787 anbob 12412 WCR: capture file IO xxxxx1 oracle KILLED 3899 0:3877 -1 : 9212 anbob 1551 WCR: capture file IO xxxxxam-39-92 JDBC Thin Client ACTIVE 3899 0:3882 6155337 : ... 1598 rows selected.
Note:
超过1000多个活动会话,等待WCR: capture file IO write事件,并没有blocking session,而是在等待IO。 为了尽快恢复业务,第一时间采取kill 会话,停止dbreplay capture。但是OS kill 进程没有释放和stop capture挂起。
检查操作进程状态
$ ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,state,wchan:40,comm,cmd,start_time|awk '{if($10=="D") print $0}' 561 561 TS - 0 19 128 0.0 Ds D wait_on_page_bit oracle_561_anbt oracleanbtb2 (LOCAL=NO) Aug28 995 995 TS - 0 19 41 0.4 Ds D nfs_wait_on_request oracle_995_anbt oracleanbtb2 (LOCAL=NO) Aug28 1021 1021 TS - 0 19 127 3.0 Ds D nfs_wait_on_request oracle_1021_anb oracleanbtb2 (LOCAL=NO) Aug28 1118 1118 TS - 0 19 135 0.2 Ds D wait_on_page_bit oracle_1118_anb oracleanbtb2 (LOCAL=NO) Aug28 1125 1125 TS - 0 19 136 1.7 Ds D nfs_wait_on_request oracle_1125_anb oracleanbtb2 (LOCAL=NO) Aug28 1166 1166 TS - 0 19 13 0.0 Ds D nfs_wait_on_request oracle_1166_anb oracleanbtb2 (LOCAL=NO) 17:35 1169 1169 TS - 0 19 74 0.0 Ds D nfs_wait_on_request oracle_1169_anb oracleanbtb2 (LOCAL=NO) 18:50 1349 1349 TS - 0 19 26 0.1 Ds D nfs_wait_on_request oracle_1349_anb oracleanbtb2 (LOCAL=NO) 18:38 1446 1446 TS - 0 19 134 0.0 Ds D nfs_wait_on_request oracle_1446_anb oracleanbtb2 (LOCAL=NO) Aug28 1499 1499 TS - 0 19 133 0.0 Ds D nfs_wait_on_request oracle_1499_anb oracleanbtb2 (LOCAL=NO) Aug28 1628 1628 TS - 0 19 124 0.0 Ds D wait_on_page_bit oracle_1628_anb oracleanbtb2 (LOCAL=NO) Aug28 1631 1631 TS - 0 19 13 0.0 Ds D nfs_wait_on_request oracle_1631_anb oracleanbtb2 (LOCAL=NO) 18:23 2106 2106 TS - 0 19 130 0.4 Ds D nfs_wait_on_request oracle_2106_anb oracleanbtb2 (LOCAL=NO) Aug28 2165 2165 TS - 0 19 137 0.0 Ds D nfs_wait_on_request oracle_2165_anb oracleanbtb2 (LOCAL=NO) Aug28 2170 2170 TS - 0 19 9 0.0 Ds D nfs_wait_on_request oracle_2170_anb oracleanbtb2 (LOCAL=NO) 18:41 2311 2311 TS - 0 19 91 0.1 Ds D nfs_wait_on_request oracle_2311_anb oracleanbtb2 (LOCAL=NO) Aug28 2422 2422 TS - 0 19 73 0.0 Ds D nfs_wait_on_request oracle_2422_anb oracleanbtb2 (LOCAL=NO) 18:35 2479 2479 TS - 0 19 151 0.3 Ds D nfs_wait_on_request oracle_2479_anb oracleanbtb2 (LOCAL=NO) Aug28 2598 2598 TS - 0 19 85 0.1 Ds D nfs_wait_on_request oracle_2598_anb oracleanbtb2 (LOCAL=NO) Aug28 2654 2654 TS - 0 19 129 0.0 Ds D wait_on_page_bit oracle_2654_anb oracleanbtb2 (LOCAL=NO) Aug28 2728 2728 TS - 0 19 68 0.0 Ds D nfs_wait_on_request oracle_2728_anb oracleanbtb2 (LOCAL=NO) 12:52 3221 3221 TS - 0 19 78 0.0 Ds D nfs_wait_on_request oracle_3221_anb oracleanbtb2 (LOCAL=NO) Aug28 3260 3260 TS - 0 19 127 0.0 Ds D nfs_wait_on_request oracle_3260_anb oracleanbtb2 (LOCAL=NO) 07:51 3366 3366 TS - 0 19 71 0.0 Ds D nfs_wait_on_request oracle_3366_anb oracleanbtb2 (LOCAL=NO) 18:43 3439 3439 TS - 0 19 61 0.2 Ds D nfs_wait_on_request oracle_3439_anb oracleanbtb2 (LOCAL=NO) Aug28 ...
Note:
注意当前大量进程是D状态,不可中断状态,当前的call stack是nfs_wait_on_request。 说明是在等NFS I/O确认, 该信息从osw的ps中同样记录。仅wchan可能长度不够无法显示完整函数,当时也可以做perf 记录OS信息。
nfs_wait_on_request & wait_on_page_bit
fs/nfs/pagelist.c 460 /** 461 * nfs_wait_on_request - Wait for a request to complete. 462 * @req: request to wait upon. 463 * 464 * Interruptible by fatal signals only. 465 * The user is responsible for holding a count on the request. 466 */ 467 int 468 nfs_wait_on_request(struct nfs_page *req) 469 { 470--> return wait_on_bit(&req->wb_flags, PG_BUSY, 471 nfs_wait_bit_uninterruptible, 472 TASK_UNINTERRUPTIBLE); 473 }
Note:
nfs_wait_on_request 是在等待NFS I/O 完成确认, 是NFS io slow还是hang 呢?
问题时间点
Note:
问题是在18:51左右开始,显然业务负载是降低了,并非业务量增加,操作系统层可以使用iostat, nfsiostat-sysstat确认。
nfsiostat
在osw 5.1就引入了nfsiostat(nfsiostat-sysstat),因为环境有完善的监控,OSW可以查看历史的i/o情况.
zzz ***Fri Aug 30 18:51:35 CST 2024 xxxx.com:/tmp_xxx mounted on /wcr: op/s rpc bklog 67.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 67.000 20800.832 310.460 0 (0.0%) 17.493 50.194 zzz ***Fri Aug 30 18:51:50 CST 2024 xxxx.com:/tmp_xxx mounted on /wcr: op/s rpc bklog 0.00 0.00 read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000 write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms) 0.000 0.000 0.000 0 (0.0%) 0.000 0.000
Note:
NFS write接近为0,说明并非slow,而是彻底不可用。
没有找到相关bug, 仅下面一个已知问题.
RHEL7: One program writing to NFS hangs under nfs_wait_bit_uninterruptible while the rest of NFS remains responsive.
kernel earlier than kernel-3.10.0-327.42.1.el7
在执行 NFS 写入时,在某些情况下,从用户空间到内核的数据复制可能会复制 0 个字节。如果发生这种情况,NFS 客户端层可能会陷入无限期等待 0 长度写入完成的困境,而这种情况永远不会发生。这个问题已经在 NUMA 系统上出现并重现,因此在 NUMA 系统上遇到此问题可能更为常见。此问题的修复方法是确保 NFS 客户端有效跳过 0 长度写入,而不等待其完成。已知以下补丁可以修复此问题
WCR: capture file IO write 等待事件
当启用了Database Replay功能后,Server process将其相关的信息如LOGON、LOGOFF、SQL执行等信息记录存放在PGA中 ,主要是WCR Capture PG、WCR Capture PGA, 当PGA中的历史记录达到一定cache size时,Server Process负责写出这些数据到进程对应的WCR文件中,在写出这些WCR文件时Server Process会进入’WCR: capture file IO write’等待事件。
DBMS_WORKLOAD_CAPTURE.START_CAPTURE(): Starting database capture
DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE(): Stopped database capture
Oracle Database 23c Enterprise Edition Release 23.0.0.0.0 - Beta
Version 23.1.0.0.0
USERNAME INST_NAME HOST_NAME I# SID SERIAL# VERSION STARTED SPID OPID CPID SADDR PADDR
-------------------- -------------------- ------------------------- --- ----- -------- ---------- -------- ---------- ----- --------------- ---------------- ----------------
SYS CDB$ROOT-anbob1 db1 1 12 13645 23.0.0.0.0 20240815 2312929 144 2312928 00000000C4795900 00000000C43ED168
--相关wait event
SQL> @sed wcr
Show wait event descriptions matching %wcr%..
EVENT# EVENT_NAME WAIT_CLASS PARAMETER1 PARAMETER2 PARAMETER3 ENQUEUE_NAME REQ_REASON REQ_DESCRIPTION
------ ------------------------------------------------------- -------------------- ------------------------- ------------------------- ------------------------- ------------------------------ -------------------------------- ----------------------------------------------------------------------------------------------------
2114 WCR: RAC message context busy Other
2116 WCR: Sync context busy Other
2115 WCR: capture file IO write Other
556 WCR: replay client notify Idle who am I
557 WCR: replay clock Idle wait for scn's hi 4 bytes wait for scn's lo 4 bytes schedule capture ID
558 WCR: replay lock order Application wait for scn's hi 4 bytes wait for scn's lo 4 bytes
559 WCR: replay paused Idle
2118 latch: WCR: processes HT Other address number why
2117 latch: WCR: sync Other address number why
9 rows selected.
--相关参数
SQL> @pd wcr
Show all parameters and session values from x$ksppi/x$ksppcv...
NUM N_HEX NAME VALUE DESCRIPTION
---------- ----- -------------------------------------------------------- ------------------------------ ---------------------------------------------------------------------------------------------------
5982 175E _wcr_control 0 Oracle internal test WCR parameter used ONLY for testing!
5987 1763 _wcr_seq_cache_size 65535 Oracle internal: Set the replay cache size for WRR$_REPLAY_SEQ_DATA.
5988 1764 _wcr_grv_cache_size 65535 Oracle internal: Set the replay cache size for WRR$_REPLAY_DATA.
5989 1765 _wcr_test_action 0 Oracle internal test WCR parameter used for test actions
SQL> @pd capture
Show all parameters and session values from x$ksppi/x$ksppcv...
NUM N_HEX NAME VALUE DESCRIPTION
---------- ----- -------------------------------------------------------- ------------------------------ ---------------------------------------------------------------------------------------------------
1076 434 _diag_proc_stack_capture_type 1 hung process diagnostic API stack capture type
3908 F44 _kgl_kqr_cap_so_stacks 0 capture stacks for library and row cache state objects
4212 1074 _kxscio_cap_stacks FALSE capture location when kxscio is set to null
4409 1139 _use_shared_capture FALSE Enable GoldenGate Extracts to use Shared Capture Service
4424 1148 _pgactx_cap_stacks FALSE capture stacks for setting pgactx
4685 124D _rowsource_profiling_statistics TRUE if TRUE, Oracle will capture active row sources in v$active_session_history
4722 1272 _cursor_bind_capture_area_size 400 maximum size of the cursor bind capture area
4723 1273 cursor_bind_capture_destination memory+disk Allowed destination for captured bind variables
4725 1275 _cursor_bind_capture_interval 900 interval (in seconds) between two bind capture for a cursor
4891 131B optimizer_capture_sql_plan_baselines FALSE automatic capture of SQL plan baselines for repeatable statements
4892 131C optimizer_use_sql_plan_baselines TRUE use of SQL plan baselines for captured sql statements
5008 1390 _cdc_subscription_owner Change Data Capture subscription_owner
5021 139D _zonemap_usage_tracking TRUE Control the capture of zonemap usage statistics
5218 1462 _optimizer_inmemory_capture_stored_stats TRUE optimizer capture and store statistics for in-memory tables
5274 149A optimizer_capture_sql_quarantine FALSE enable automatic creation/update of sql quarantine configuration
5291 14AB _autosts_capture_control 3 internal controls for auto STS capture
5312 14C0 _kksfra_cap_stacks FALSE capture stacks for setting kksfra
5404 151C _sscr_dir Session State Capture and Restore DIRectory object
5405 151D _sscr_osdir Session State Capture and Restore OS DIRectory
5462 1556 _capture_publisher_flow_control_threshold 0 Flow control threshold for capture publishers
5463 1557 _buffered_publisher_flow_control_threshold 0 Flow control threshold for buffered publishers except capture
5986 1762 _capture_buffer_size 65536 To set the size of the PGA I/O recording buffers
5993 1769 _report_capture_cycle_time 60 Time (in sec) between two cycles of report capture daemon
5994 176A _report_capture_dbtime_percent_cutoff 50 100X Percent of system db time daemon is allowed over 10 cycles
5995 176B _report_capture_timeband_length 1 Length of time band (in hours) in the reports time bands table
5996 176C _report_capture_recharge_window 10 No of report capture cycles after which db time is recharged
6041 1799 _diag_conf_cap_enabled TRUE Parameter to enable/disable Diag Configuration Capture
6042 179A _diag_patch_cap_enabled TRUE Parameter to enable/disable Diag Patch Configuration Capture
6333 18BD _upgrade_capture_noops TRUE do not skip capturing noops during upgrade
6335 18BF _capture_pgadep 0 capture statements at this pgadep
6342 18C6 _enable_cdb_upgrade_capture FALSE Enable capture of CDB upgrade
6358 18D6 _enable_ignorable_error_capture FALSE Capture statement if error is ignorable
6359 18D7 _disable_marking_noops FALSE Don't mark any captured statements as no-ops.
6367 18DF _enable_noncdb_capture FALSE Parameter to enable capture in non-CDB
6368 18E0 _enable_capture_select FALSE Parameter to enable capture of SELECT statements
35 rows selected.
推荐用什么存储作为 capture directory?
A very broad and general recommendation would be that the storage must be decent enough to handle the write throughput and not cause a slow down. Generally it should be of same specification as that used to store the datafiles. With insufficient storage throughput you are likely to see many waits for events such as: ‘WCR: capture file IO write’ when the capture starts to queue up waiting to write to the files.
推荐使用和oracle 数据文件相同性能的存储,如果性能不足会出现 ‘WCR: capture file IO write’ 等待,对于RAC集群存在workload capture file的目录必须位于共享文件系统上,否则start_capture时会报错(我测试直接创建在home下,相同的目录名,但是并未报错)。可以从共享存储划一块ASM disk作为filesystem, NFS给另一节点。
相关跟踪event
-- 打开 SQL> alter session set events 'trace[rat_wcr.*] disk=high'; EXEC DBMS_WORKLOAD_CAPTURE.ADD_FILTER ('filter user1', 'USER', 'anbob'); BEGIN dbms_workload_capture.start_capture( 'CAPTURE-anbob-2024' ,DIR=>'[empty directory]',DURATION=>60); END; / -- 关闭 SQL> alter session set events 'trace[rat_wcr.*] off';
有IDLE SESSION, FINISH_CAPTURE停不下来?
ONLINE doc
By default, FINISH_CAPTURE waits for 30 seconds to receive a successful acknowledgement from all sessions in the database cluster before timing out.
All sessions that either were in the middle of executing a user request or received a new user request, while FINISH_CAPTURE was waiting for acknowledgements, flush their buffers and send back their acknowledgement to FINISH_CAPTURE.
If a database session remains idle (waiting for the next user request) throughout the duration of FINISH_CAPTURE, the session might have unflushed capture buffers and does not send it’s acknowledgement to FINISH_CAPTURE.
To avoid this, do not have sessions that remain idle (waiting for the next user request) while invoking FINISH_CAPTURE. Either close the database session(s) before running FINISH_CAPTURE or send new database requests to those sessions during FINISH_CAPTURE.
测试
RAC 所有节点创建相同目录 /home/oracle/wcr, --注意并不是共享目录. 但是目录其他节点不存在或目录不为空是不允许的。 -- env Oracle 23c 2-nodes RAC SQL> create directory wcrpath as '/home/oracle/wcr'; Directory created. SQL> GRANT WRITE ON DIRECTORY WCRPATH TO PUBLIC; Grant succeeded. SQL> alter session set events 'trace[rat_wcr.*] disk=high'; Session altered. SQL> EXEC DBMS_WORKLOAD_CAPTURE.ADD_FILTER ('filter user1', 'USER', 'anbob'); PL/SQL procedure successfully completed. SQL> alter session set events 'trace[rat_wcr.*] off'; Session altered. SQL> @i USERNAME INST_NAME HOST_NAME I# SID SERIAL# VERSION STARTED SPID OPID CPID SADDR PADDR -------------------- -------------------- ------------------------- --- ----- -------- ---------- -------- ---------- ----- --------------- ---------------- ---------------- SYS TEST-anbob1 db1 1 1595 9139 23.0.0.0.0 20240907 4161328 141 4161327 000000010B90B198 00000001085107A8 SQL> @t TRACEFILE ------------------------------------------------------------------------------------------------------------- /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_4161328.trc SQL> select sid,username,status,last_call_et from v$session where type='USER'; SID USERNAME STATUS LAST_CALL_ET ---------- ----------------------------------------------------------------------- -------- ------------ 1108 ANBOB INACTIVE 1287 1595 SYS ACTIVE 0 1831 ANBOB INACTIVE 43 SQL> select id,name,status from DBA_WORKLOAD_CAPTURES; ID NAME STATUS ---------- ----------------------------------------------------------------------- ---------------------------------------- 5 CAPTURE2024 IN PROGRESS SQL> exec DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE(); PL/SQL procedure successfully completed. SQL> select id,name,status from DBA_WORKLOAD_CAPTURES; ID NAME STATUS ---------- ----------------------------------------------------------------------- ---------------------------------------- 5 CAPTURE2024 COMPLETED SQL> select sid,username,status,last_call_et from v$session where type='USER'; SID USERNAME STATUS LAST_CALL_ET ---------- ----------------------------------------------------------------------- -------- ------------ 1108 ANBOB INACTIVE 1411 --keep idle 1595 SYS ACTIVE 0 1831 ANBOB INACTIVE 167 SQL>select id,name,status,start_time,end_time,duration_secs from DBA_WORKLOAD_CAPTURES ID NAME STATUS START_TIME END_TIME DURATION_SECS ---------- ------------------------------ ---------------------------------------- ------------------- ------------------- ------------- 5 CAPTURE2024 COMPLETED 2024-09-07 11:10:09 2024-09-07 11:12:06 117 [oracle@db2 wcr]$ tree . ├── cap └── capfiles └── inst2 ├── aa ├── ab ├── ac ├── ad ├── ae ├── af ├── ag ├── ah ├── ai └── aj 13 directories, 0 files [oracle@db1 wcr]$ tree . ├── cap │ ├── wcr_cr.html │ ├── wcr_fcapture.wmd │ └── wcr_scapture.wmd └── capfiles └── inst1 ├── aa │ ├── wcr_dsd4xh0000000.rec │ ├── wcr_dsd4yh0000001.rec │ └── wcr_dsd62h0000002.rec ├── ab ├── ac ├── ad ├── ae ├── af ├── ag ├── ah ├── ai └── aj 13 directories, 6 files [oracle@db1 wcr]$ find . -type f -exec ls -l {} \; -rw-rw-rw- 1 oracle asmadmin 171 Sep 7 19:10 ./cap/wcr_scapture.wmd -rw-rw-rw- 1 oracle asmadmin 279 Sep 7 19:13 ./cap/wcr_fcapture.wmd -rw-rw-rw- 1 oracle asmadmin 33304 Sep 7 19:13 ./cap/wcr_cr.html -rw-rw-rw- 1 oracle asmadmin 1970 Sep 7 19:10 ./capfiles/inst1/aa/wcr_dsd4xh0000000.rec -rw-rw-rw- 1 oracle asmadmin 5848 Sep 7 19:12 ./capfiles/inst1/aa/wcr_dsd62h0000002.rec -rw-rw-rw- 1 oracle asmadmin 6755 Sep 7 19:12 ./capfiles/inst1/aa/wcr_dsd4yh0000001.rec
Note:
测试RAC 环境 wcr目录不是共享存储,而且一直有session idle状态,FINISH_CAPTURE一样可以停止, 所以本次故障FINISH_CAPTURE中无法中止和server process无法kill -9 一样,原因是在操作系统进程是”D”状态,在系统态,无法干预导致。
小结:
因为NFS 文件系统挂载的远程存储不可写,导致写wcr rec文件hang, 源数据库的server进程hang并且等待WCR: capture file IO write, 断开NFS后,中断capture恢复正常。
对不起,这篇文章暂时关闭评论。