Troubleshooting Oracle 19c cascade Dataguard Gap ORA-03135: connection lost contact
最近客户一套oracle 19.14 standalone Database做的cascade dataguard环境,暂且认为是A->B->C三台单实例, 但总是A->B的延迟,从oracle 12c后引入real time cascade,所以如果依赖该特性对延迟要求较高, 分析A->B延迟发现,B库总时会出现GAP,并且未自动FAL,需要人为干预, 并且主库alert日志报错出现:
ORA-03135: connection lost contact
TT02 (PID:64459): Error 3135 for LNO:3 to“xxx”。
术语:
主库 A: primary
备库 B: standby, cascading standby
备库 C: cascaded standby, terminal standby
# primary db alert log
Thread 1 advanced to log sequence 1850 (LGWR switch), current SCN: 526688858
Current log# 5 seq# 1850 mem# 0: /data/oradata/dbpri/redo05.log
ARC2 (PID:64455): Archived Log entry 3643 added for T-1.S-1849 ID 0x7f4d8988 LAD:1
2023-09-27 17:00:37.525000 +08:00
ORA-16040: Destination archive log file is locked.
2023-09-27 17:03:38.360000 +08:00
Thread 1 advanced to log sequence 1851 (LGWR switch), current SCN: 526722565
Current log# 1 seq# 1851 mem# 0: /data/oradata/dbpri/redo01.log
ARC2 (PID:64455): Archived Log entry 3645 added for T-1.S-1850 ID 0x7f4d8988 LAD:1
2023-09-27 17:06:46.382000 +08:00
Thread 1 advanced to log sequence 1852 (LGWR switch), current SCN: 526725568
Current log# 2 seq# 1852 mem# 0: /data/oradata/dbpri/redo02.log
ARC3 (PID:64457): Archived Log entry 3647 added for T-1.S-1851 ID 0x7f4d8988 LAD:1
2023-09-27 17:12:47.132000 +08:00
TT02 (PID:55751): Attempting LAD:2 network reconnect (3135)
TT02 (PID:55751): LAD:2 network reconnect abandoned
Errors in file /u01/app/oracle/diag/rdbms/dbpri/dbpri/trace/dbpri_tt02_55751.trc:
ORA-03135: connection lost contact
TT02 (PID:55751): Error 3135 for LNO:2 to 'dbpri_dg'
Errors in file /u01/app/oracle/diag/rdbms/dbpri/dbpri/trace/dbpri_tt02_55751.trc:
ORA-03135: connection lost contact
Errors in file /u01/app/oracle/diag/rdbms/dbpri/dbpri/trace/dbpri_tt02_55751.trc:
ORA-03135: connection lost contact
2023-09-27 17:13:45.154000 +08:00
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=MEMORY SID='*';
# standby db alert log
2023-09-27 17:07:53.208000 +08:00
rfs (PID:54402): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:55751)
rfs (PID:54402): Primary database is in MAXIMUM PERFORMANCE mode
rfs (PID:54402): Selected LNO:21 for T-1.S-1852 dbid 2135749111 branch 1132752761
2023-09-27 17:08:40.846000 +08:00
rfs (PID:49769): Selected LNO:23 for T-1.S-1851 dbid 2135749111 branch 1132752761
ARC0 (PID:39255): Archived Log entry 2384 added for T-1.S-1851 ID 0x7f4d8988 LAD:1
PR00 (PID:39609): Media Recovery Waiting for T-1.S-1852 (in transit)
Recovery of Online Redo Log: Thread 1 Group 21 Seq 1852 Reading mem 0
Mem# 0: /data/oradata/dbpri/std_redo21.log
2023-09-27 17:12:42.821000 +08:00
rfs (PID:54870): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is Foreground (PID:29595)
rfs (PID:54872): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:20495)
rfs (PID:54872): Primary database is in MAXIMUM PERFORMANCE mode
2023-09-27 17:12:45.268000 +08:00
rfs (PID:54872): Killing 1 processes (PIDS:54402) (RFS live redo by DBID/thread for ASYNC) in order to receive ASYNC. Requested by OS process 54872
Process termination requested for pid 54402 [source = rdbms], [info = 2] [request issued by pid: 54872, uid: 10000]
2023-09-27 17:12:47.318000 +08:00
rfs (PID:54872): Selected LNO:21 for T-1.S-1852 dbid 2135749111 branch 1132752761
2023-09-27 17:13:40.967000 +08:00
rfs (PID:54956): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is Foreground (PID:64447)
rfs (PID:54958): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:56380)
rfs (PID:54958): Primary database is in MAXIMUM PERFORMANCE mode
2023-09-27 17:13:43.100000 +08:00
rfs (PID:54958): Killing 1 processes (PIDS:54872) (RFS live redo by DBID/thread for ASYNC) in order to receive ASYNC. Requested by OS process 54958
Process termination requested for pid 54872 [source = rdbms], [info = 2] [request issued by pid: 54958, uid: 10000]
2023-09-27 17:13:45.116000 +08:00
rfs (PID:54958): Selected LNO:21 for T-1.S-1852 dbid 2135749111 branch 1132752761
TT00 (PID:39257): Error ORA-235 occurred during an un-locked control file
TT00 (PID:39257): transaction. This error can be ignored. The control
TT00 (PID:39257): file transaction will be retried.
2023-09-27 17:16:44.736000 +08:00
PR00 (PID:39609): Media Recovery Waiting for T-1.S-1853
ARC2 (PID:39263): Archived Log entry 2387 added for T-1.S-1852 ID 0x7f4d8988 LAD:1
rfs (PID:55195): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:55751)
rfs (PID:55195): Primary database is in MAXIMUM PERFORMANCE mode
rfs (PID:55195): Selected LNO:21 for T-1.S-1853 dbid 2135749111 branch 1132752761
TT02 (PID:41027): Cannot find SRL for T-1.S-1853
2023-09-27 17:16:45.745000 +08:00
Recovery of Online Redo Log: Thread 1 Group 21 Seq 1853 Reading mem 0
Mem# 0: /data/oradata/ABTADB/std_redo21.log
2023-09-27 17:18:48.397000 +08:00
rfs (PID:55361): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is Foreground (PID:29595)
rfs (PID:55363): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:20495)
rfs (PID:55363): Primary database is in MAXIMUM PERFORMANCE mode
2023-09-27 17:18:50.837000 +08:00
rfs (PID:55363): Killing 1 processes (PIDS:55195) (RFS live redo by DBID/thread for ASYNC) in order to receive ASYNC. Requested by OS process 55363
Process termination requested for pid 55195 [source = rdbms], [info = 2] [request issued by pid: 55363, uid: 10000]
2023-09-27 17:18:52.886000 +08:00
rfs (PID:55363): Selected LNO:21 for T-1.S-1853 dbid 2135749111 branch 1132752761
2023-09-27 17:19:45.251000 +08:00
rfs (PID:55440): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is Foreground (PID:64447)
rfs (PID:55442): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:57052)
rfs (PID:55442): Primary database is in MAXIMUM PERFORMANCE mode
2023-09-27 17:19:47.384000 +08:00
rfs (PID:55442): Killing 1 processes (PIDS:55363) (RFS live redo by DBID/thread for ASYNC) in order to receive ASYNC. Requested by OS process 55442
Process termination requested for pid 55363 [source = rdbms], [info = 2] [request issued by pid: 55442, uid: 10000]
2023-09-27 17:19:49.399000 +08:00
rfs (PID:55442): Selected LNO:21 for T-1.S-1853 dbid 2135749111 branch 1132752761
TT00 (PID:39257): Error ORA-235 occurred during an un-locked control file
TT00 (PID:39257): transaction. This error can be ignored. The control
TT00 (PID:39257): file transaction will be retried.
2023-09-27 17:22:57.141000 +08:00
ALTER SYSTEM SET service_names='abtadb_dg' SCOPE=BOTH;
Note:
B库会自动kill进程,A库出现ora-3135 , TT正在switch的进程中断,这中情况后就会产生sequence# 1852 的archivelog gap,并且auto FAL 并不会去主库拉,但后续sequence #1853会继续接受。对于缺失的archive log 需要人为处理如#1852。
TT trace file
*** 2023-09-20T15:50:28.159364+08:00 Log read is SYNCHRONOUS though disk_asynch_io is enabled! krr_seed_readahead:1Async IO is not on. Not seeding any prefetch. krrxflg 0x1893. *** 2023-09-20T15:56:16.797785+08:00 krsb_stream_send: Error 3135 sending stream IOV krsu_upi_status: RFS network connection lost at host 'abtadb_dg', error=3135 krsb_stream_dispatch: No active stream for remote target:2, returning krsu_upi_status: Error 3135 writing archive log file at host 'abtadb_dg' *** 2023-09-20 15:56:16.798013 [krsh.c:6364] Attempting LAD:2 network reconnect (3135) *** 2023-09-20 15:56:16.798051 [krsh.c:6364] LAD:2 network reconnect abandoned at 0x7fff56a37b60 placed krsl.c@6809 ORA-03135: connection lost contact *** 2023-09-20 15:56:16.798154 [krsh.c:6364] Error 3135 for LNO:1 to 'abtadb_dg' *** 2023-09-20 15:56:16.798168 [krsi.c:8745] krsi_dst_fail_caller: LAD:2 err:3135 force:0 blast:1 at 0x7fff56a38fc0 placed krsw.c@9233 ORA-03135: connection lost contact at 0x7fff56a38fc0 placed krsw.c@9233 ORA-03135: connection lost contact
通常建议配置
# pri
SQL> ALTER SYSTEM SET log_archive_config='dg_config=(ORCL_PRIM,ORCL_STBY1,ORCL_STBY2)' SCOPE=SPFILE; SQL> ALTER SYSTEM SET log_archive_dest_1='LOCATION=/oradb/app/oracle/archive/orcl_prim valid_for=(all_logfiles,all_roles) db_unique_name=ORCL_PRIM' SCOPE=SPFILE; SQL> ALTER SYSTEM SET log_archive_dest_2='service=ORCL_STBY1 async valid_for=(online_logfiles,primary_role) db_unique_name=ORCL_STBY1' SCOPE=SPFILE; SQL> ALTER SYSTEM SET log_archive_dest_3='service=ORCL_STBY2 async valid_for=(standby_logfiles,standby_role) db_unique_name=ORCL_STBY2' SCOPE=SPFILE; SQL> ALTER SYSTEM SET log_archive_dest_state_3='DEFER' SCOPE=SPFILE; SQL> ALTER SYSTEM SET fal_server='ORCL_STBY1' SCOPE=SPFILE; SQL> ALTER SYSTEM SET fal_client='ORCL_PRIM' SCOPE=SPFILE;
# stby_dg1
*.log_archive_config='dg_config=(ORCL_PRIM,ORCL_STBY1,ORCL_STBY2)' *.log_archive_dest_1='LOCATION=/oradb/app/oracle/archive/orcl_stby1 valid_for=(all_logfiles,all_roles) db_unique_name=ORCL_STBY1' *.log_archive_dest_2='service=ORCL_PRIM async valid_for=(online_logfiles,primary_role) db_unique_name=ORCL_PRIM' *.log_archive_dest_3='service=ORCL_STBY2 async valid_for=(standby_logfiles,standby_role) db_unique_name=ORCL_STBY2' *.log_archive_dest_state_3='ENABLE'
# stby_dg2
*.log_archive_config='dg_config=(ORCL_PRIM,ORCL_STBY1,ORCL_STBY2)' *.log_archive_dest_1='LOCATION=/oradb/app/oracle/archive/orcl_stby2'
配置条件
1. Primary, Cascading and Cascaded Standby Database db_unique_name must be present in the dg_config of log_archive_config on all the Databases
2. Setup log_archive_dest_n on the cascading Standby Database to serve the cascaded (terminal) Standby Databases using the Attribute ‘valid_for=(STANDBY_LOGFILES,STANDBY_ROLE)’
3. You can toggle between Real-Time and Non Real-Time Cascading using the Log Transport Method.
ASYNC = Real-Time Cascading
这套库的配置
config | log_archive_dest_2 | log_archive_dest_3 | fal_server | fal_client | |
a | anbob anbobdg1 |
service=anbobdg1 sync valid_for= (online_logfiles,primary_role) db_unique_name=anbobdg1 |
空 | anbobDG1 | anbob |
b | anbob anbobdg2 anbobdg1 |
service=anbob sync
valid_for= (all_logfiles, primary_roles) db_unique_name=anbob |
service=anbobdg2 async valid_for= (all_logfiles,all_roles) db_unique_name=anbobdg2 |
anbob | anbobDG1 |
c | anbobdg1 anbobdg2 |
service=anbobdg1 sync valid_for= (online_logfiles,standby_role) db_unique_name=anbobdg1 |
空 | anbobDG1 | anbobDG2 |
导致Archive Gaps 常见原因:
Problem in the primary instance.
Primary instance is unable to identify the standby site. unable to resolve the service name
Standby TNS service entry is deactivated, deferred, or removed from the primary LOG_ARCHIVE_DEST_n parameter.
Network Disconnects or stop of Log Transport Services
The standby listener is down
Outages of the Standby Database
Misconfigurations of Log Transport Services
I/O-Issues on the Standby Site
Manual Deletion of Archive Logs before they are applied to the Standby
Insufficient Bandwidth in the Network between the Primary and Standby Site
ORA-3135排除方法
确认库之间无firewall
sqlplus创建一个idle session
sqlnet.ora 配置探测SQLNET.EXPIRE_TIME=10
确认sqlnet.ora中无SQLNET.SEND_TIMEOUT
archive_lag_target配置8分钟自动切归档
操作系统配置keepalived 探测,ss -otn查看进程是否生效
如何Trace跟踪Remote File Server (RFS) Process?
1, 在primary side的db alert log查找ARCn或TTn进程
2, 在操作系统查找进程号
3,在standby找对应的rfs进程
select process,pid,client_pid from v$managed_standby;
4, 使用oradebug给启动进程10046跟踪
配置sqlnet trace
Edit the SQLNET.ORA file for the primary database and add
TRACE_LEVEL_SERVER = 16 TRACE_DIRECTORY_SERVER = <DIRECTORY> TRACE_TIMESTAMP_SERVER = ON DIAG_ADR_ENABLED=off <==Required for version 11g and newer
kill TTn/ARCn进程后会自动重启该进程,非致命进程。
NOTE : To enable SQLNET tracing refer,
step by Step Method to Enable SQLNET(Server Tracing) Dynamically on Dataguard (Doc ID 2010476.1)
临时解决方法:
重启TT进程
1, 从alert log中找到报错TTn进程
2,从操作系统kill TTn进程
3,switch logfile在主库,传给备库后RFS通常会发现GAP,触发auto FAL,自动去primary 拉GAP 归档继续重用。
如果依旧无法恢复,
# ON standby
kill rfs进程,并重启 Managed Recovery
SQL> alter database recover managed standby database cancel; SQL> alter database recover managed standby database [using current logfile] disconnect;
#on Primary
重启用 Log Transport Services:
SQL> alter system set log_archive_dest_state_2='defer'; SQL> alter system set log_archive_dest_state_2='enable' ;
19.18以前存在Bug 34019478 DataGuard: ORA-3135/ ORA-16198 due to hung i/o operation , 解决方法
Increase the default value of below parameters
data_guard_max_io_time to 600 data_guard_max_longio_time to 600
解决方法
在把C库的dest_2清空后发现恢复正常。后来发现是C库的dest配置错误,导致A与C同时向B传,B库轮训的kill rfs进程。 抛开这套环境的配置的合理性,解决与复现这个问题如下:
# cascaded standby
#解决方法 ALTER SYSTEM SET log_archive_dest_2='service=anbobdb_dg async valid_for=(online_logfiles,primary_role) db_unique_name=anbobdb_dg' SCOPE=BOTH; #错误配置 ALTER SYSTEM SET log_archive_dest_2='service=anbobdb_dg async valid_for=(standby_logfile,standby_role) db_unique_name=anbobdb_dg' SCOPE=BOTH;
对不起,这篇文章暂时关闭评论。