首页 » ORACLE 9i-23ai » 移除DataGuard Standby配置导致Primary启动失败

移除DataGuard Standby配置导致Primary启动失败

Datagaurd是oracle 数据库常用的容灾保护方案,但standby不再需要或删除后,在primary 需要清理之前的配置,通常认为很容易,比如禁用日志传输,no force logging,  清理log_archive _destN \LOG_ARCHIVE_CONFIG相关的参数等。是否很容易? 但里面隐藏着一个风险。上周一个政F行业客户遇到个案例,主库Oracle RAC 11.2.0.4原来有dataguard,但是清理后,直到有1个节点重启后实例无法启动,简单记录。

DB alert log

Tue Jul 25 22:27:41 2023
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_m000_58781.trc  (incident=296583):
ORA-01578: ORACLE data block corrupted (file # 403, block # 2)
ORA-01110: data file 403: '+DATAVG/anbob/tempfile/temp2.dbf'
Tue Jul 25 22:27:41 2023
Sweep [inc][296583]: completed
Tue Jul 25 22:37:41 2023
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_m000_64973.trc  (incident=296584):
ORA-01578: ORACLE data block corrupted (file # 403, block # 2)
ORA-01110: data file 403: '+DATAVG/anbob/tempfile/temp2.dbf'
Tue Jul 25 22:37:43 2023
Sweep [inc][296584]: completed
Tue Jul 25 22:48:18 2023
IPC Send timeout detected. Receiver ospid 25486 [
Tue Jul 25 22:48:18 2023
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc:
IPC Send timeout detected. Receiver ospid 25478 [
Tue Jul 25 22:48:18 2023
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms0_25478.trc:
IPC Send timeout detected. Receiver ospid 25490 [
Tue Jul 25 22:48:19 2023
...
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc:
IPC Send timeout detected. Receiver ospid 25482 [
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms1_25482.trc:
IPC Send timeout detected. Receiver ospid 25486 [
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc:
IPC Send timeout detected. Receiver ospid 25486 [
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc:
Tue Jul 25 22:48:53 2023
IPC Send timeout detected. Receiver ospid 25482 [
Tue Jul 25 22:48:53 2023
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms1_25482.trc:
Tue Jul 25 22:49:07 2023
Detected an inconsistent instance membership by instance 1
Tue Jul 25 22:49:08 2023
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
LMD0 (ospid: 25476): terminating the instance due to error 481
Tue Jul 25 22:49:08 2023
System state dump requested by (instance=2, osid=25476 (LMD0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_25464_20230725224908.trc
Tue Jul 25 22:49:09 2023
ORA-1092 : opitsk aborting process
Tue Jul 25 22:49:09 2023
License high water mark = 1404
Instance terminated by LMD0, pid = 25476
USER (ospid: 8175): terminating the instance
Instance terminated by USER, pid = 8175
Wed Jul 26 22:14:00 2023
Adjusting the default value of parameter parallel_max_servers
from 2560 to 1970 due to the value of parameter processes (2000)
Starting ORACLE instance (normal)
...
...
ARC3 started with pid=47, OS id=11845
ARC1: Archival started
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
WARNING: The 'LOG_ARCHIVE_CONFIG' init.ora parameter settings are inconsistent with another started instance.  
This may be caused by the 'DB_UNIQUE_NAME' init.ora parameter being specified
differently on one or more of the other RAC instances;
 theDB_UNIQUE_NAME parameter value MUST be identical for allinstances of the database.
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_11631.trc:
ORA-16188: LOG_ARCHIVE_CONFIG settings inconsistent with previously started instance
LGWR (ospid: 11631): terminating the instance due to error 16188
Wed Jul 26 23:06:28 2023
System state dump requested by (instance=2, osid=11631 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_11578_20230726230628.trc
Dumping diagnostic data in directory=[cdmp_20230726230628], requested by (instance=2, osid=11631 (LGWR)), summary=[abnormal instance termination].
Instance terminated by LGWR, pid = 11631

Note:
这里实际有3个问题我用三种颜色表示, 绿色是tempfile发现了坏块, 橙色是RAC 节点间出现IPC通信错误, 红色是实例异常crash,CRS自动拉起后启动失败,提示ORA-16188: LOG_ARCHIVE_CONFIG 节点间参数不一致。 前两个问题不在本篇的描述范围,关键是ora-16188, 想起了前不久《Troubleshooting Oracle Exadata X5 db instance mount fail with ORA-01105 & ORA-01154》那个故障,也是两个实例间不兼容无法启动,当时还排查过这个参数。接下来分析看一下当前参数

LOG_ARCHIVE_CONFIG参数

SQL> show parameter pfile

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
spfile				     string	 +DATAVG/anbob/spfileanbob.ora
SQL> show parameter config

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
dg_broker_config_file1		     string	 /u01/app/oracle/product/11.2.0/db_1/dbs/dr1anbob.dat
dg_broker_config_file2		     string	 /u01/app/oracle/product/11.2.0/db_1/dbs/dr2anbob.dat
log_archive_config		     string

Note:
对比了RAC所有实例使用的是同一 spfile, 另一存活实例参数与spfile参数相同,都是null 或”, 通过create pfile能发现有*.log_archive_config=” 的配置。

ORA-16188

LOG_ARCHIVE_CONFIG settings are inconsistent with the previously started instance.
Cause
The settings for the LOG_ARCHIVE_CONFIG parameter were inconsistent with the settings of a previously started instance. 
The settings for this parameter must be exactly the same for all instances.

Action
Ensure that all instances use the exact same LOG_ARCHIVE_CONFIG settings.

Note:
这里给ORACLE做个宣传,oracle已悄悄上线了error-help功能, 该功能确实早就应该有,在线可以查询错误编码,增加到google浏览器的自定义搜索引擎,可以快速的搜索错误代码,有些内容比oerr程序 提示更加丰富。页面如Ora-16188

也许你从网上能搜到或Standby Instance Crash or Failed to Startup with ORA-16188 (Doc ID 1580482.1) ,对于standby启动相关的错误的解决方法
Cause:
Setting log_archive_config to null in RAC standby database and Startup of node 2 fails with ORA-16188.

Alter system set log_archive_config=” scope=both sid=’*’

Solution:
This is due to v$dataguard_config not updated between instances.
To fix this do the below in any one of the nodes.
alter system set log_archive_config=NODG_CONFIG scope=both sid=’*’;
And retry the standby setup.

上面这种修改为”的方式,可能会导致实例v$dataguard_config 可能没有及时更新,提示参数不一致,似乎是个bug, 从oracle 12c已经修复了该问题, 这之前可能更新值为NODG_CONFIG, 我们在幸存的节点修改参数.

SQL> alter system set log_archive_config=NODG_CONFIG  scope=both sid='*';
alter system set log_archive_config=NODG_CONFIG  scope=both sid='*'
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-16135: Invalid LOG_ARCHIVE_CONFIG modification while in protected mode

Note:
注意这里提示修改的值是无效值,无语,既然开始提示2实例参数不一致,那重启幸存实例肯定就可以解决了吧? 这里因为时间紧急,客户反馈可以重启幸存节点,但如果重启就尽快因为20分钟后有批处理作业。于是决定重启, 至于这个nodg_config值是无效值,这里其实忽略了一个问题.

重启幸存实例

-- 备份了一个spfile
SQL> alter  system reset log_archive_config scope=spfile;
System altered.

SQL> shut immediate

SQL> startup 
ORACLE instance started.

Total System Global Area 2.2448E+11 bytes
Fixed Size		    2266664 bytes
Variable Size		 3.4897E+10 bytes
Database Buffers	 1.8952E+11 bytes
Redo Buffers		   64827392 bytes
Database mounted.
ORA-03113: end-of-file on communication channel
Process ID: 26774
Session ID: 2065 Serial number: 3

Note:
重启以后坏了,一个也无法启动了,查看db alert log

DB ALERT LOG

Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {0:7:13143} */
ALTER DATABASE OPEN /* db agent *//* {0:7:13143} */
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
LGWR: STARTING ARCH PROCESSES
Wed Jul 26 23:39:09 2023
ARC0 started with pid=44, OS id=27434 
Wed Jul 26 23:39:10 2023
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
LGWR: Minimum of 1 LGWR standby database required
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lgwr_27288.trc:
ORA-16072: a minimum of one standby database destination is required
Wed Jul 26 23:39:10 2023
ARC1 started with pid=45, OS id=27436 
LGWR (ospid: 27288): terminating the instance due to error 16072
Wed Jul 26 23:39:10 2023
System state dump requested by (instance=1, osid=27288 (LGWR)), summary=[abnormal instance te
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_diag_2
Dumping diagnostic data in directory=[cdmp_20230726233910], requested by (instance=1, osid=27
Instance terminated by LGWR, pid = 27288

Note:
注意这里的提示已经很明确,主库目前是最大可用模式,至少需要1个standby, 而我们这个库已经清理了所有的standby ,只是之前一直未重启过, 所以这个雷留到了现在。 知道原因解决就容易了。不过我们再回看上面提示nodg_config报错时提到的:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-16135: Invalid LOG_ARCHIVE_CONFIG modification while in protected mode

ORA-16135

Cause: The LOG_ARCHIVE_CONFIG parameter can not be modified while any RAC instance is open in either maximum protection or maximum availability mode. Also, the parameter can not be modified in such way that would cause all destinations to fail while in maximum protection mode.

Action: Make the modification before the database is opened by any instance.
这个错误时忽略了检查一下原来database 的保护模式. 以后禁用dg或清理参数前记的检查, 既然一个dg standby都没有,也只能重启所有节点,切到最大性能模式。

select database_name,database_role,force_logging,PROTECTION_MODE from v$database;

修改保护模式

SQL> startup mount
ORACLE instance started.

Total System Global Area 2.2448E+11 bytes
Fixed Size		    2266664 bytes
Variable Size		 3.4897E+10 bytes
Database Buffers	 1.8952E+11 bytes
Redo Buffers		   64827392 bytes
Database mounted.
SQL> alter database set standby to maximize performance;
Database altered.

SQL> alter database open;
Database altered.

Summary:
当清理或删除Standby时,在主库需要检查当前的保护模式,如最大可用和最大保护模式至少要有1 standby, 在12c之前RAC 修改LOG_ARCHIVE_CONFIG参数记的使用NODG_CONFIG而不是”,否则可能会出现后期实例重启无法启动的风险。

打赏

对不起,这篇文章暂时关闭评论。