移除DataGuard Standby配置导致Primary启动失败
Datagaurd是oracle 数据库常用的容灾保护方案,但standby不再需要或删除后,在primary 需要清理之前的配置,通常认为很容易,比如禁用日志传输,no force logging, 清理log_archive _destN \LOG_ARCHIVE_CONFIG相关的参数等。是否很容易? 但里面隐藏着一个风险。上周一个政F行业客户遇到个案例,主库Oracle RAC 11.2.0.4原来有dataguard,但是清理后,直到有1个节点重启后实例无法启动,简单记录。
DB alert log
Tue Jul 25 22:27:41 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_m000_58781.trc (incident=296583): ORA-01578: ORACLE data block corrupted (file # 403, block # 2) ORA-01110: data file 403: '+DATAVG/anbob/tempfile/temp2.dbf' Tue Jul 25 22:27:41 2023 Sweep [inc][296583]: completed Tue Jul 25 22:37:41 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_m000_64973.trc (incident=296584): ORA-01578: ORACLE data block corrupted (file # 403, block # 2) ORA-01110: data file 403: '+DATAVG/anbob/tempfile/temp2.dbf' Tue Jul 25 22:37:43 2023 Sweep [inc][296584]: completed Tue Jul 25 22:48:18 2023 IPC Send timeout detected. Receiver ospid 25486 [ Tue Jul 25 22:48:18 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: IPC Send timeout detected. Receiver ospid 25478 [ Tue Jul 25 22:48:18 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms0_25478.trc: IPC Send timeout detected. Receiver ospid 25490 [ Tue Jul 25 22:48:19 2023 ... Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: IPC Send timeout detected. Receiver ospid 25482 [ Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms1_25482.trc: IPC Send timeout detected. Receiver ospid 25486 [ Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: IPC Send timeout detected. Receiver ospid 25486 [ Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: Tue Jul 25 22:48:53 2023 IPC Send timeout detected. Receiver ospid 25482 [ Tue Jul 25 22:48:53 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms1_25482.trc: Tue Jul 25 22:49:07 2023 Detected an inconsistent instance membership by instance 1 Tue Jul 25 22:49:08 2023 Received an instance abort message from instance 1 Please check instance 1 alert and LMON trace files for detail. LMD0 (ospid: 25476): terminating the instance due to error 481 Tue Jul 25 22:49:08 2023 System state dump requested by (instance=2, osid=25476 (LMD0)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_25464_20230725224908.trc Tue Jul 25 22:49:09 2023 ORA-1092 : opitsk aborting process Tue Jul 25 22:49:09 2023 License high water mark = 1404 Instance terminated by LMD0, pid = 25476 USER (ospid: 8175): terminating the instance Instance terminated by USER, pid = 8175 Wed Jul 26 22:14:00 2023 Adjusting the default value of parameter parallel_max_servers from 2560 to 1970 due to the value of parameter processes (2000) Starting ORACLE instance (normal) ... ... ARC3 started with pid=47, OS id=11845 ARC1: Archival started ARC2: Archival started ARC1: Becoming the 'no FAL' ARCH ARC1: Becoming the 'no SRL' ARCH ARC2: Becoming the heartbeat ARCH WARNING: The 'LOG_ARCHIVE_CONFIG' init.ora parameter settings are inconsistent with another started instance. This may be caused by the 'DB_UNIQUE_NAME' init.ora parameter being specified differently on one or more of the other RAC instances; theDB_UNIQUE_NAME parameter value MUST be identical for allinstances of the database. Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_11631.trc: ORA-16188: LOG_ARCHIVE_CONFIG settings inconsistent with previously started instance LGWR (ospid: 11631): terminating the instance due to error 16188 Wed Jul 26 23:06:28 2023 System state dump requested by (instance=2, osid=11631 (LGWR)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_11578_20230726230628.trc Dumping diagnostic data in directory=[cdmp_20230726230628], requested by (instance=2, osid=11631 (LGWR)), summary=[abnormal instance termination]. Instance terminated by LGWR, pid = 11631
Note:
这里实际有3个问题我用三种颜色表示, 绿色是tempfile发现了坏块, 橙色是RAC 节点间出现IPC通信错误, 红色是实例异常crash,CRS自动拉起后启动失败,提示ORA-16188: LOG_ARCHIVE_CONFIG 节点间参数不一致。 前两个问题不在本篇的描述范围,关键是ora-16188, 想起了前不久《Troubleshooting Oracle Exadata X5 db instance mount fail with ORA-01105 & ORA-01154》那个故障,也是两个实例间不兼容无法启动,当时还排查过这个参数。接下来分析看一下当前参数
LOG_ARCHIVE_CONFIG参数
SQL> show parameter pfile NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ spfile string +DATAVG/anbob/spfileanbob.ora SQL> show parameter config NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ dg_broker_config_file1 string /u01/app/oracle/product/11.2.0/db_1/dbs/dr1anbob.dat dg_broker_config_file2 string /u01/app/oracle/product/11.2.0/db_1/dbs/dr2anbob.dat log_archive_config string
Note:
对比了RAC所有实例使用的是同一 spfile, 另一存活实例参数与spfile参数相同,都是null 或”, 通过create pfile能发现有*.log_archive_config=” 的配置。
ORA-16188
LOG_ARCHIVE_CONFIG settings are inconsistent with the previously started instance. Cause The settings for the LOG_ARCHIVE_CONFIG parameter were inconsistent with the settings of a previously started instance. The settings for this parameter must be exactly the same for all instances. Action Ensure that all instances use the exact same LOG_ARCHIVE_CONFIG settings.
Note:
这里给ORACLE做个宣传,oracle已悄悄上线了error-help功能, 该功能确实早就应该有,在线可以查询错误编码,增加到google浏览器的自定义搜索引擎,可以快速的搜索错误代码,有些内容比oerr程序 提示更加丰富。页面如Ora-16188
也许你从网上能搜到或Standby Instance Crash or Failed to Startup with ORA-16188 (Doc ID 1580482.1) ,对于standby启动相关的错误的解决方法
Cause:
Setting log_archive_config to null in RAC standby database and Startup of node 2 fails with ORA-16188.
Alter system set log_archive_config=” scope=both sid=’*’
Solution:
This is due to v$dataguard_config not updated between instances.
To fix this do the below in any one of the nodes.
alter system set log_archive_config=NODG_CONFIG scope=both sid=’*’;
And retry the standby setup.
上面这种修改为”的方式,可能会导致实例v$dataguard_config 可能没有及时更新,提示参数不一致,似乎是个bug, 从oracle 12c已经修复了该问题, 这之前可能更新值为NODG_CONFIG, 我们在幸存的节点修改参数.
SQL> alter system set log_archive_config=NODG_CONFIG scope=both sid='*'; alter system set log_archive_config=NODG_CONFIG scope=both sid='*' * ERROR at line 1: ORA-02097: parameter cannot be modified because specified value is invalid ORA-16135: Invalid LOG_ARCHIVE_CONFIG modification while in protected mode
Note:
注意这里提示修改的值是无效值,无语,既然开始提示2实例参数不一致,那重启幸存实例肯定就可以解决了吧? 这里因为时间紧急,客户反馈可以重启幸存节点,但如果重启就尽快因为20分钟后有批处理作业。于是决定重启, 至于这个nodg_config值是无效值,这里其实忽略了一个问题.
重启幸存实例
-- 备份了一个spfile SQL> alter system reset log_archive_config scope=spfile; System altered. SQL> shut immediate SQL> startup ORACLE instance started. Total System Global Area 2.2448E+11 bytes Fixed Size 2266664 bytes Variable Size 3.4897E+10 bytes Database Buffers 1.8952E+11 bytes Redo Buffers 64827392 bytes Database mounted. ORA-03113: end-of-file on communication channel Process ID: 26774 Session ID: 2065 Serial number: 3
Note:
重启以后坏了,一个也无法启动了,查看db alert log
DB ALERT LOG
Lost write protection disabled Completed: ALTER DATABASE MOUNT /* db agent *//* {0:7:13143} */ ALTER DATABASE OPEN /* db agent *//* {0:7:13143} */ This instance was first to open Picked broadcast on commit scheme to generate SCNs LGWR: STARTING ARCH PROCESSES Wed Jul 26 23:39:09 2023 ARC0 started with pid=44, OS id=27434 Wed Jul 26 23:39:10 2023 ARC0: Archival started LGWR: STARTING ARCH PROCESSES COMPLETE ARC0: STARTING ARCH PROCESSES LGWR: Primary database is in MAXIMUM AVAILABILITY mode LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR LGWR: Minimum of 1 LGWR standby database required Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lgwr_27288.trc: ORA-16072: a minimum of one standby database destination is required Wed Jul 26 23:39:10 2023 ARC1 started with pid=45, OS id=27436 LGWR (ospid: 27288): terminating the instance due to error 16072 Wed Jul 26 23:39:10 2023 System state dump requested by (instance=1, osid=27288 (LGWR)), summary=[abnormal instance te System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_diag_2 Dumping diagnostic data in directory=[cdmp_20230726233910], requested by (instance=1, osid=27 Instance terminated by LGWR, pid = 27288
Note:
注意这里的提示已经很明确,主库目前是最大可用模式,至少需要1个standby, 而我们这个库已经清理了所有的standby ,只是之前一直未重启过, 所以这个雷留到了现在。 知道原因解决就容易了。不过我们再回看上面提示nodg_config报错时提到的:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-16135: Invalid LOG_ARCHIVE_CONFIG modification while in protected mode
ORA-16135
Cause: The LOG_ARCHIVE_CONFIG parameter can not be modified while any RAC instance is open in either maximum protection or maximum availability mode. Also, the parameter can not be modified in such way that would cause all destinations to fail while in maximum protection mode.
Action: Make the modification before the database is opened by any instance.
这个错误时忽略了检查一下原来database 的保护模式. 以后禁用dg或清理参数前记的检查, 既然一个dg standby都没有,也只能重启所有节点,切到最大性能模式。
select database_name,database_role,force_logging,PROTECTION_MODE from v$database;
修改保护模式
SQL> startup mount ORACLE instance started. Total System Global Area 2.2448E+11 bytes Fixed Size 2266664 bytes Variable Size 3.4897E+10 bytes Database Buffers 1.8952E+11 bytes Redo Buffers 64827392 bytes Database mounted. SQL> alter database set standby to maximize performance; Database altered. SQL> alter database open; Database altered.
Summary:
当清理或删除Standby时,在主库需要检查当前的保护模式,如最大可用和最大保护模式至少要有1 standby, 在12c之前RAC 修改LOG_ARCHIVE_CONFIG参数记的使用NODG_CONFIG而不是”,否则可能会出现后期实例重启无法启动的风险。
对不起,这篇文章暂时关闭评论。