Oracle12c R2注意事项:DB password file in ASM (DataGuard环境重建PWD)
之前在 Oracle 12c 关于密码(password)的几个新特性小结曾经记录过12C 密码文件可以放到ASM DG的特性, 最近处理一个案例时就遇到了,简单记录一套修复他人搭建的问题RAC+ DG的过程, 环境Oracle 12cR2 2-nodes RAC+DG。
1, db alert log of primary side
2018-12-25 09:32:18.111000 +08:00 Error 1034 received logging on to the standby 2018-12-25 09:37:18.335000 +08:00 Error 1034 received logging on to the standby
tip:
提示目标实例未启动,后确认备库有一个节点1启动,节点2关闭,理论上Standby 端不应该因部分实例停而影响DG的传输。
2. 确认DG 配置 on primary side
SQL>show parameter log_archive NAME VALUES ----------------- -------------------------------------------------------------------------------------------- log_archive_dest_2 SERVICE=STD_ANBOB2 LGWR ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stdANBOB oracle@DB01:/home/oracle> tnsping STD_ANBOB2 Used TNSNAMES adapter to resolve the alias Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = 133.96.43.38)(PORT = 1531))) (CONNECT_DATA = (SID = ANBOB2))) OK (0 msec)
tip:
使用了非默认的1531 port, 但是配置的只是节点2的IP,而且还使用的SID而不是service_name, 后来发现standby side的实例监听中只配置了SERVICE_NAME=SID的服务,无同名服务说明静态监听配置也是不正确的,需要先修复standby side上增加共同的服务名。
3, 修正primary side 传输SERVICE配置
oracle@DB01:/home/oracle> tnsping STD_ANBOB Used TNSNAMES adapter to resolve the alias Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = 133.96.43.137)(PORT = 1531)) (ADDRESS = (PROTOCOL = TCP)(HOST = 133.96.43.138)(PORT = 1531))) (CONNECT_DATA = (service_name = ANBOB.com))) OK (0 msec) SQL> alter system set log_archive_dest_2='SERVICE=STD_ANBOB LGWR ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stdANBOB'; # Primary side db alert log Error 1017 received logging on to the standby ------------------------------------------------------------ Check that the primary and standby are using a password file and remote_login_passwordfile is set to SHARED or EXCLUSIVE, and that the SYS password is same in the password files. returning error ORA-16191 ------------------------------------------------------------ TT04: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16191) TT04: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned Errors in file /oracle/app/oracle/diag/rdbms/ANBOB/anbob/trace/anbob_tt04_47574.trc: ORA-16191: Primary log shipping client not logged on standby Error 16191 for archive log file 5 to 'STD_ANBOB'
Tip:
增加了实例1,使用了服务名,支持STANDBY上的RAC部分实例关闭后的正常传输, 但是主库日志里又出现了ora-1017 ora-16191错误
4, STANDBY side ORA-7445 [kzsrgavs()+995]
# db alert log 2018-12-25 10:48:21.996000 +08:00 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x7FFD269AF8A7] [PC:0x2897843, kzsrgavs()+995] [flags: 0x0, count: 1] Errors in file /oracle/app/oracle/diag/rdbms/stdANBOB/ANBOB1/trace/ANBOB1_pr00_36293.trc (incident=64809): ORA-07445: exception encountered: core dump [kzsrgavs()+995] [SIGSEGV] [ADDR:0x7FFD269AF8A7] [PC:0x2897843] [Address not mapped to object] [] Incident details in: /oracle/app/oracle/diag/rdbms/stdANBOB/ANBOB1/incident/incdir_64809/ANBOB1_pr00_36293_i64809.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Dumping diagnostic data in directory=[cdmp_20181225104822], requested by (instance=1, osid=36293 (PR00)), summary=[incident=64809]. 2018-12-25 10:48:51.764000 +08:00 alter database recover managed standby database cancel 2018-12-25 10:48:53.175000 +08:00 Logmerger died, shutting down parallel recovery slaves 2018-12-25 10:49:26.184000 +08:00 CLMN: delete dead process - failed 2018-12-25 10:53:59.078000 +08:00 ORA-1013 signalled during: alter database recover managed standby database cancel... # ora-7445 trace file *** 2018-12-25T10:48:22.063388+08:00 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0) [TOC00004] 3> ***** SQL Statement (None) ***** Current SQL information unavailable - no cursor. 3< ***** current_sql_statement ***** [TOC00004-END] [TOC00005] 3> ***** Call Stack Trace ***** ksedst()+119 < dbkedDefDump()+1200 < ksedmp()+259 < ssexhd()+3188 < sslssSynchHdlr()+39 < sslsshandler()+118 < __sighandler() < kzsrgavs()+995 < krsu_pre11_pwd_conn() < krsu_oci_est_sec_conn()+1594 < krsu_wupi_log()+446 < krsu_upi_logc()+366 < krsu_connect()+104 < krsf_send_fal_request()+607 < krsf_get_gap_sequence()+1266 < krr_read_disk()+446 < krr_read_buffer() < krr_parse_redo() < krr_do_media_recovery()+6603 < krddmr()+1356 < krd_do_mrp()+1078 < krd_logmerger_driver()+6425 < krp_slave_main() < ksvrdp_int()+2010 < opirip()+602 < opidrv()+602 < sou2o()+145 < opimai_real()+202 < ssthrdmain()+417 < main()+262
tips:
Standby 实例具然出现了ora-7445 [kzsrgavs()+995].
KZSRGAVS >>>>> (Kzsr)Gavs – Kernel Security Privileges Roles/Remote Password File ??
KRSU_PRE11_PWD_CONN >>>>>(Krs)U – Kernel Recovery Standby/Dataguard
猜测DG 的db password file不一致,说明当时备库的节点1可能密码文件和主库和备库2节点不一致,这点可以使用md5sum验证文件。于是习惯性的从Primary节点1的$ORACLE_HOME/dbs下, scp过去一份PWD file到standby 下,此时执行alter database recover managed standby database cancel操作可能会hang, 可以shutdown abort重新启动. 再次启动log应用。
# standby side db alert log
Managed Standby Recovery starting Real Time Apply 2018-12-25 11:07:57.610000 +08:00 Parallel Media Recovery started with 80 slaves Media Recovery Log +ARCHDG/STDANBOB/ARCHIVELOG/2018_12_25/thread_2_seq_1906.301.995795603 Media Recovery Log +ARCHDG/STDANBOB/ARCHIVELOG/2018_12_24/thread_1_seq_5110.280.995721757 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x7FFF32BC5EC7] [PC:0x2897843, kzsrgavs()+995] [flags: 0x0, count: 1] Errors in file /oracle/app/oracle/diag/rdbms/stdANBOB/ANBOB1/trace/ANBOB1_arc3_76413.trc (incident=128681): ORA-07445: exception encountered: core dump [kzsrgavs()+995] [SIGSEGV] [ADDR:0x7FFF32BC5EC7] [PC:0x2897843] [Address not mapped to object] [] Incident details in: /oracle/app/oracle/diag/rdbms/stdANBOB/ANBOB1/incident/incdir_128681/ANBOB1_arc3_76413_i128681.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. 2018-12-25 11:07:59.047000 +08:00 Dumping diagnostic data in directory=[cdmp_20181225110759], requested by (instance=1, osid=76413 (ARC3)), summary=[incident=128681]. 2018-12-25 11:08:00.972000 +08:00 ARCH: Detected ARCH process failure ARCH: STARTING ARCH PROCESSES Starting background process ARC3 ARC3 started with pid=167, OS id=29684 ARC3: Archival started ARCH: STARTING ARCH PROCESSES COMPLETE Media Recovery of Online Log [Thread=1, Seq=5111] Recovery of Online Redo Log: Thread 1 Group 52 Seq 5111 Reading mem 0 Mem# 0: +DATADG/STDANBOB/ONLINELOG/group_52.260.995277121 2018-12-25 11:09:01.424000 +08:00 Standby crash recovery failed to bring standby database to a consistent point because needed redo hasn't arrived yet. MRP: Wait timeout: thread 1 sequence# 5111 Media Recovery user canceled with status 16016 Managed Standby Recovery not using Real Time Apply Standby Crash Recovery aborted due to error 10877. Errors in file /oracle/app/oracle/diag/rdbms/stdANBOB/ANBOB1/trace/ANBOB1_ora_29171.trc: ORA-10877: error signaled in parallel recovery slave ORA-10877: error signaled in parallel recovery slave Recovery interrupted! Some recovered datafiles maybe left media fuzzy Media recovery may continue but open resetlogs may fail Errors in file /oracle/app/oracle/diag/rdbms/stdANBOB/ANBOB1/trace/ANBOB1_pr00_29289.trc: ORA-16043: Redo apply has been canceled. ORA-16016: archived log for thread 1 sequence# 5111 unavailable Completed Standby Crash Recovery.
Tip:
当时还以为redo archived log sequence# 5111有问题,手动传输应用一下。
5. 手动传输GAP ARCHIVED LOG file
RMAN> backup archivelog sequence between 5111 and 5122; # scp backuppiece file to standby side RMAN> catalog backuppiece '/home/oracle/bftllg3h_1_1'; RMAN> restore archivelog sequence between 5111 and 5122;
tip:
Standby 重启应用发现可以往下继续应用,但是很快把手动传输的日志应用完后,又ORA-7445挂起了, 此时判断还应该是密码文件问题。 难道密码文件不对,看来是忽略了12C 的password file in ASM的特性。
6, password file in ASM
oracle@DB01:/home/oracle> srvctl config database -d ANBOB
Database unique name: ANBOB
Database name: ANBOB
Oracle home: /oracle/app/oracle/product/12.2.0/db_1
Oracle user: oracle
Spfile: +DATADG/ANBOB/PARAMETERFILE/spfile.281.992037707
Password file: +DATADG/ANBOB/PASSWORD/pwdANBOB.269.992036729
Domain: com
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools:
Disk Groups: DATADG,ARCHDG
Mount point paths:
Services:
Type: RAC
Start concurrency:
Stop concurrency:
OSDBA group: dba
OSOPER group:
Database instances: ANBOB1,ANBOB2
...
Note:
Password file记录确认了当前的实例密码文件是在ASM DiskGroup中。如果该行记录值为空表明使用使用之前的本地ORACLE_HOME/dbs下的目录文件,现在解决进来就简单了。
grid@DB01:/home/grid>asmcmd ASMCMD> cp +DATADG/ANBOB/PASSWORD/pwdANBOB.269.992036729 /tmp/orapwd.file copying +DATADG/ANBOB/PASSWORD/pwdANBOB.269.992036729 -> /tmp/orapwd.file grid@DB01:/home/grid>ls -l /tmp/orapwd.file -rw-r----- 1 grid oinstall 3584 Dec 25 14:15 /tmp/orapwd.file # scp the orapwd.file to standby instance both.
7. REGISTER Standby DB in CRS
oracle@qdord01:/home/oracle> srvctl add database -db stdANBOB -oraclehome $ORACLE_HOME -spfile +DATADG/STDANBOB/PARAMETERFILE/spfile.435.995277297 -role PHYSICAL_STANDBY -startoption MOUNT -dbtype RAC -dbname ANBOB -diskgroup "datadg,archdg" oracle@qdord01:/home/oracle> srvctl add instance -db stdANBOB -instance ANBOB1 -node qdord01 oracle@qdord01:/home/oracle> srvctl add instance -db stdANBOB -instance ANBOB2 -node qdord02 grid@qdord01:/home/grid> srvctl start instance -d stdANBOB -n qdord01 grid@qdord01:/home/grid> srvctl start instance -d stdANBOB -n qdord02
Conclusion:
12C中为了在不同实例间统一的密码管理, 支持把DB密码(ASM password same)存储到ASM DISKGROUP中,这样在维护DG环境时,当同步密码文件时就要先确认一下密码的位置, 同样DG端也可以把密码存储到ASM中,然后使用srvctl modify database修改pwd路径. 这个案例通过在标准化DG配置中因密码不一致产生了各种错误,后重新同步密码文件解决,提醒有相同问题的同学可以少走弯路。
— enjoy —
对不起,这篇文章暂时关闭评论。