Troubleshooting Oracle Exadata X5 db instance mount fail with ORA-01105 & ORA-01154
最近一个Oracle Exadata x5 2节点RAC 11.2.0.4环境,每个节点中有2套DB实例。 Node1正常运行,计划性重启Node 2后,CRS启动正常,主机上1个db的 instance 2启动正常, 但另1个db的node2 db instance启动失败,提示下面的错误:
SQL> startup mount; ORACLE instance started. Total System Global Area 3.8482E+10 bytes Fixed Size 2261368 bytes Variable Size 1.7583E+10 bytes Database Buffers 2.0670E+10 bytes Redo Buffers 227811328 bytes ORA-01105: mount is incompatible with mounts by other instances ORA-01154: database busy. Open, close, mount, and dismount not allowed now
分析方法
检查db alert log.
确认2个实例使用的是同一个spfile
实例1 create pfile from memory 和spfile中参数对比
查看实例1的db alert log,查找最后一次dbinstance 启动显示的参数及后期alter system变更的参数
查找实例1启动后的其他所有实例alert log中alter system修改的参数
因为是mount阶段,尝试把controlfile 从ASM copy出来一份,手动启动一测试instance验证mount是否正常
10046跟踪,或dump contrlf, strace跟踪
检查socket file 在.oracle目录下
重启另一存活节点
db alert log
ALTER DATABASE MOUNT NOTE: Loaded library: OSS NOTE: Loaded library: System SUCCESS: diskgroup RECOC1 was mounted NOTE: dependency between database XXX and diskgroup resource ora.RECOC1.dg is established ORA-1105 signalled during: ALTER DATABASE MOUNT...
socket files
/tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle
因为node2 操作系统已发生过重启,且crs启动正常,并且同一node上另一db instance启动正常,排除socket文件问题.
sql trace跟踪
WAIT #139911440860624: nam='control file sequential read' ela= 413 file#=0 block#=816 blocks=1 obj#=-1 tim=1688619713189879
WAIT #139911440860624: nam='control file sequential read' ela= 382 file#=0 block#=96 blocks=1 obj#=-1 tim=1688619713190366
WAIT #139911440860624: nam='control file sequential read' ela= 291 file#=0 block#=423 blocks=1 obj#=-1 tim=1688619713190706
WAIT #139911440860624: nam='control file sequential read' ela= 337 file#=0 block#=47 blocks=1 obj#=-1 tim=1688619713191090
WAIT #139911440860624: nam='enq: DM - contention' ela= 139 name|mode=1145896964 type=0 type=3 obj#=-1 tim=1688619713191284
WAIT #139911440860624: nam='CSS initialization' ela= 2435 p1=0 p2=0 p3=0 obj#=-1 tim=1688619713193772
WAIT #139911440860624: nam='CSS operation: action' ela= 548 function_id=65 p2=0 p3=0 obj#=-1 tim=1688619713194360
WAIT #139911440860624: nam='control file sequential read' ela= 335 file#=0 block#=2 blocks=1 obj#=-1 tim=1688619713194747
WAIT #139911440860624: nam='enq: SW - contention' ela= 280 name|mode=1398210563 0=0 0=0 obj#=-1 tim=1688619713195185
WAIT #139911440860624: nam='rdbms ipc reply' ela= 312 from_process=21 timeout=910 p3=0 obj#=-1 tim=1688619713195538
WAIT #139911440860624: nam='rdbms ipc reply' ela= 145 from_process=22 timeout=910 p3=0 obj#=-1 tim=1688619713195724
WAIT #139911440860624: nam='rdbms ipc reply' ela= 188 from_process=23 timeout=910 p3=0 obj#=-1 tim=1688619713195962
WAIT #139911440860624: nam='rdbms ipc reply' ela= 145 from_process=24 timeout=910 p3=0 obj#=-1 tim=1688619713196138
WAIT #139911440860624: nam='rdbms ipc reply' ela= 94 from_process=25 timeout=910 p3=0 obj#=-1 tim=1688619713196262
WAIT #139911440860624: nam='rdbms ipc reply' ela= 198 from_process=27 timeout=910 p3=0 obj#=-1 tim=1688619713196507
WAIT #139911440860624: nam='rdbms ipc reply' ela= 95 from_process=26 timeout=910 p3=0 obj#=-1 tim=1688619713196643
WAIT #139911440860624: nam='rdbms ipc reply' ela= 141 from_process=28 timeout=910 p3=0 obj#=-1 tim=1688619713196823
WAIT #139911440860624: nam='rdbms ipc reply' ela= 105 from_process=29 timeout=910 p3=0 obj#=-1 tim=1688619713196965
WAIT #139911440860624: nam='rdbms ipc reply' ela= 1867 from_process=31 timeout=900 p3=0 obj#=-1 tim=1688619713198864
WAIT #139911440860624: nam='rdbms ipc reply' ela= 66 from_process=31 timeout=900 p3=0 obj#=-1 tim=1688619713198971
WAIT #139911440860624: nam='rdbms ipc reply' ela= 75 from_process=21 timeout=2147483647 p3=0 obj#=-1 tim=1688619713199090
EXEC #139911440860624:c=39068,e=35990216,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=0,tim=1688619713199245
ERROR #139911440860624:err=1105 tim=1688619713199262
WAIT #139911440860624: nam='Disk file operations I/O' ela= 58 FileOperation=8 fileno=0 filetype=8 obj#=-1 tim=1688619713199414
WAIT #139911440860624: nam='SQL*Net break/reset to client' ela= 7 driver id=1650815232 break?=1 p3=0 obj#=-1 tim=1688619713199449
WAIT #139911440860624: nam='SQL*Net break/reset to client' ela= 74 driver id=1650815232 break?=0 p3=0 obj#=-1 tim=1688619713199541
WAIT #139911440860624: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1688619713199554
strace跟踪发现
disconnect: sosstcpopen failed.boxname = 192.168.XX.21,port =5042 OS system dependent operation:poll_timeout failed with status: 115 OS failure message: Operation now in progress Failure occurred at: sosstcpconne
Note:
strace过程中发现有write trace file 上面报错信息,也在ora trace找到上面的内容,启动过程中有发现db到cellserver的private IB链路一半的链路报错,但controlfile read block尝试另一链路后正常, 使用ibping和rds-ping 在db server和cell server之间正常。这似乎是XD的问题,历史问题, 这个问题之前一直存在不影响启动,该问题先忽略,在X5 X7相同问题都存在, X8正常。
ORA-01154: database busy
Oracle RAC中有些certain parameters 必须在node间参数具有相同的值,如DRM( _gc_read_mostly_locking, _gc_policy_time _gc_undo_affinity etc..), log_archive_config, db_recovery_file_dest & db_recovery_file_dest_size , DB_FILES ,db_file_name_convert ,log_file_name_convert 等
否则可能报错:
ORA-01105: mount is incompatible with mounts by other instances
ORA-19808: recovery destination parameter mismatch
ORA-01105: mount is incompatible with mounts by other instances
ORA-01606: gc_files_to_locks not identical to that of another mounted instance
ORA-01105: mount is incompatible with mounts by other instances
ORA-01174: DB_FILES is 201 buts needs to be 200 to be compatible
ORA-01105: mount is incompatible with mounts by other instances
ORA-01677: standby file name conversion parameters differ from other instance
…
上面的报错 ORA-01154报错应该是主要的问题。
Oracle DB Error ORA-01154 database busy. Open, close, mount, and dismount not allowed now
The following in the cause of this error:
Some operation is in progress that expects the state of the instance to remain open or mounted.
Action you can take to resolve this issue: Wait for the operation to complete and try again. If attempting a normal database shutdown, try SHUTDOWN ABORT.
包括把controlfile copy出来mount正常和strace结果说明在读controlfile均正常, 不确认是什么原因另一实例持有了DB的某状态与现有mount冲突,也可能在另一实例database close中。做好准备后尝试重启存活节点1 db instance.
重启存活节点
尝试重启实例1 db instance..
SQL> alter system checkpoint SQL> shutdown immediate
Note:
挂起10分钟,查看db alert log
# shut immediate alert log 2023-07-06 23:10:58.025000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:01.031000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:04.036000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:07.041000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:10.046000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:13.051000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:16.056000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc: ORA-24756: transaction does not exist 2023-07-06 23:11:19.062000 +08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc:
PMON trace file
# /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_83663.trc *** 2023-07-06 23:11:16.055 found process 0x959db8d98 pid=974 serial=246 ospid = 243846 dead *** 2023-07-06 23:11:16.056 kssxdl: error deleting SO: 0x927452318 = transaction (56) ---------------------------------------- SO: 0x927452318, type: 56, owner: 0x94ac658a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x959db8d98, name=transaction, file=ktccts.h LINE:410, pg=0 ORA-24756: transaction does not exist *** 2023-07-06 23:11:19.061 found process 0x959db8d98 pid=974 serial=246 ospid = 243846 dead *** 2023-07-06 23:11:19.061 kssxdl: error deleting SO: 0x927452318 = transaction (56) ---------------------------------------- SO: 0x927452318, type: 56, owner: 0x94ac658a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x959db8d98, name=transaction, file=ktccts.h LINE:410, pg=0 ORA-24756: transaction does not exist *** 2023-07-06 23:11:22.066 found process 0x959db8d98 pid=974 serial=246 ospid = 243846 dead *** 2023-07-06 23:11:22.067 kssxdl: error deleting SO: 0x927452318 = transaction (56) ---------------------------------------- SO: 0x927452318, type: 56, owner: 0x94ac658a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x959db8d98, name=transaction, file=ktccts.h LINE:410, pg=0
Note:
MOS中查询相关报错PMON cleanup fails with ORA-24756: transaction does not exist (Doc ID 2299927.1),存在bug。
Solution
The only workaround was to restart the instance.
尝试shudown abort重启实例,再次启动实例正常open. 另一实例也启动正常。 大胆猜测一下,也许这两个事情有关连性。
启动后检查
DB server 检查
# dbmcli -e list alerthistory
CELL Server检查
# cellcli -e list alerthistory
— 一切正常 over —
对不起,这篇文章暂时关闭评论。