Troubleshooting 19c RAC: CRS db resource shows “UNKNOWN” state, srvctl start instance fails with CRS-2680
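On an Oracle 19c RAC, crsctl showed the db resource in the “UNKNOWN” state. The db instance could still be started with sqlplus, but srvctl status instance reported it as not running. Starting the instance manually with srvctl produced the following errors: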
[oracle@~]$ srvctl start instance -d -i INTS1
PRCR-1013 : Failed to start resource ora..db
PRCR-1064 : Failed to start resource ora..db on node
CRS-2680: Clean of 'ora..db' on '' failed
CRS-5802: Unable to start the agent process
srvctl remove instance and srvctl remove database had already been tried, and the problem persisted.
GI alert log
2020-12-05 15:52:11.687 [CRSD(7541)]CRS-2758: Resource 'ora.hmracdg.db' is in an unknown state.
2020-12-05 15:57:50.688 [CRSD(7541)]CRS-5828: Could not start agent '/u01/app/19.3.0/grid/bin/oraagent_oracle'. Details at (:CRSAGF00130:) {1:2872:4034} in /u01/app/grid/diag/crs/anbob1/crs/trace/crsd.trc.
2020-12-05 16:07:52.408 [CRSD(7541)]CRS-5828: Could not start agent '/u01/app/19.3.0/grid/bin/oraagent_oracle'. Details at (:CRSAGF00130:) {1:2872:4288} in /u01/app/grid/diag/crs/anbob1/crs/trace/crsd.trc.
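When the alert log is long, it can help to pull out only the agent start failures together with the trace file each one points to. The following is a minimal sketch, assuming one alert entry per line in the format shown above; the function name agent_start_failures is hypothetical and not part of any Oracle tool.

```shell
#!/bin/sh
# Hypothetical helper: for every CRS-5828 entry in a GI alert log,
# print the timestamp and the trace file named after " in ".
# Assumes each alert entry sits on a single line.
agent_start_failures() {
    grep 'CRS-5828' "$1" |
        sed -n 's|^\([0-9-]* [0-9:.]*\) .* in \(/[^ ]*\.trc\)\.*$|\1 \2|p'
}

# Usage (the alert log path is whatever your environment uses):
#   agent_start_failures /path/to/alert.log
```

Each output line pairs a failure time with the crsd.trc to read next, which is where the CRS log excerpt below comes from.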
crs log
2020-12-05 16:39:35.288 : CRSPE:585066240: [ INFO] {1:2872:5072} Expression Filter : ((LAST_SERVER == anbob01) AND (NAME == ora.scan1.vip))
2020-12-05 16:39:35.291 :UiServer:578762496: [ INFO] {1:2872:5072} Done for ctx=0x7fa7e003bb10
2020-12-05 16:39:39.934 :GIPCHTHR:3034482432: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30830loopCount 28
2020-12-05 16:40:00.294 : CRSD:597673728: [ NONE] {1:2872:4988} {1:2872:4988} Created alert : (:CRSAGF00130:) : Failed to start the agent /u01/app/19.3.0/grid/bin/oraagent_oracle
2020-12-05 16:40:00.294 : AGFW:597673728: [ INFO] {1:2872:4988} Rejecting pending msgs for ora.anbob.db 1 1
2020-12-05 16:40:00.294 : AGFW:597673728: [ INFO] {1:2872:4988} Rejecting msg: 4100
2020-12-05 16:40:00.294 : AGFW:597673728: [ INFO] {1:2872:4988} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_CLEAN[ora.anbob.db 1 1] ID 4100:11921
2020-12-05 16:40:00.294 : AGFW:597673728: [ INFO] {1:2872:4988} Can not stop the agent: /u01/app/19.3.0/grid/bin/oraagent_oracle because pid is not initialized
2020-12-05 16:40:00.294 : CRSPE:585066240: [ INFO] {1:2872:4988} Received reply to action [Clean] message ID: 11921
2020-12-05 16:40:00.294 : CRSPE:585066240: [ INFO] {1:2872:4988} RI [ora.anbob.db 1 1] new internal state: [STABLE] old value: [CLEANING]
2020-12-05 16:40:00.294 : CRSPE:585066240: [ INFO] {1:2872:4988} Fatal Error from AGFW Proxy: Unable to start the agent process
2020-12-05 16:40:00.294 : CRSPE:585066240: [ INFO] {1:2872:4988} CRS-2680: Clean of 'ora.anbob.db' on 'anbob01' failed
2020-12-05 16:40:00.294 : CRSPE:585066240: [ INFO] {1:2872:4988} Command [0x7fa7f446c290] has sent a progress reply:CRS-2680: Clean of 'ora.anbob.db' on 'anbob01' / for [ora.anbob.db]
2020-12-05 16:40:00.294 :UiServer:578762496: [ INFO] {1:2872:4988} Response: c4|5!ORDERk7|MESSAGEt57|CRS-2680: Clean of 'ora.anbob.db' on 'anbob01' failedk7|MSGTYPEt1|1k5|OBJIDt14|ora.anbob.dbk4|WAITt1|0
2020-12-05 16:40:00.295 : CRSPE:585066240: [ INFO] {1:2872:4988} Sequencer for [ora.anbob.db 1 1] has completed with error: CRS-5802: Unable to start the agent process
2020-12-05 16:40:00.295 : CRSPE:585066240: [ INFO] {1:2872:4988} Deleting RI-path from op-history:ora.anbob.db 1 1
The oraagent failed to start. On 11g, the thing to check is the agent log directory:
$ ls -ld /log//agent/crsd/oraagent_oracle
drwxrwxrwt. 2 oracle oinstall 4096 Aug 22 10:52 /log//agent/crsd/oraagent_oracle
Starting with Grid Infrastructure 12.1.0.2, each daemon's pid file exists not only in //.pid but also in /crsdata//output/.pid. As described in MOS 2028511.1, look at the newly created oraagent*.out files under /tmp:
/tmp/oragent_nnnn.out
Oracle Clusterware infrastructure error in ORAAGENT (OS PID 4976): Error in an OS-dependent function or service
Error category: -2, operation: open, location: SCLSB00009, OS error: 13
OS error message: Permission denied
Additional information: Call to open daemon stdout/stderr file failed
Oracle Clusterware infrastructure fatal error in ORAAGENT (OS PID 4976): Internal error (ID (:CLSB00126:)) - Failed to redirect daemon standard outputs using location /u01/app/grid/crsdata/anbob1/output and root name crsd_oraagent_oracle
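The key facts in this output are OS error 13 (Permission denied) and the directory named after "using location". A rough sketch of scanning a directory of agent .out files for exactly that combination follows; the function name scan_agent_out and the loose *agent*.out glob are assumptions made for illustration, not something taken from the MOS note.

```shell
#!/bin/sh
# Hypothetical helper: list agent .out files that record OS error 13
# and print the directory the agent could not write to.
# $1 is the directory to scan (defaults to /tmp).
scan_agent_out() {
    dir=${1:-/tmp}
    for f in "$dir"/*agent*.out; do
        [ -f "$f" ] || continue            # glob matched nothing
        if grep -q 'OS error: 13' "$f"; then
            echo "== $f"
            # Keep only the path that follows "using location"
            sed -n 's/.*using location \([^ ]*\).*/\1/p' "$f"
        fi
    done
}

# Usage:
#   scan_agent_out /tmp
```

The printed directory is the place whose ownership and mode should be inspected next.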
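cluvfy comp software -n all -verbose checks only the binary files, so it did not report any software permission problem. The permissions of the pid files under /crsdata//output/ had to be checked by hand.
It turned out that someone had run chown grid:oinstall * and chmod 775 * on every file in that directory. A DBA should treat the database with some respect: do not assume that granting 777 makes everything OK, and not every file under GRID_HOME is owned by grid. After a mistake like this, refer to How to check and fix file permissions on Grid Infrastructure environment (Doc ID 1931142.1) to repair the binary files, then fix the remaining files on the broken node using a healthy node as the reference.
The normal permissions of the pid files under GRID_HOME are as follows: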
-rw-r--r--. 1 root root 0 Jul 29 14:52 ./crs/init/lccn0
-rw-r--r--. 1 root root 5 Dec 1 08:35 ./crs/init/lccn0.pid
-rw-r--r--. 1 root root 0 Jul 29 14:51 ./ctss/init/lccn0
-rw-r--r--. 1 root root 5 Dec 1 08:35 ./ctss/init/lccn0.pid
-rw-r--r--. 1 grid oinstall 0 Jul 29 14:50 ./evm/init/lccn0
-rw-r--r--. 1 grid oinstall 5 Dec 1 08:34 ./evm/init/lccn0.pid
-rw-r--r--. 1 grid oinstall 5 Dec 1 08:34 ./gipc/init/lccn0
-rw-r--r--. 1 grid oinstall 5 Dec 1 08:34 ./gipc/init/lccn0.pid
-rw-r--r--. 1 grid oinstall 0 Jul 29 14:50 ./gpnp/init/lccn0
-rw-r--r--. 1 grid oinstall 5 Dec 1 08:34 ./gpnp/init/lccn0.pid
-rw-r--r--. 1 grid oinstall 0 Jul 29 14:50 ./mdns/init/lccn0
-rw-r--r--. 1 grid oinstall 5 Dec 1 08:34 ./mdns/init/lccn0.pid
-rw-r--r--. 1 root root 0 Jul 29 14:50 ./ohasd/init/lccn0
-rw-r--r--. 1 root root 5 Dec 1 08:34 ./ohasd/init/lccn0.pid
-rw-r--r--. 1 root root 0 Jul 29 14:54 ./ologgerd/init/lccn0
-rw-r--r--. 1 root root 5 Dec 1 08:35 ./ologgerd/init/lccn0.pid
-rw-r--r--. 1 root root 0 Jul 29 14:52 ./osysmond/init/lccn0
-rw-r--r--. 1 root root 5 Dec 1 08:35 ./osysmond/init/lccn0.pid
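Comparing a broken node against a healthy one is easier with a listing that omits sizes and timestamps, since those always differ between nodes. A minimal sketch, assuming Linux with GNU find and stat; the function names perm_snapshot and perm_diff are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "mode owner:group relative-path" for every
# regular file under directory $1, sorted by path.
# Relies on GNU stat (-c); runs in a subshell so the caller's cwd is kept.
perm_snapshot() {
    ( cd "$1" && find . -type f -exec stat -c '%a %U:%G %n' {} + | sort -k3 )
}

# Hypothetical helper: compare two snapshot files.
# Exit status 0 means identical, 1 means at least one line differs.
perm_diff() {
    diff "$1" "$2"
}

# Usage, run on each node against the same directory, then copy one
# snapshot across and compare:
#   perm_snapshot "$DIR" > /tmp/perm.$(hostname).txt
#   perm_diff /tmp/perm.goodnode.txt /tmp/perm.badnode.txt
```

Every line that diff reports is a file whose owner, group, or mode on the broken node deviates from the healthy node.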
Solution:
After the pid file permissions were corrected by hand and CRS was restarted, everything returned to normal.
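The manual correction can be made less error-prone by generating the chown and chmod commands from a listing taken on a healthy node and reviewing them before running anything. The sketch below assumes a reference file with one "mode owner:group path" entry per line and paths without whitespace; that file format and the function name emit_fix_commands are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: read reference lines "mode owner:group path" from
# file $1 and print the commands that would restore them.
# It only prints; nothing is changed until the output is reviewed and run.
emit_fix_commands() {
    while read -r mode owner path; do
        [ -n "$path" ] || continue         # skip blank or short lines
        printf 'chown %s %s\n' "$owner" "$path"
        printf 'chmod %s %s\n' "$mode" "$path"
    done < "$1"
}

# Usage, as root on the broken node, from the directory the paths are
# relative to:
#   emit_fix_commands /tmp/perm.goodnode.txt > /tmp/fix.sh
#   (review /tmp/fix.sh, then)  sh /tmp/fix.sh
#   crsctl stop crs
#   crsctl start crs
```

Printing the commands instead of executing them directly is deliberate: the original damage came from a blanket chown and chmod, so each change should be looked at before it is applied.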