After OS reboot, Ohasd(cssd) start fail is due to OLR corrupted
前几天帮助同事处理了个案例, 主机意外重启后数据库无法启动, 环境是11.2.0.3 standalone on aix, 用的是ASM.最后确认是OLR损坏, 因为当时没有记录具体流程,在这里只简单的记录.
# check init.ohasd process is running ps -ef | grep init.ohasd # if not the run as root /etc/init.d/init.ohasd run >/dev/null 2>&1
到这步发现HAS已启动,但是cssd都没启动, cssd和一些local OCR做为资源是随HAS启动的, 如果没有数据库使用ASM, 默认是不会自动启动cssd和asm的。 但是使用crsctl start res ora.cssd -init又提示没有资源, 查看GI alert log 如下
# gi alert [ohasd(7210)]CRS-8017:location: /etc/oracle/lastgasp has 4 reboot advisory log files, 0 were announced and 0 errors occurred 2015-10-18 12:51:23.952 [ohasd(7210)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'. 2015-10-18 13:52:40.249 [ohasd(8085)]CRS-2112:The OLR service started on node anbob. 2015-10-18 13:52:40.314 [ohasd(8085)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'. 2015-10-18 14:01:57.696 [ohasd(8668)]CRS-2112:The OLR service started on node anbob. 2015-10-18 14:01:57.761 [ohasd(8668)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'. 2015-10-18 15:03:27.157 [ohasd(11026)]CRS-2112:The OLR service started on node anbob. 2015-10-18 15:03:27.202 [ohasd(11296)]CRS-1339:Oracle High Availability Service aborted due to an unexpected error [Failed to initialize Oracle Local Registry]. Details at (:OHAS00106:) in /oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log. 2015-10-18 15:03:27.220 [ohasd(11026)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'. 2015-10-18 15:23:08.435 [ohasd(12135)]CRS-2112:The OLR service started on node anbob. 2015-10-18 15:23:08.499 [ohasd(12135)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'. # OHASD.LOG [grid@anbob anbob]$ vi /oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log 2015-10-18 15:03:27.159: [ OCRSRV][1032845088]th_init: Local listener did not reach valid state 2015-10-18 15:03:27.159: [ CRSOCR][555579168] CAAOCR GET Debug sblevel Level[default]: 0 ... [ clsdmt][4118787840]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=anbobDBG_OHASD)) 2015-10-18 15:03:27.170: [ clsdmt][4118787840]PID for the Process [11026], connkey 8 2015-10-18 15:03:27.176: [ default][555579168] Ohasd Daemon Started. 2015-10-18 15:03:27.176: [ CLSLG][3747604224] Last Gasp Monitor thread started 2015-10-18 15:03:27.176: [ CLSLG][3747604224] processing Last Gasp disk location /etc/oracle/lastgasp 2015-10-18 15:03:27.177: [ CLSE][555579168]clse_get_auth_loc: Returning default authloc: /oracle/app/oracle/product/11.2.0/grid/auth/ohasd/anbob 2015-10-18 15:03:27.177: [ default][555579168] AuthLoc /oracle/app/oracle/product/11.2.0/grid/auth/ohasd/anbob 2015-10-18 15:03:27.177: [ default][555579168] PE Engine: NEW 2015-10-18 15:03:27.177: [ default][555579168] Using OCR batch ops : ENABLED 2015-10-18 15:03:27.177: [ default][555579168] RD registrations and Clusterization disabled. 2015-10-18 15:03:27.177: [ CLSLG][3747604224] monitoring new interface 0.0.0.0 2015-10-18 15:03:27.178: [ default][555579168][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11)) 2015-10-18 15:03:27.178: [CLSFRAME][555579168] Inited lsf context 0x220efe0 2015-10-18 15:03:27.178: [CLSFRAME][555579168] Initing CLS Framework messaging 2015-10-18 15:03:27.179: [ CLSVER][555579168] Static Version 11.2.0.1.0 2015-10-18 15:03:27.179: [ default][555579168][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11)) 2015-10-18 15:03:27.180: [UiServer][555579168] UI Comms initalize() 1 2015-10-18 15:03:27.181: [CLSFRAME][555579168] New Framework state: 2 2015-10-18 15:03:27.181: [CLSFRAME][555579168] M2M is starting... 2015-10-18 15:03:27.182: [ CRSCOMM][555579168] m_pClscCtx=0x22c8030m_pUgblm=0x22d32f0 2015-10-18 15:03:27.182: [ CRSCOMM][555579168] Starting send thread 2015-10-18 15:03:27.182: [ CRSCOMM][555579168] IPC Listener instantiated for: (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11)) 2015-10-18 15:03:27.182: [ CRSCOMM][205494016] clsIpc: sendWork thread started. 2015-10-18 15:03:27.182: [ CRSCOMM][555579168] IPC Listener started listening. 2015-10-18 15:03:27.183: [ CRSCOMM][4097808128] IPCL thread started listening 2015-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: AgfwProxySrvTM 2015-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: OcrModuleTM 2015-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: PolicyEngineTM 2015-10-18 15:03:27.184: [CLSFRAME][555579168] Starting thread model named: SharedThreadTM 2015-10-18 15:03:27.184: [CLSFRAME][555579168] Starting thread model named: UiServerTM 2015-10-18 15:03:27.184: [CLSFRAME][555579168] New Framework state: 3 2015-10-18 15:03:27.185: [ CRSRPT][3699324672] Enabled 2015-10-18 15:03:27.185: [ CRSPE][3701425920] PE Role|State Update: old role [INVALID] new [INVALID]; old state [Not yet initialized] new [Enabling: waiting for role] 2015-10-18 15:03:27.185: [ CRSSE][3699324672] SE module master election disabled 2015-10-18 15:03:27.185: [ CRSSE][3699324672] Master Change Event; New Master Node ID:0 This Node's ID:0 2015-10-18 15:03:27.189: [ CRSPE][3701425920] Sent request to write event sequence number 1400000 to repository 2015-10-18 15:03:27.189: [ CRSPE][3701425920] Reading (1) servers 2015-10-18 15:03:27.189: [ CRSPE][3701425920] There are no resource types to read. 2015-10-18 15:03:27.189: [ CRSPE][3701425920] There are no resources to read. 2015-10-18 15:03:27.191: [ CRSPE][3701425920] Wrote new event sequence to repository 2015-10-18 15:03:27.192: [ CRSPE][3701425920] Reading (1) server pools 2015-10-18 15:03:27.196: [ CRSPE][3701425920] Finished reading configuration. Parsing... 2015-10-18 15:03:27.202: [ CRSPE][3701425920] Parsing server pools... 2015-10-18 15:03:27.202: [ CRSOCR][1032845088] OCR context init failure. Error: PROCL-24: Error in the messaging layer Messaging error [18] 2015-10-18 15:03:27.202: [ default][1032845088] OLR initalization failured, rc=24 2015-10-18 15:03:27.202: [ CRSPE][3701425920] Parsed and validated SERVERPOOL: Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED 2015-10-18 15:03:27.202: [ default][1032845088]Created alert : (:OHAS00106:) : Failed to initialize Oracle Local Registry 2015-10-18 15:03:27.202: [ CRSPE][3701425920] Server pools parsed 2015-10-18 15:03:27.202: [ default][1032845088][PANIC] OHASD exiting; Could not init OLR 2015-10-18 15:03:27.202: [ CRSPE][3701425920] Server Pool Free has been registered 2015-10-18 15:03:27.202: [ default][1032845088] Done. 2015-10-18 15:03:27.202: [ CRSPE][3701425920] Cluster reboot took place. 2015-10-18 15:03:27.203: [ CRSPE][3701425920] Configuration has been parsed 2015-10-18 15:03:27.203: [ default][3697223424][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_UI_SOCKET)) 2015-10-18 15:03:27.204: [UiServer][3697223424] UI socket on: (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_UI_SOCKET)) 2015-10-18 15:03:27.204: [ default][3697223424][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET)) 2015-10-18 15:03:27.204: [UiServer][3697223424] UI socket on: (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET)) 2015-10-18 15:03:27.204: [UiServer][3695122176] UI comms listening for events. 2015-10-18 15:03:27.204: [CLSFRAME][3711932160] Module Enabling is complete
从上面可以看到OLR初始化失败,应该是olr miss或corrupted,因为是Standalone所有找个本地的olr备份还原一下应该就可以解决。
[grid@anbob ~]$ cd $ORACLE_HOME [grid@anbob grid]$ cd cdata [grid@anbob cdata]$ ls anbob localhost [grid@anbob cdata]$ cd anbob [grid@anbob anbob]$ ls backup_20140818_113812.olr [grid@anbob anbob]$ ls -lrt total 5260 -rwxr-xr-x 1 grid oinstall 5386240 Aug 18 2014 backup_20140818_113812.olr [grid@anbob anbob]$ ocrconfig -local -showbackup PROTL-25: Manual backups for the Oracle Local Registry are not available
可以看到有个2014年的一个olr备份,ocr备份存放在$GRID_HOME/cdata, 其它没有手动备份的。
[grid@anbob anbob]$ ls -lrt /etc/oracle/ total 24 drwxr-xr-x 3 root oinstall 4096 Oct 18 04:58 scls_scr drwxrwxr-x 5 root oinstall 4096 Oct 18 04:58 oprocd -rw-r--r-- 1 root root 0 Oct 18 04:58 olr.loc.orig -rw-r--r-- 1 root oinstall 130 Oct 18 04:58 olr.loc -rw-r--r-- 1 root root 16 Oct 18 04:58 ocr.loc.orig -rw-r----- 1 grid oinstall 95 Oct 18 04:58 ocr.loc drwxrwx--- 3 root oinstall 4096 Oct 18 13:33 lastgasp
# To restore OLR
crsctl stop has ocrconfig -local -restore /oracle/app/oracle/product/11.2.0/grid/cdata/anbob/backup_20140818_113812.olr crsctl start has crsctl start res ora.cssd -init
终于可以显示资源了,并且cssd已自动启动,剩下的工作就是手动启asm,mount asm磁盘组,启数据库
su - grid sqlplus / as sysasm # startup asm alter diskgroup mount su - oracle # startup database;
这里因为刚好有以前的备份,如果没有备份时可以参考NOTE 1539020.1, 重建OLR
In an environment with fresh GI installation:
1. Deconfig the existing clusterware, as this is only done on the problematic node, the other nodes should have Clusterware up and running, this command should only deconfig the clusterware configuration for local node, it should not touch/change OCR.
# /crs/install/rootcrs.pl -deconfig -force
2. Rerun root.sh
# /root.sh
对不起,这篇文章暂时关闭评论。