Exadata instance crash with ORA-600 [ksz_cln_proc1] and restart failure due to the breakdown of one Cell Server (Case Study)
A while ago, both database instances of a two-node Oracle Exadata Machine X2-2 RAC environment crashed one after the other and then failed to start automatically. The ASM storage uses NORMAL redundancy with three cells per disk group, so in theory losing a single cell should not affect the database. Here is a brief record of the failure and the troubleshooting approach.
[oracle@kdexa1db01 (anbob1)oracle]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Fri Dec 1 09:51:38 2017
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORA-00304: requested INSTANCE_NUMBER is busy
Note:
The database is a RAC environment and node 2 was still active at that moment. I checked whether processes belonging to the crashed instance still existed, whether its shared memory had been released, whether CRS was already trying to restart the DB instance, and whether the parameter file really carried a duplicate instance_number. Right at that point instance 2 also crashed, and a manual startup of it hung.
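The checks themselves were straightforward; the following is a minimal shell sketch of that verification, assuming the instance name anbob1 and database name orarpt taken from the logs below (adjust to the real environment):

# Any background processes of the crashed instance still alive?
ps -ef | grep '[o]ra_.*_anbob1'

# Shared memory / semaphores still held by the dead instance?
# (sysresv ships with the RDBMS home and maps IPC resources to an ORACLE_SID)
ipcs -m
$ORACLE_HOME/bin/sysresv -l anbob1

# Is CRS already trying to restart the database resource?
crsctl stat res -t | grep -A 2 orarpt
srvctl status database -d orarpt

# Does the spfile really carry a duplicate instance_number?
sqlplus -s "/ as sysdba" <<'EOF'
create pfile='/tmp/init_anbob1_check.ora' from spfile;
EOF
grep -i instance_number /tmp/init_anbob1_check.ora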
Database instance checks
-- db1 alert log
Immediate Kill Session: sess: 0xa59411b58  OS pid: 32215
Fri Dec 01 09:40:17 2017
opiodr aborting process unknown ospid (32520) as a result of ORA-28
Fri Dec 01 09:40:22 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_dia0_16581.trc  (incident=2085703):
ORA-32701: Possible hangs up to hang ID=109905 detected
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/anbob1/incident/incdir_2085703/anbob1_dia0_16581_i2085703.trc
DIA0 requesting termination of session sid:2334 with serial # 64947 (ospid:13691) on instance 2
     due to a LOCAL, HIGH confidence hang with ID=109905.
     Hang Resolution Reason: Although the number of affected sessions did not
     justify automatic hang resolution initially, this previously ignored
     hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=109905.
Fri Dec 01 09:40:28 2017
DSKM process appears to be hung. Initiating system state dump.
Fri Dec 01 09:40:28 2017
System state dump requested by (instance=1, osid=16565 (GEN0)), summary=[system state dump request (ksz_check_ds)].
System State dumped to trace file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_diag_16567.trc
Fri Dec 01 09:40:45 2017
Immediate Kill Session#: 2661, Serial#: 19795
Immediate Kill Session: sess: 0xa493a7190  OS pid: 1420
... ... ..
Fri Dec 01 09:41:02 2017
ORA-00020: maximum number of processes (2000) exceeded
ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Fri Dec 01 09:41:22 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_dia0_16581.trc  (incident=2085704):
ORA-32701: Possible hangs up to hang ID=109909 detected
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/anbob1/incident/incdir_2085704/anbob1_dia0_16581_i2085704.trc
Fri Dec 01 09:41:40 2017
DIA0 terminating blocker (ospid: 1842 sid: 130 ser#: 63021) of hang with ID = 109906 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Although the number of affected sessions did not
     justify automatic hang resolution initially, this previously ignored
     hang was automatically resolved.
     by terminating session sid: 130 ospid: 1842
DIA0 successfully terminated session sid:130 ospid:1842 with status 31.
DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=109906.
DIA0 terminating blocker (ospid: 3355 sid: 2586 ser#: 14507) of hang with ID = 109907 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Although the number of affected sessions did not
     justify automatic hang resolution initially, this previously ignored
     hang was automatically resolved.
     by terminating session sid: 2586 ospid: 3355
DIA0 successfully terminated session sid:2586 ospid:3355 with status 31.
DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=109907.
Fri Dec 01 09:41:51 2017
DIA0 terminating blocker (ospid: 3311 sid: 2938 ser#: 12823) of hang with ID = 109908 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Although the number of affected sessions did not
     justify automatic hang resolution initially, this previously ignored
     hang was automatically resolved.
     by terminating session sid: 2938 ospid: 3311
DIA0 successfully terminated session sid:2938 ospid:3311 with status 31.
DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=109908.
DIA0 terminating blocker (ospid: 3333 sid: 899 ser#: 12861) of hang with ID = 109909 requested by master DIA0 process on instance 1
     Hang Resolution Reason: Although the number of affected sessions did not
     justify automatic hang resolution initially, this previously ignored
     hang was automatically resolved.
     by terminating session sid: 899 ospid: 3333
Fri Dec 01 09:41:57 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_pmon_16539.trc  (incident=2069655):
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x94B89D2A0], [3], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/anbob1/incident/incdir_2069655/anbob1_pmon_16539_i2069655.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/anbob1/trace/anbob1_pmon_16539.trc:
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x94B89D2A0], [3], [], [], [], [], [], [], [], [], []
PMON (ospid: 16539): terminating the instance due to error 472
Fri Dec 01 09:42:04 2017
ORA-1092 : opitsk aborting process
Instance terminated by PMON, pid = 16539
USER (ospid: 23180): terminating the instance
Termination issued to instance processes. Waiting for the processes to exit
Fri Dec 01 09:42:13 2017
Instance terminated by USER, pid = 23180
Fri Dec 01 09:43:48 2017
Starting ORACLE instance (normal)
... ... ...
Fri Dec 01 09:55:15 2017
MMON started with pid=48, OS id=18350
Fri Dec 01 09:55:15 2017
MMNL started with pid=50, OS id=18356
Fri Dec 01 09:55:28 2017
USER (ospid: 14048): terminating the instance due to error 304
Instance terminated by USER, pid = 14048
Fri Dec 01 09:57:23 2017
ORA-01565: Unable to open Spfile ?/dbs/spfile@.ora.
Note:
From the database alert log we can see that the DSKM (DiskMon) process started to hang. Because the database could no longer process transactions, sessions piled up until the process limit was exhausted (ORA-00020), followed by the ORA-600 [ksz_cln_proc1] internal error. This error is usually raised when ASM responds slowly, or DSKM hangs for some other reason, and PMON's fencing request fails while it is cleaning up the resources of a terminated session; this version has two related bugs. The instance also failed to start with ORA-304 and the spfile became unreadable, all of which points to a problem with ASM: the database could no longer read from or write to it. Earlier, while I was checking whether instance_number was duplicated, 'create pfile from spfile' had already failed with an spfile read error.
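If the spfile is stored in ASM, as is usual on Exadata, a quick reachability check along the following lines makes the failure obvious; this is a sketch only, and the +DATA spfile path is an assumption for illustration:

# As the grid owner: is ASM metadata reachable at all?
asmcmd spget                                   # location of the ASM instance's own spfile
asmcmd ls -l +DATA/ORARPT/spfileorarpt.ora     # assumed path of the database spfile

# From the RDBMS side, dumping the spfile to a pfile fails immediately
# when ASM I/O is broken (which is what happened here)
sqlplus -s "/ as sysdba" <<'EOF'
create pfile='/tmp/pfile_from_spfile.ora' from spfile;
EOF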
-- db1 ASM alert log
Fri Dec 01 09:42:00 2017
NOTE: ASM client anbob1:anbob disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_16769.trc
Fri Dec 01 09:45:33 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 213s, pid 16769 mbr 0x1
Fri Dec 01 09:46:33 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 273s, pid 16769 mbr 0x1
Fri Dec 01 09:47:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 333s, pid 16769 mbr 0x1
Fri Dec 01 09:48:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 393s, pid 16769 mbr 0x1
Fri Dec 01 09:48:55 2017
NOTE: timeout waiting for prior umbilllicus process to exit; 300s
Fri Dec 01 09:49:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 453s, pid 16769 mbr 0x1
Fri Dec 01 09:50:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 513s, pid 16769 mbr 0x1
Fri Dec 01 09:51:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 573s, pid 16769 mbr 0x1
Fri Dec 01 09:52:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 633s, pid 16769 mbr 0x1
Fri Dec 01 09:53:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 693s, pid 16769 mbr 0x1
Fri Dec 01 09:54:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 753s, pid 16769 mbr 0x1
Fri Dec 01 09:54:48 2017
NOTE: timeout waiting for prior umbilllicus process to exit; 300s
Fri Dec 01 09:55:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 813s, pid 16769 mbr 0x1
Fri Dec 01 09:56:34 2017
WARNING: client [anbob1:anbob] cleanup delayed; waited 873s, pid 16769 mbr 0x1
Fri Dec 01 09:57:32 2017
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14282.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.160.37/DATA_CD_05_kdexa1cel03 at offset 0 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:45 AU:0 offset:0 size:4096
path:o/192.168.160.37/DATA_CD_05_kdexa1cel03
 incarnation:0xe9681366 asynchronous result:'I/O error'
 subsys:OSS iop:0x7f163de0fa80 bufp:0x7f163ddbe400 osderr:0xc9 osderr1:0x0
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14282.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.160.37/DATA_CD_07_kdexa1cel03 at offset 0 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:41 AU:0 offset:0 size:4096
path:o/192.168.160.37/DATA_CD_07_kdexa1cel03
 incarnation:0xe9681362 asynchronous result:'I/O error'
 subsys:OSS iop:0x7f163de13680 bufp:0x7f163de78800 osderr:0xc9 osderr1:0x0
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14282.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.160.37/DATA_CD_10_kdexa1cel03 at offset 0 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:43 AU:0 offset:0 size:4096
path:o/192.168.160.37/DATA_CD_10_kdexa1cel03
... ... ...
Fri Dec 01 09:58:45 2017
NOTE: PST update grp = 1 completed successfully!
NOTE: initiating PST update: grp = 1, dsk = 38/0xe968135f, mask = 0x7e, op = clear
GMON updating disk modes for group 1 at 105 for pid 16, osid 13436
NOTE: group DATA: updated PST location: disk 0012 (PST copy 0)
NOTE: group DATA: updated PST location: disk 0008 (PST copy 1)
NOTE: cache closing disk 38 of grp 1: DATA_CD_03_KDEXA1CEL03
Fri Dec 01 09:58:45 2017
NOTE: process _b000_+asm1 (6940) initiating offline of disk 36.3915912029 (DATA_CD_08_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 37.3915912030 (DATA_CD_00_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 39.3915912032 (DATA_CD_09_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 40.3915912033 (DATA_CD_04_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 41.3915912034 (DATA_CD_07_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 42.3915912035 (DATA_CD_11_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 43.3915912036 (DATA_CD_10_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 44.3915912037 (DATA_CD_01_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 45.3915912038 (DATA_CD_05_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 46.3915912039 (DATA_CD_06_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
NOTE: process _b000_+asm1 (6940) initiating offline of disk 47.3915912040 (DATA_CD_02_KDEXA1CEL03) with mask 0x7e[0x15] in group 1
WARNING: Disk 36 (DATA_CD_08_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 37 (DATA_CD_00_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 39 (DATA_CD_09_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 40 (DATA_CD_04_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 41 (DATA_CD_07_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 42 (DATA_CD_11_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 43 (DATA_CD_10_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 44 (DATA_CD_01_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 45 (DATA_CD_05_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 46 (DATA_CD_06_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 47 (DATA_CD_02_KDEXA1CEL03) in group 1 in mode 0x15 is now being taken offline on ASM inst 1
WARNING: Disk 38 (DATA_CD_03_KDEXA1CEL03) in group 1 in mode 0x1 is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 36/0xe968135d, mask = 0x6a, op = clear
Note:
ASM was delayed while cleaning up the client process, then reported Cell storage I/O errors against cel03 (192.168.160.37), and finally took all disks belonging to cel03 offline. In the V$ASM_DISK view the MOUNT_STATUS of those disks shows "MISSING"; they can be checked in the database with the script below:
col PATH for a20
col DG_NAME for a15
col DG_STATE for a10
col FAILGROUP for a15
col name for a20
set lines 200 pages 10000
select dg.name dg_name,
       dg.state dg_state,
       dg.type,
       d.disk_number dsk_no,
       d.path,
       d.mount_status,
       d.FAILGROUP,
       d.name,
       d.state
  from v$asm_diskgroup dg, v$asm_disk d
 where dg.group_number = d.group_number
 order by dg_name, dsk_no;
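To narrow the output down to only the disks that have dropped out, a filtered variant such as the sketch below can be run against the ASM instance ('MISSING' and 'OFFLINE' are standard V$ASM_DISK values):

# Run with the +ASM1 environment set
sqlplus -s "/ as sysdba" <<'EOF'
set lines 200 pages 100
col path for a45
col failgroup for a22
select d.failgroup, d.disk_number, d.path, d.mount_status, d.mode_status, d.state
  from v$asm_disk d
 where d.mount_status = 'MISSING'
    or d.mode_status  = 'OFFLINE'
 order by d.failgroup, d.disk_number;
EOF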
-- db2 db alert log
-- node2 db alert log
Fri Dec 01 09:45:14 2017
ORA-00020: maximum number of processes (2000) exceeded
ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Fri Dec 01 09:45:38 2017
Immediate Kill Session#: 191, Serial#: 8141
...
Immediate Kill Session: sess: 0xa51130b58  OS pid: 18573
Immediate Kill Session#: 2808, Serial#: 13305
Immediate Kill Session: sess: 0xa31411a00  OS pid: 21909
Immediate Kill Session#: 2948, Serial#: 11987
Immediate Kill Session: sess: 0xa4140b1d8  OS pid: 21913
Fri Dec 01 09:45:39 2017
opiodr aborting process unknown ospid (14209) as a result of ORA-28
Fri Dec 01 09:46:39 2017
DSKM process appears to be hung. Initiating system state dump.
Fri Dec 01 09:47:15 2017
IPC Send timeout detected. Sender: ospid 21743 [oracle@kdexa1db02.hebei.mobile.com (TNS V1-V3)]
Receiver: inst 1 binc 429466025 ospid 16596
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29713 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
...
IPC Send timeout detected. Sender: ospid 29739 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29709 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429466025 ospid 16591
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 201
IPC Send timeout to 1.1 inc 4 for msg type 36 from opid 148
IPC Send timeout: Terminating pid 201 osid 29739
IPC Send timeout: Terminating pid 148 osid 29709
...
IPC Send timeout detected. Sender: ospid 29785 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29783 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 230
Fri Dec 01 09:47:16 2017
IPC Send timeout detected. Sender: ospid 29789 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429466025 ospid 16596
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 227
IPC Send timeout to 1.2 inc 4 for msg type 32 from opid 232
IPC Send timeout: Terminating pid 230 osid 29785
IPC Send timeout: Terminating pid 227 osid 29783
IPC Send timeout: Terminating pid 232 osid 29789
...
Receiver: inst 1 binc 429466025 ospid 16591
IPC Send timeout to 1.1 inc 4 for msg type 32 from opid 321
IPC Send timeout: Terminating pid 321 osid 29832
Fri Dec 01 09:47:18 2017
IPC Send timeout detected. Sender: ospid 29836 [oracle@kdexa1db02.hebei.mobile.com]
Receiver: inst 1 binc 429465961 ospid 16585
IPC Send timeout to 1.0 inc 4 for msg type 12 from opid 340
IPC Send timeout: Terminating pid 340 osid 29836
Fri Dec 01 09:47:37 2017
IPC Send timeout to 1.2 inc 4 for msg type 32 from opid 55
IPC Send timeout: Terminating pid 55 osid 21743
Fri Dec 01 09:48:08 2017
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/orarpt2/trace/orarpt2_pmon_18433.trc  (incident=594684):
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x9FD5C5C10], [3], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/orarpt/orarpt2/incident/incdir_594684/orarpt2_pmon_18433_i594684.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/oracle/diag/rdbms/orarpt/orarpt2/trace/orarpt2_pmon_18433.trc:
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x9FD5C5C10], [3], [], [], [], [], [], [], [], [], []
PMON (ospid: 18433): terminating the instance due to error 472
Fri Dec 01 09:48:09 2017
ORA-1092 : opitsk aborting process
Fri Dec 01 09:48:10 2017
ORA-1092 : opitsk aborting process
Fri Dec 01 09:48:10 2017
License high water mark = 1902
Instance terminated by PMON, pid = 18433
USER (ospid: 3194): terminating the instance
Instance terminated by USER, pid = 3194
Log on to Cell Server 03 and check its alert log
[root@kdexa1cel03 localhost]# cd trace
/var/log/oracle/diag/asm/cell/localhost/trace
[root@kdexa1cel03 trace]# ls -l alert.log
-rw-rw---- 1 root celladmin 773 Feb  2  2012 alert.log
[root@kdexa1cel03 trace]# cat alert.log
cat: alert.log: Input/output error
[root@kdexa1cel03 trace]# more alert.log
Bus error
[root@kdexa1cel03 ~]# ls
ls: reading directory .: Input/output error
Note:
The local file system on Cell Server 03 had failed. Because the database and ASM could no longer communicate, and to restore service as quickly as possible, we decided to restart CRS on node 1. After the CRS restart, database instance 1 came back to normal and the previously hung startup of instance 2 also recovered; the only difference was that, with one cell missing, the total and usable space of the ASM disk groups was reduced. Note that ASM will not immediately drop these disks or rebalance; the DISK_REPAIR_TIME attribute needs to be adjusted.
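The CRS bounce on node 1 amounted to the following; a minimal sketch run as root, with the Grid Infrastructure home path an assumption:

GRID_HOME=/u01/app/11.2.0/grid    # assumed path; adjust to the actual installation

# Stop the whole clusterware stack on this node; -f forces it past hung resources
$GRID_HOME/bin/crsctl stop crs -f

# Start it again and watch the resources come back online
$GRID_HOME/bin/crsctl start crs
$GRID_HOME/bin/crsctl stat res -t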
DISK_REPAIR_TIME
The DISK_REPAIR_TIME attribute of a disk group controls the maximum acceptable outage duration. Once one or more disks become unavailable to ASM, it waits up to the interval specified by DISK_REPAIR_TIME for the disk(s) to come back online. If they return within this interval, a resync occurs in which only the extents modified while the disks were offline are written back to them. If they do not return within this interval, ASM initiates a forced drop of the disk(s), which triggers a rebalance to restore redundancy on the surviving disks. If the dropped disk(s) come back online later, they are added to the disk group with all existing extents on them discarded, and another rebalance begins. In other words, DISK_REPAIR_TIME is the window you have to repair the failure; it is also the countdown timer after which ASM drops the offlined disk(s). The default value is 3.6 hours.
(1). Check disk_repair_time for all disk groups. Log on to the ASM instance and run the following query:
SQL> select dg.name, a.value
  2    from v$asm_diskgroup dg, v$asm_attribute a
  3   where dg.group_number = a.group_number
  4     and a.name = 'disk_repair_time';
(2). Modify disk_repair_time for all disk groups:
SQL> ALTER DISKGROUP <DISKGROUP NAME> SET ATTRIBUTE 'DISK_REPAIR_TIME'='8.5H';
If all disks belonging to the cell are already OFFLINE, as in this case, the method above can no longer be used, because the cell itself is unavailable; instead, extend disk_repair_time for the already offlined disks with:
SQL> ALTER DISKGROUP <DISKGROUP NAME> OFFLINE DISKS IN FAILGROUP <FAILGROUP NAME> DROP AFTER 24H;
Verify that all disk groups have been updated:
SQL> select dg.name ,a.value from v$asm_diskgroup
  2  dg, v$asm_attribute a where dg.group_number=a.group_number and
  3  a.name ='disk_repair_time';

NAME                           VALUE
------------------------------ -------------------------------
DATA                           24h
DBFS_DG                        24h
FLASHTS                        24h
RECO                           24h
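The countdown on the already offlined disks can also be watched directly; the REPAIR_TIMER column of V$ASM_DISK shows the seconds left before a forced drop. A small sketch:

sqlplus -s "/ as sysdba" <<'EOF'
set lines 200 pages 100
col path for a45
select group_number, disk_number, path, mode_status, repair_timer
  from v$asm_disk
 where mode_status = 'OFFLINE'
 order by group_number, disk_number;
EOF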
Cell Server 03 checks
1. No faults shown on the machine's front panel;
2. Log on to the ILOM
[root@kdexa1db01 ~]# ssh 192.168.60.xx
Password:

Oracle(R) Integrated Lights Out Manager
Version 3.0.16.10.d r74499
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.

-> show /SP/faultmgmt

 /SP/faultmgmt
    Targets:
        shell
    Properties:
    Commands:
        cd
        show

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm faulty -a
No faults found
faultmgmtsp>
3. Check the OS-layer logs
[root@kdexa1cel03 /]# dmesg
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
sd 0:2:1:0: rejecting I/O to offline device
...

[root@kdexa1cel03 /]# lsscsi
[0:0:20:0]   enclosu SUN      HYDE12           0341  -
[0:2:0:0]    disk    LSI      MR9261-8i        2.13  /dev/sda
[0:2:1:0]    disk    LSI      MR9261-8i        2.13  /dev/sdb
[0:2:2:0]    disk    LSI      MR9261-8i        2.13  /dev/sdc
[0:2:3:0]    disk    LSI      MR9261-8i        2.13  /dev/sdd
[0:2:4:0]    disk    LSI      MR9261-8i        2.13  /dev/sde
[0:2:5:0]    disk    LSI      MR9261-8i        2.13  /dev/sdf
[0:2:6:0]    disk    LSI      MR9261-8i        2.13  /dev/sdg
[0:2:7:0]    disk    LSI      MR9261-8i        2.13  /dev/sdh
[0:2:8:0]    disk    LSI      MR9261-8i        2.13  /dev/sdi
[0:2:9:0]    disk    LSI      MR9261-8i        2.13  /dev/sdj
[0:2:10:0]   disk    LSI      MR9261-8i        2.13  /dev/sdk
[0:2:11:0]   disk    LSI      MR9261-8i        2.13  /dev/sdl
[1:0:0:0]    disk    Unigen   PSA4000          1100  /dev/sdm
[8:0:0:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdn
[8:0:1:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdo
[8:0:2:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdp
[8:0:3:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdq
[9:0:0:0]    disk    ATA      MARVELL SD88SA02 D21Y  /dev/sdr
...

[root@kdexa1cel03 mptsas]# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sda1[2](F) sdb1[1]
      120384 blocks [2/1] [_U]

md5 : active raid1 sda5[0] sdb5[2](F)
      10485696 blocks [2/1] [U_]

md6 : active raid1 sda6[2](F) sdb6[1]
      10485696 blocks [2/1] [_U]

md7 : active raid1 sda7[0] sdb7[2](F)
      3145664 blocks [2/1] [U_]

md8 : active raid1 sda8[2](F) sdb8[1]
      3145664 blocks [2/1] [_U]

md2 : active raid1 sda9[0] sdb9[2](F)
      2097088 blocks [2/1] [U_]

md11 : active raid1 sda11[2](F) sdb11[1]
      5242752 blocks [2/1] [_U]

md1 : active raid1 sda10[2](F) sdb10[1]
      714752 blocks [2/1] [_U]

unused devices:

[root@kdexa1cel03 mptsas]# mdadm --detail /dev/md6
/dev/md6:
        Version : 0.90
  Creation Time : Thu Feb  2 15:16:27 2012
     Raid Level : raid1
     Array Size : 10485696 (10.00 GiB 10.74 GB)
  Used Dev Size : 10485696 (10.00 GiB 10.74 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 6
    Persistence : Superblock is persistent

    Update Time : Fri Dec  1 13:49:39 2017
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       22        1      active sync   /dev/sdb6

       2       8        6        -      faulty spare   /dev/sda6
The local file system was in a bad state and many logs could not be read; in this situation the only remaining option was to reboot the operating system.
When Bus error messages appear on Linux, a clean shutdown is usually no longer possible; sure enough, after trying various shutdown commands the machine would not go down, and someone had to go to the data center and press the power button. If the hosts are not labelled, you first have to work out which physical machine is cel03. This can be done by lighting the cell host's front locator LED, and the same approach (lighting the disk's service LED) is used when replacing a disk.
Ways to identify the cell host by lighting its locator LED:
Method 1
CellCLI> alter cell led on;
Cell kdexa1cel03 Locator LED switched on

CellCLI> alter cell led off;
Cell kdexa1cel03 Locator LED switched off
Method 2
[root@kdexa1db01 ~]# ssh kdexa1cel03-ilom
Password:

Oracle(R) Integrated Lights Out Manager
Version 3.0.16.10.d r74499
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.

-> set /SYS/LOCATE value=fast_blink
Set 'value' to 'fast_blink'

-> set /SYS/LOCATE value=off
Set 'value' to 'off'
After rebooting the cel03 host with the power button, the host came back up, the cell server software recovered as well, and its disks automatically rejoined the ASM disk groups, provided DISK_REPAIR_TIME had not been exceeded.
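Whether the disks really rejoined can be confirmed from both sides; the sketch below assumes the cell name kdexa1cel03 and uses standard CellCLI attributes and V$ views:

# On the recovered cell: every grid disk should report asmmodestatus=ONLINE
ssh root@kdexa1cel03 "cellcli -e list griddisk attributes name,status,asmmodestatus"

# On an ASM instance: no disk should remain OFFLINE, and the resync progress
# is visible in V$ASM_OPERATION
sqlplus -s "/ as sysdba" <<'EOF'
set lines 200 pages 100
select group_number, count(*) offline_disks
  from v$asm_disk
 where mode_status = 'OFFLINE'
 group by group_number;

select group_number, operation, state, sofar, est_work, est_minutes
  from v$asm_operation;
EOF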
Collect logs and open an SR
1. Cell alert log
# /var/log/oracle/diag/asm/cell/localhost/trace
2. sundiag
# /opt/oracle.SupportTools/sundiag.sh
3. OS / hardware logs
# dmesg or cat /var/log/dmesg
# cat /var/log/messages
4. ILOM snapshot

Via the ILOM web interface: omitted here.

Via the shell:
# ssh kdexa1cel03-ilom
Password:

Oracle(R) Integrated Lights Out Manager
Version 3.0.16.10.d r74499
Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.

-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://root:welcome@192.168.60.42/tmp
-> cd /SP/diag/snapshot
-> show
Check the status a few times; after several minutes it shows "Snapshot Complete", and a compressed file is generated and transferred by SFTP to the /tmp directory of the remote host given above.
5. sosreport (as root)
# sosreport
Summary:
The file system failure on the cel03 storage server caused ASM to hang and both database instances to crash. Even though the disk groups use NORMAL redundancy, the database instances could not communicate with ASM at that point; restarting CRS recovered them, and the remaining two cells continued to serve the database. After extending disk_repair_time and then force-rebooting the cel03 operating system, everything returned to normal. Afterwards the Exadata hardware team found no hardware fault; the Oracle Linux team suspected a file system problem and suggested it could be fixed with e2fsck, which was not tried, and it was never confirmed whether the root cause was an issue in the MD software RAID layer.
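For reference only, the e2fsck route suggested by the Oracle Linux team would look roughly like the sketch below; it was not attempted in this case, the md device names come from the /proc/mdstat output above, and the file systems would have to be unmounted (for example from a rescue boot of the cell) first:

# NOT attempted in this case; sketch only
mdadm --detail /dev/md6        # confirm the array state first

e2fsck -n /dev/md6             # read-only check, makes no changes
e2fsck -f -y /dev/md6          # forced repair, answering yes to all fixes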