Troubleshooting Oracle ASM diskgroup dismount with ORA-15335 ORA-15066 ORA-15196 when deleting an instance with DBCA
A few days ago we hit a problem on a two-node Oracle 11.2.0.4 RAC. While a colleague was using DBCA to delete an already-damaged database instance, the only surviving database instance unexpectedly crashed. Further checks showed that the ASM diskgroups were unavailable, and the ASM alert log reported corrupted ASM disk headers, ASM metadata corruption, and ORA-15196: invalid ASM block header. Why would deleting a database instance make an ASM diskgroup unavailable and expose ASM metadata corruption?
ASM alert log
WARNING: cache read a corrupt block: group=3(DATA) dsk=3 blk=3 disk=3 (DATA_0003) incarn=3916228183 au=0 blk=3 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
NOTE: a corrupted block from group DATA was dumped to /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc
WARNING: cache read (retry) a corrupt block: group=3(DATA) dsk=3 blk=3 disk=3 (DATA_0003) incarn=3916228183 au=0 blk=1 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
ERROR: cache failed to read group=3(DATA) dsk=0 blk=1 from disk(s): 0(DATA_0000)
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
NOTE: cache initiating offline of disk 3 group DATA
NOTE: process _user84719_+asm2 (84719) initiating offline of disk 3.3916228183 (DATA_0000) with mask 0x7e in group 3
NOTE: initiating PST update: grp = 3, dsk = 0/0xe96ce657, mask = 0x6a, op = clear
Thu Apr 11 18:10:30 2024
GMON updating disk modes for group 3 at 17 for pid 34, osid 84719
ERROR: Disk 3 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 3)
Thu Apr 11 18:10:30 2024
NOTE: cache dismounting (not clean) group 3/0xAD3C16E3 (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 110697, image: oracle@xryhxyw-db02 (B000)
WARNING: Offline for disk DATA_0003 in mode 0x7f failed.
Thu Apr 11 18:10:30 2024
NOTE: halting all I/Os to diskgroup 3 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc (incident=217233):
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0003" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [3] [0 != 1]
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_217233/+ASM2_ora_84719_i217233.trc
Note:
The log shows that the header-area metadata of disk 3 (DATA_0003) in the DATA diskgroup is corrupted; AU 0 block 3 is normally the Allocation Table (AT). The ORA-15196 arguments [endian_kfbh] [0 != 1] indicate that the endian field of the block header read 0 where 1 was expected, consistent with the block having been zeroed out. Because the diskgroup uses external redundancy, ASM cannot simply offline the disk, so the whole diskgroup is dismounted.
Verifying the disks
#kfod disks=all status=true
--------------------------------------------------------------------------------
 Disk          Size Path                     User     Group
================================================================================
   1:    1048576 Mb /dev/mapper/arch01       grid     asmadmin
   2:    1048576 Mb /dev/mapper/data01       grid     asmadmin
   3:    1048576 Mb /dev/mapper/data02       grid     asmadmin
   4:    1048576 Mb /dev/mapper/data03       grid     asmadmin
   5:    1048576 Mb /dev/mapper/data04       grid     asmadmin
   6:    1048576 Mb /dev/mapper/data05       grid     asmadmin
   7:    1048576 Mb /dev/mapper/data06       grid     asmadmin
   8:    1048576 Mb /dev/mapper/data07       grid     asmadmin
   9:      20480 Mb /dev/mapper/ocr01        grid     asmadmin
  10:      20480 Mb /dev/mapper/ocr02        grid     asmadmin
  11:      20480 Mb /dev/mapper/ocr03        grid     asmadmin
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME
================================================================================

grid@db002:/home/grid #kfed read /dev/mapper/data02
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
7FBC09B75400 00000000 00000000 00000000 00000000 [..............
grid@db002:/home/grid #kfed read /dev/mapper/data03
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
grid@db002:/home/grid #kfed read /dev/mapper/data04
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
grid@db002:/home/grid #kfed read /dev/mapper/data06
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483649 ; 0x008: disk=1
kfbh.check: 402677388 ; 0x00c: 0x18005e8c
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISK ; 0x000: length=8
kfdhdb.driver.reserved[0]: 0 ; 0x008: 0x00000000
kfdhdb.driver.reserved[1]: 0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]: 0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000
Note:
As you can see, not all of the disk headers are corrupted.
Verifying more of the data
grid@db002:/home/grid #dd if=/dev/mapper/data03 bs=1024 count=1024 skip=4 |hexdump
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.00940994 s, 111 MB/s
00ff000 0000 0000 0000 0000 0000 0000 0000 0000
00ff010 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
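Instead of eyeballing hexdump output, you can confirm that a whole region is zeroed by comparing it byte-for-byte against /dev/zero with GNU cmp. A minimal sketch, run here against a scratch file standing in for the real device (substitute a path such as /dev/mapper/data03 in practice):

```shell
# Scratch "disk": 2 MiB of zeros, with a signature written just after the first 1 MiB.
dd if=/dev/zero of=/tmp/fakedisk.img bs=1M count=2 2>/dev/null
printf 'ORCLDISK' | dd of=/tmp/fakedisk.img bs=1 seek=1048576 conv=notrunc 2>/dev/null

# cmp -s -n LIMIT exits 0 only if the first LIMIT bytes match /dev/zero exactly.
if cmp -s -n 1048576 /tmp/fakedisk.img /dev/zero; then
  echo "first 1 MiB is all zeros"
else
  echo "first 1 MiB contains data"
fi
```

On the corrupted disks above this check would report all zeros for the first 1 MB, matching what the hexdump shows.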
The first 1 MB is all zeros. You can read further with dd and hexdump as well; by default hexdump collapses consecutive identical lines and prints a single * in their place. A short hexdump demonstration follows.
hexdump demonstration
root@19c1:/root #dd if=/dev/zero of=/root/test.dd count=5 bs=1M
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.00279103 s, 1.9 GB/s
root@19c1:/root #hexdump /root/test.dd
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0500000
root@19c1:/root #hexdump /root/test.dd -n 64
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
root@19c1:/root #hexdump -v /root/test.dd -n 64
0000000 0000 0000 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040
Repairing the disk headers
grid@db002:/home/grid #kfed repair /dev/mapper/data01
grid@db002:/home/grid #kfed repair /dev/mapper/data02
grid@db002:/home/grid #kfed repair /dev/mapper/data03
grid@db002:/home/grid #kfed repair /dev/mapper/data04
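kfed repair restores block 0 from the backup copy of the disk header that ASM (11.1.0.7 and later) keeps in the second-to-last block of allocation unit 1. Even so, before attempting any repair it is prudent to save the current first AU with dd so the starting state can always be restored. A sketch on a scratch file (the paths are illustrative; substitute the real /dev/mapper/dataNN device and your actual AU size):

```shell
DEV=/tmp/asmdisk_demo.img                             # stand-in for /dev/mapper/data01
dd if=/dev/zero of="$DEV" bs=1M count=2 2>/dev/null   # create the scratch "disk"

# Save AU 0 (1 MiB here) before letting kfed touch the device.
dd if="$DEV" of=/tmp/data01.au0.bak bs=1M count=1 2>/dev/null
stat -c '%s bytes backed up' /tmp/data01.au0.bak
```

With the backup in hand, a failed repair can be rolled back by writing the saved AU back with dd and conv=notrunc.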
Verifying the repair
grid@db002:/home/grid #kfed read /dev/mapper/data01
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483648 ; 0x008: disk=0
kfbh.check: 1497308808 ; 0x00c: 0x593f1e88
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISK ; 0x000: length=8
kfdhdb.driver.reserved[0]: 0 ; 0x008: 0x00000000
kfdhdb.driver.reserved[1]: 0 ; 0x00c: 0x00000000
kfdhdb.driver.reserved[2]: 0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000
kfdhdb.compat: 186646528 ; 0x020: 0x0b200000
kfdhdb.dsknum: 0 ; 0x024: 0x0000
kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname: DATA_0000 ; 0x028: length=9
kfdhdb.grpname: DATA ; 0x048: length=4
kfdhdb.fgname: DATA_0000 ; 0x068: length=9
kfdhdb.capname: ; 0x088: length=0
grid@db002:/home/grid #dd if=/dev/mapper/data03 bs=$(expr 1024 \* 1024) count=1|hexdump
0000000 8201 0101 0000 0000 0002 8000 7bb6 c8a5
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 524f 4c43 4944 4b53 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040 0000 0b20 0002 0301 4144 4154 305f 3030
0000050 0032 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000 0000 0000 4144 4154 0000 0000
0000070 0000 0000 0000 0000 0000 0000 0000 0000
0000080 0000 0000 0000 0000 4144 4154 305f 3030
0000090 0032 0000 0000 0000 0000 0000 0000 0000
00000a0 0000 0000 0000 0000 0000 0000 0000 0000
*
00000c0 0000 0000 0000 0000 e716 01f8 bc00 0c3c
00000d0 1170 01fa 7000 1c38 0200 1000 0000 0010
00000e0 bc80 0001 0000 0010 000b 0000 0001 0000
00000f0 0002 0000 0000 0000 0000 0000 0000 0000
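The repaired header is easy to recognize in the dump above once you remember that hexdump's default format prints 16-bit little-endian words, so byte pairs appear swapped: the ASM provision string ORCLDISK at offset 0x20 shows up as 524f 4c43 4944 4b53. A quick demonstration; hexdump -C prints the bytes in order with an ASCII column, which is often easier to read:

```shell
# Default format: two-byte little-endian words, so "OR" (4f 52) prints as 524f.
printf 'ORCLDISK' | hexdump
# Canonical format: single bytes in order plus ASCII.
printf 'ORCLDISK' | hexdump -C
```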
Mounting the DATA diskgroup
SQL> ALTER DISKGROUP DATA MOUNT;
ASMCMD> lsdg
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED EXTERN N 512 4096 1048576 0 0 0 0 0 N ARCH/
MOUNTED EXTERN N 512 4096 1048576 0 0 0 0 0 N BAK/
MOUNTED EXTERN N 512 4096 1048576 4194304 3004518 0 3004518 0 N DATA/
MOUNTED NORMAL N 512 4096 1048576 0 0 0 0 0 Y OCR/
ASMCMD>
Note:
Other diskgroups such as ARCH and BAK show as MOUNTED, but their Total_MB is 0. This is abnormal and risky, and is likewise due to ASM metadata corruption: any ASM file-level operation against them may trigger a dismount of the diskgroup at any time because of the corrupted metadata.
Why did deleting the DB instance with DBCA trigger this?
NS-00505: Operation timed out
 nt secondary err code: 110
 nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.99.165)(PORT=49824))
opidcl aborting process unknown ospid (103505) as a result of ORA-28
Thu Apr 11 15:32:03 2024
ALTER DATABASE DISABLE THREAD 1
Completed: ALTER DATABASE DISABLE THREAD 1
ALTER SYSTEM ARCHIVE LOG
ALTER DATABASE DROP LOGFILE GROUP 1
Thu Apr 11 15:32:03 2024
SUCCESS: diskgroup ARCH was dismounted
WARNING: Cannot delete Oracle managed file +DATA/anbob/onlinelog/group_1.257.1019860099
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_55172.trc:
ORA-01265: Unable to delete LOG +DATA/anbob/onlinelog/group_1.257.1019860099
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0003" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483651] [3] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483651] [3] [0 != 1]
Thu Apr 11 15:32:04 2024
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_39665.trc:
ORA-00345: redo log write error block 149399 count 2
ORA-00312: online log 6 thread 2: '+DATA/anbob/onlinelog/group_6.268.1019860427'
ORA-15079: ASM file is closed
ORA-15079: ASM file is closed
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_ora_55172.trc:
ORA-00202: control file: '+DATA/anbob/controlfile/current.256.1019860099'
ORA-15079: ASM file is closed
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Thu Apr 11 15:32:04 2024
KCF: read, write or open error, block=0x3307b4 online=1
Note:
When DBCA deletes a database instance, it recursively drops the thread's online redo logfiles from the DATA diskgroup. That is a file/metadata-level operation, which is why it failed and triggered the dismount. The log above also shows that ARCH was dismounted, so database archiving could not complete either. Indeed, after DATA was remounted the database started normally, but it crashed again a few minutes later because of the archiving problem: the instance goes down again on the next log switch. As a temporary workaround, the archive destination can first be changed to a filesystem location.
Analyzing the ARCH diskgroup
Thu Apr 11 18:10:30 2024
WARNING: cache read a corrupt block: group=1(ARCH) dsk=0 blk=1 disk=0 (ARCH_0000) incarn=3916228183 au=0 blk=1 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [1] [0 != 1]
NOTE: a corrupted block from group ARCH was dumped to /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc
WARNING: cache read (retry) a corrupt block: group=1(ARCH) dsk=0 blk=1 disk=0 (ARCH_0000) incarn=3916228183 au=0 blk=1 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [1] [0 != 1]
ERROR: cache failed to read group=1(ARCH) dsk=0 blk=1 from disk(s): 0(ARCH_0000)
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483648] [1] [0 != 1]
NOTE: cache initiating offline of disk 0 group ARCH
NOTE: process _user84719_+asm2 (84719) initiating offline of disk 0.3916228183 (ARCH_0000) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96ce657, mask = 0x6a, op = clear
Thu Apr 11 18:10:30 2024
GMON updating disk modes for group 1 at 35 for pid 34, osid 84719
ERROR: Disk 0 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Thu Apr 11 18:10:30 2024
NOTE: cache dismounting (not clean) group 1/0xAD3C16E3 (ARCH)
NOTE: messaging CKPT to quiesce pins Unix process pid: 110697, image: oracle@db002 (B000)
WARNING: Offline for disk ARCH_0000 in mode 0x7f failed.
Thu Apr 11 18:10:30 2024
NOTE: halting all I/Os to diskgroup 1 (ARCH)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_84719.trc (incident=217233):
ORA-15335: ASM metadata corruption detected in disk group 'ARCH'
ORA-15130: diskgroup "ARCH" is being dismounted

grid@db002:/u01/app/grid/diag/asm/+asm/+ASM2/trace #dd if=/dev/mapper/arch01 bs=4096 count=1000|hexdump
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00001c0 0001 feee ffff 0001 0000 ffff 7fff 0000
00001d0 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200 4645 2049 4150 5452 0000 0001 005c 0000

grid@db002:/u01/app/grid/diag/asm/+asm/+ASM2/trace #kfed read /dev/mapper/arch01
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000

#kfed repair /dev/mapper/arch01

#dd if=/dev/mapper/arch01 bs=4096 count=100|hexdump
0000000 8201 0101 0000 0000 0000 8000 6d8d 5037
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 524f 4c43 4944 4b53 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040 0000 0b20 0000 0301 5241 4843 305f 3030
0000050 0030 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000 0000 0000 5241 4843 0000 0000
0000070 0000 0000 0000 0000 0000 0000 0000 0000
0000080 0000 0000 0000 0000 5241 4843 305f 3030
0000090 0030 0000 0000 0000 0000 0000 0000 0000
00000a0 0000 0000 0000 0000 0000 0000 0000 0000
*
00000c0 0000 0000 0000 0000 e716 01f8 cc00 09a3
00000d0 104c 01fa b800 8d92 0200 1000 0000 0010

SQL> alter diskgroup arch mount;

Diskgroup altered.
grid@db002:/u01/app/grid/diag/asm/+asm/+ASM2/trace #asmcmd lsdg
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED EXTERN N 512 4096 1048576 1048576 599078 0 599078 0 N ARCH/
MOUNTED EXTERN N 512 4096 1048576 0 0 0 0 0 N BAK/
MOUNTED EXTERN N 512 4096 1048576 4194304 3004518 0 3004518 0 N DATA/
MOUNTED NORMAL N 512 4096 1048576 0 0 0 0 0 Y OCR/
Summary:
In this environment we could not determine what operation wiped the first 1 MB of some of the ASM disks. However, once a file operation was performed against the ASM diskgroups, ASM detected the metadata corruption, which in turn crashed the database instance; had no file operation been performed, the problem might never have surfaced here. The corruption must have happened in the window bounded by the last ASM restart, the last archive log switch, and the last file operation. At another customer we also ran into ASM metadata corruption, with AU 0 block 1 damaged; in that case it was only discovered when the ASM instance was restarted or the diskgroup was remounted. Routine health checks should therefore include a check of the ASM disk headers to catch this kind of problem before it bites.
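Such a routine check can be scripted. The sketch below uses the ORCLDISK provision string at offset 0x20 as a cheap sanity probe, demonstrated on scratch files; the file names and the check itself are illustrative assumptions. On a real system you would loop over the actual /dev/mapper/* device paths and prefer kfed read, which validates the complete header rather than just the magic string.

```shell
# Two scratch "disks": one healthy, one with a zeroed header.
dd if=/dev/zero of=/tmp/disk_ok.img  bs=1M count=1 2>/dev/null
dd if=/dev/zero of=/tmp/disk_bad.img bs=1M count=1 2>/dev/null
printf 'ORCLDISK' | dd of=/tmp/disk_ok.img bs=1 seek=32 conv=notrunc 2>/dev/null

check_header() {
  # The ASM provision string ORCLDISK lives at offset 0x20 of block 0.
  magic=$(dd if="$1" bs=1 skip=32 count=8 2>/dev/null | tr -d '\0')
  if [ "$magic" = "ORCLDISK" ]; then
    echo "$1: header looks intact"
  else
    echo "$1: header missing ORCLDISK magic - investigate with kfed"
  fi
}

for d in /tmp/disk_ok.img /tmp/disk_bad.img; do
  check_header "$d"
done
```

Run from cron against all candidate devices, a check like this would have flagged the zeroed headers long before any file operation tripped over them.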