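Troubleshooting Oracle CRS startup failure after an OS reboot: ASM device (udev) cannot be opened
Recently, at a customer site, CRS failed to start on one node of an Oracle 11g RAC after its operating system (Linux 7) rebooted. The other node (NODE2) had not rebooted and was still running normally for the time being. The GI log showed that the voting disk could not be read. dd against the sd device worked, but kfed read against the ASM disk failed. The ASM devices are mapped through udev, which points to udev as the source of the problem. This is a brief record of the issue.
Verifying the devices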
[grid@anbob1 ~]$ dd if=/dev/asmdiskc bs=1024 count=3
dd: failed to open '/dev/asmdiskc': No such device or address
[grid@anbob1 ~]$ dd if=/dev/sdc bs=1024 count=3
dd: failed to open '/dev/sdc': Permission denied
[grid@anbob1 ~]$ exit
logout
[root@anbob1 dev]# dd if=/dev/sdc bs=1024 count=3
▒▒▒_▒▒ORCLDISK OCR_VOTING_0001OCR_VOTINGOCR_VOTING_0001▒▒▒0▒:'▒▒8e▒▒( ▒▒▒▒: ;3+0 records in
3+0 records out
3072 bytes (3.1 kB) copied, 0.000535419 s, 5.7 MB/s
[root@anbob1 dev]# /usr/lib/udev/scsi_id -g -u -d /dev/sdc
3614eb08100e1c60a1b7a502f0000000c
[root@anbob1 dev]# grep 3614eb08100e1c60a1b7a502f0000000c /etc/udev/rules.d/*oracle*
/etc/udev/rules.d/97-oracle-asmdevices.rules:KERNEL=="sd*",ENV{DEVTYPE}=="disk",SUBSYSTEM=="block",PROGRAM=="/usr/lib/udev/scsi_id -g -u -d $devnode",RESULT=="3614eb08100e1c60a1b7a502f0000000c", RUN+="/bin/sh -c 'mknod /dev/asmdiskc b $major $minor; chown grid:asmadmin /dev/asmdiskc; chmod 0660 /dev/asmdiskc'"
[grid@anbob1 ~]$ kfed read /dev/asmdiskc
KFED-00303: unable to open file '/dev/asmdiskc'
[grid@anbob1 ~]$ dd if=/dev/asmdiskc bs=1024 count=3
dd: failed to open '/dev/asmdiskc': No such device or address
Note:
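At the OS level the sd device can be read (as root), but opening the ASM device node fails with "No such device or address". The ASM device nodes are created with mknod from the major and minor device numbers.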
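Since the node is created with mknod, a quick way to see whether it still points at the disk it was meant for is to compare device numbers. The following is only a sketch: the function names node_devnum and same_device are made up for illustration, and it assumes GNU stat (for the -c format option).

```shell
#!/bin/sh
# Print the decimal "major:minor" pair of a device node.
# GNU stat reports the numbers in hex: %t is the major, %T the minor.
node_devnum() {
    hex=$(stat -c '%t %T' "$1") || return 1
    major=${hex% *}
    minor=${hex#* }
    echo "$((0x$major)):$((0x$minor))"
}

# Succeed only if both nodes carry the same major:minor pair.
# Returns 2 if either node cannot be inspected.
same_device() {
    a=$(node_devnum "$1") || return 2
    b=$(node_devnum "$2") || return 2
    [ "$a" = "$b" ]
}
```

For example, `same_device /dev/asmdiskc /dev/sdc || echo "asmdiskc does not point at sdc"` would flag a node whose numbers no longer match the sd device that carries the expected WWID.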
The other, healthy node
[root@anbob2 ~]# su - grid
Last login: Mon Dec 25 09:16:10 CST 2023
[grid@anbob2 ~]$ dd if=/dev/sdaw bs=1024 count=2
dd: failed to open '/dev/sdaw': Permission denied
[grid@anbob2 ~]$ exit
logout
[root@anbob2 ~]# dd if=/dev/sdaw bs=1024 count=2
▒▒▒_▒▒ORCLDISK OCR_VOTING_0000OCR_VOTINGOCR_VOTING_0000▒▒▒0▒:'▒▒8e▒▒( ▒▒▒▒: ;2+0 records in
2+0 records out
2048 bytes (2.0 kB) copied, 0.000578212 s, 3.5 MB/s
Verifying udev on NODE2
[root@anbob2 ~]# /usr/lib/udev/scsi_id -g -u -d /dev/sdaw
3614eb08100e1c60a1b7a502f0000000b
[root@anbob2 ~]# grep 3614eb08100e1c60a1b7a502f0000000b /etc/udev/rules.d/97-oracle-asmdevices.rules
KERNEL=="sd*",ENV{DEVTYPE}=="disk",SUBSYSTEM=="block",PROGRAM=="/usr/lib/udev/scsi_id -g -u -d $devnode",RESULT=="3614eb08100e1c60a1b7a502f0000000b", RUN+="/bin/sh -c 'mknod /dev/asmdiskb b $major $minor; chown grid:asmadmin /dev/asmdiskb; chmod 0660 /dev/asmdiskb'"
[grid@anbob2 ~]$ kfed read /dev/asmdiskb
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483648 ; 0x008: disk=0
Comparing the ASM and block devices
[root@anbob1 dev]# ls -l /dev/asm*
brw-rw---- 1 grid asmadmin 68, 240 Dec 25 07:08 /dev/asmdiskc
brw-rw---- 1 grid asmadmin 69,  16 Dec 25 07:08 /dev/asmdiskd
brw-rw---- 1 grid asmadmin 69,  48 Dec 25 07:08 /dev/asmdiske
brw-rw---- 1 grid asmadmin 71,  80 Dec 25 07:08 /dev/asmdiskf
brw-rw---- 1 grid asmadmin 69,  96 Dec 25 07:08 /dev/asmdiskg
brw-rw---- 1 grid asmadmin 69, 128 Dec 25 07:08 /dev/asmdiskh
brw-rw---- 1 grid asmadmin 69, 160 Dec 25 07:08 /dev/asmdiski
brw-rw---- 1 grid asmadmin 69, 176 Dec 25 07:08 /dev/asmdiskj
brw-rw---- 1 grid asmadmin 69, 208 Dec 25 07:08 /dev/asmdiskk
brw-rw---- 1 grid asmadmin 66,  32 Dec 25 07:08 /dev/asmdiskl
brw-rw---- 1 grid asmadmin 70,  32 Dec 25 07:08 /dev/asmdiskm
brw-rw---- 1 grid asmadmin 70,  64 Dec 25 07:08 /dev/asmdiskn
brw-rw---- 1 grid asmadmin 70,  96 Dec 25 07:08 /dev/asmdisko
brw-rw---- 1 grid asmadmin 70, 128 Dec 25 07:08 /dev/asmdiskp
brw-rw---- 1 grid asmadmin 70, 160 Dec 25 07:08 /dev/asmdiskq
[root@anbob1 dev]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0   1.3T  0 disk
|-sda1            8:1    0 199.5M  0 part /boot/efi
|-sda2            8:2    0     1G  0 part /boot
`-sda3            8:3    0   1.3T  0 part
  |-centos-root 253:0    0   1.3T  0 lvm  /
  `-centos-swap 253:1    0     4G  0 lvm  [SWAP]
sdb               8:16   0    10G  0 disk
sdc               8:32   0    10G  0 disk
sdd               8:48   0    10G  0 disk
sde               8:64   0     1T  0 disk
sdf               8:80   0     1T  0 disk
sdg               8:96   0   2.9T  0 disk
sdh               8:112  0     2T  0 disk
sdi               8:128  0     2T  0 disk
sdj               8:144  0     2T  0 disk
sdk               8:160  0     2T  0 disk
sdl               8:176  0     2T  0 disk
sdm               8:192  0     2T  0 disk
sdn               8:208  0     2T  0 disk
sdo               8:224  0     2T  0 disk
sdp               8:240  0     2T  0 disk
sdq              65:0    0     2T  0 disk
Note:
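Note that after the reboot almost all of the sd devices have major number 8, whereas the /dev/asmdisk* nodes carry major numbers 66 to 71. The device numbers changed with the reboot, so the nodes created by mknod can no longer be read. Mapping ASM devices through udev by device number is not recommended, precisely because the numbers can change. Mapping by device name is not recommended either. For recommendations on udev mapping, see "How to create ASM devices with UDEV?".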
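One commonly used alternative is to let udev maintain a symlink and set ownership on the real node, instead of running mknod with $major and $minor. The sketch below only prints such a rule. The helper name asm_symlink_rule is hypothetical, the owner, group and mode are copied from the rules shown earlier, and this is not necessarily the form the referenced article recommends.

```shell
#!/bin/sh
# Print a udev rule that matches a disk by its scsi_id WWID and creates
# /dev/<name> as a symlink managed by udev (no mknod, no device numbers).
# usage: asm_symlink_rule WWID NAME
asm_symlink_rule() {
    wwid=$1
    name=$2
    printf 'KERNEL=="sd*", ENV{DEVTYPE}=="disk", SUBSYSTEM=="block", '
    # $devnode is left literal on purpose: udev substitutes it, not the shell.
    printf 'PROGRAM=="/usr/lib/udev/scsi_id -g -u -d $devnode", '
    printf 'RESULT=="%s", SYMLINK+="%s", ' "$wwid" "$name"
    printf 'OWNER="grid", GROUP="asmadmin", MODE="0660"\n'
}
```

Calling `asm_symlink_rule 3614eb08100e1c60a1b7a502f0000000c asmdiskc` prints one rule line that could be reviewed and then placed in a rules file. Because the symlink follows whatever kernel name the disk receives, a change of major or minor number after a reboot should not leave a stale node behind.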
Rebuilding the ASM udev devices
[root@anbob1 dev]# ls -l /dev/asm*
brw-rw---- 1 grid asmadmin  8,  16 Dec 25 10:16 /dev/asmdiskb
brw-rw---- 1 grid asmadmin 68, 240 Dec 25 07:08 /dev/asmdiskc
brw-rw---- 1 grid asmadmin 69,  16 Dec 25 07:08 /dev/asmdiskd
brw-rw---- 1 grid asmadmin 69,  48 Dec 25 07:08 /dev/asmdiske
...
[root@anbob1 dev]# rm /dev/asmdisk*
rm: remove block special file '/dev/asmdiskb'? n
rm: remove block special file '/dev/asmdiskc'? y
rm: remove block special file '/dev/asmdiskd'? y
rm: remove block special file '/dev/asmdiske'? y
...
[root@anbob1 dev]# /sbin/udevadm control --reload-rules
[root@anbob1 dev]# /sbin/udevadm trigger --type=devices --action=change
[root@anbob1 dev]# systemctl stop systemd-udevd
Warning: Stopping systemd-udevd.service, but it can still be activated by:
  systemd-udevd-kernel.socket
  systemd-udevd-control.socket
[root@anbob1 dev]# systemctl start systemd-udevd
[root@anbob1 dev]# ls -l /dev/asm*
brw-rw---- 1 grid asmadmin 8, 16 Dec 25 10:16 /dev/asmdiskb
brw-rw---- 1 grid asmadmin 8, 32 Dec 25 10:26 /dev/asmdiskc
brw-rw---- 1 grid asmadmin 8, 48 Dec 25 10:26 /dev/asmdiskd
brw-rw---- 1 grid asmadmin 8, 64 Dec 25 10:26 /dev/asmdiske
...
[root@anbob1 dev]# su - grid
Last login: Mon Dec 25 10:19:48 CST 2023 on pts/1
[grid@anbob1 ~]$ kfed read /dev/asmdiskq |head
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 2147483657 ; 0x008: disk=9
kfbh.check: 3281947134 ; 0x00c: 0xc39e89fe
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
Note:
After the ASM devices were rebuilt, everything returned to normal.
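Rather than answering the rm prompts one by one, the stale nodes could be listed first. This is a rough sketch under assumptions: stale_nodes and the SYS_BLOCK variable are names invented here, GNU stat is required, and it relies on the kernel exposing every existing block device as an entry named major:minor under /sys/dev/block. It only prints candidates and deletes nothing.

```shell
#!/bin/sh
# Directory where the kernel lists existing block devices as "major:minor".
# Overridable so the function can be exercised against a test directory.
SYS_BLOCK=${SYS_BLOCK:-/sys/dev/block}

# Print the decimal "major:minor" of a device node (GNU stat gives hex).
node_devnum() {
    hex=$(stat -c '%t %T' "$1") || return 1
    echo "$((0x${hex% *})):$((0x${hex#* }))"
}

# Print every given node whose device number has no entry under $SYS_BLOCK.
# The caller is expected to pass block special files only.
stale_nodes() {
    for node in "$@"; do
        num=$(node_devnum "$node") || continue
        [ -e "$SYS_BLOCK/$num" ] || echo "$node ($num)"
    done
}
```

Running `stale_nodes /dev/asmdisk*` as root would list the nodes that are candidates for removal. After removing them, the udevadm control --reload-rules and udevadm trigger commands from the session above recreate the nodes.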
Why did the OS reboot?
# gi alert log
2023-12-25 06:57:48.553: [cssd(156463)]CRS-1612:Network communication with node anbob1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.880 seconds
2023-12-25 06:57:56.555: [cssd(156463)]CRS-1611:Network communication with node anbob1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.870 seconds
2023-12-25 06:58:00.556: [cssd(156463)]CRS-1610:Network communication with node anbob1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.870 seconds
2023-12-25 06:58:03.429: [cssd(156463)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/app/11.2.0/grid/log/anbob2/cssd/ocssd.log.
# OS message
Dec 25 06:57:41 anbob1 kernel: mce: Uncorrected hardware memory error in user-access at 779dfd1200
Dec 25 06:57:42 anbob1 kernel: Memory failure: 0x779dfd1: already hardware poisoned
Dec 25 06:57:42 anbob1 kernel: reserve_ram_pages_type failed [mem 0x779dfd1000-0x779dfd1fff], track 0x2, req 0x2
Dec 25 06:57:43 anbob1 kernel: Could not invalidate pfn=0x779dfd1 from 1:1 map
Dec 25 06:57:43 anbob1 kernel: mce: Uncorrected hardware memory error in user-access at 779dfd1200
Dec 25 06:57:44 anbob1 kernel: Memory failure: 0x779dfd1: already hardware poisoned
Dec 25 06:57:44 anbob1 kernel: reserve_ram_pages_type failed [mem 0x779dfd1000-0x779dfd1fff], track 0x2, req 0x2
Dec 25 06:57:44 anbob1 kernel: Could not invalidate pfn=0x779dfd1 from 1:1 map
Dec 25 06:57:45 anbob1 kernel: mce: Uncorrected hardware memory error in user-access at 779dfd1200
Dec 25 06:57:45 anbob1 kernel: Memory failure: 0x779dfd1: already hardware poisoned
Dec 25 06:57:46 anbob1 kernel: reserve_ram_pages_type failed [mem 0x779dfd1000-0x779dfd1fff], track 0x2, req 0x2
Dec 25 06:57:47 anbob1 kernel: Could not invalidate pfn=0x779dfd1 from 1:1 map
Dec 25 06:57:47 anbob1 kernel: mce: Uncorrected hardware memory error in user-access at 779dfd1200
Dec 25 06:57:48 anbob1 kernel: Memory failure: 0x779dfd1: already hardware poisoned
Dec 25 06:57:48 anbob1 kernel: reserve_ram_pages_type failed [mem 0x779dfd1000-0x779dfd1fff], track 0x2, req 0x2
Dec 25 06:57:48 anbob1 kernel: Could not invalidate pfn=0x779dfd1 from 1:1 map
Dec 25 06:57:48 anbob1 kernel: mce: Uncorrected hardware memory error in user-access at 779dfd1200
Dec 25 06:57:48 anbob1 kernel: Memory failure: 0x779dfd1: already hardware poisoned
Dec 25 06:57:49 anbob1 kernel: reserve_ram_pages_type failed [mem 0x779dfd1000-0x779dfd1fff], track 0x2, req 0x2
Note:
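It looks as if the OS detected memory access errors (uncorrected hardware memory errors), which led to an OS panic. A hardware analysis is recommended.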
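https://access.redhat.com/solutions/67599 The following list of possible root causes is not exhaustive, but probably covers most cases:
Faulty memory DIMM.
Faulty memory controller (usually on-board).
Faulty memory traces on the motherboard.
Faulty BIOS.
Overheating system.
Latent junction failure in RAM (electrostatic discharge caused by the user).
Power problems or a short circuit.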
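Before handing the case to the hardware vendor, it can help to summarise how often each physical address shows up in the kernel MCE messages. A small sketch, assuming the message format seen in the OS log above (address as the last field). The function name mce_summary is made up for this note.

```shell
#!/bin/sh
# Count "Uncorrected hardware memory error" kernel messages per address.
# Reads the files given as arguments, or standard input if none are given.
# Output: one "<count> <address>" line per distinct address.
mce_summary() {
    awk '/mce: Uncorrected hardware memory error/ { count[$NF]++ }
         END { for (addr in count) print count[addr], addr }' "$@"
}
```

A call such as `mce_summary /var/log/messages` (the log path is an assumption, adjust it to wherever the kernel messages are stored) shows at a glance whether all errors hit the same address, as they do in this case, which would point at a single page or DIMM.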
https://access.redhat.com/solutions/6412581
System restarts when memory uncorrectable error injection is performed in memory MCA mode. In the operating system, run the following commands:
# mcelog --daemon
# modprobe einj param_extension=1
# mount -t debugfs none /sys/kernel/debug
# git clone https://github.com/andikleen/mce-inject.git
# cd mce-inject; make
# ./mce-inject test/uncorrected
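SCAN IP access problem
Connecting through the SCAN IP worked from the DB server itself, but remote connections failed with an error saying that no valid service name was available. Check that local_listener and remote_listener are not empty and are configured with the defaults, as follows: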
SQL> show parameter listener

NAME                                 TYPE                              VALUE
------------------------------------ --------------------------------- ------------------------------
listener_networks                    string
local_listener                       string                            (ADDRESS=(PROTOCOL=TCP)(HOST=192.xxVIPxx)(PORT=1521))
remote_listener                      string                            anbob-scan:1521
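It also turned out that after a host reboot, when CRS starts the instance automatically, it may use the spfile configured in CRS and overwrite the contents of the local $ORACLE_HOME/dbs/initXXX.ora. If an spfile was configured by hand earlier and it is not the same spfile as the one registered in CRS, the DB instance may be started with the wrong SPFILE.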
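To catch this before the next reboot, the SPFILE entry in the local init file can be compared with the one registered in CRS. This is a sketch under assumptions: the function names are invented, the database name orcl in the usage line is a placeholder, and it assumes that srvctl config database prints a line starting with "Spfile:" (as 11.2 does, to my knowledge).

```shell
#!/bin/sh
# Print the SPFILE value from a local init<SID>.ora pointer file, unquoted.
pfile_spfile() {
    sed -n 's/^[[:space:]]*[Ss][Pp][Ff][Ii][Ll][Ee][[:space:]]*=[[:space:]]*//p' "$1" |
        tr -d "'\"" | head -n 1
}

# Read "srvctl config database" output on stdin and compare its Spfile line
# with the local pointer file given as $1. Exact string comparison only, so
# a difference in letter case is reported as a mismatch.
check_spfile() {
    local_sp=$(pfile_spfile "$1")
    crs_sp=$(sed -n 's/^Spfile:[[:space:]]*//p' | head -n 1)
    if [ "$local_sp" = "$crs_sp" ]; then
        echo "match: $crs_sp"
    else
        echo "MISMATCH: pfile=$local_sp crs=$crs_sp"
        return 1
    fi
}
```

Usage would look like `srvctl config database -d orcl | check_spfile $ORACLE_HOME/dbs/initXXX.ora`. A mismatch means the two locations should be reconciled before relying on automatic startup.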
— over —