首页 » ORACLE 9i-23ai, 系统相关 » Troubleshooting ASM Instance start failed with ORA-00445 on AIX

Troubleshooting ASM Instance start failed with ORA-00445 on AIX

某客户的一套ORACLE RAC 2-nodes   on AIX,  年迈的大块头25G memory, 平时反馈就非常慢, 因存储原因重启,修复后启动实例2无法启动,而实例1启动正常,无网络问题,,检查启动日志在ASM 实例正在starting后报错, 手动启动ASM 超时报错,ORA-00445: background process “LMD0” did not start after 120 seconds ,这是一个比较古老且常见的问题,可能还会有其它现象如lsnrctl status看监听状态无响应, sqlplus 远程过监听无法链接.

ASM alert log

PING started with pid=7, OS id=270358 
Tue Jul 12 17:09:23 2022
DIA0 started with pid=8, OS id=241918 
Tue Jul 12 17:09:23 2022
LMON started with pid=9, OS id=282632 
Tue Jul 12 17:09:23 2022
SKGXP:[110564060.1]{-}: WARNING: Failed to set buffer limit on IPC interconnect socket
SKGXP:[110564060.1]{-}: Oracle requires that the socket receive buffer size be tunable up to 2048 KB.
Please make sure the kernel parameter which limits the receive socket space set by 
applications (i.e. SO_RCVBUF) is at least that value.
Tue Jul 12 17:11:23 2022
Errors in file /grid/app/diag/asm/+asm/+ASM4/trace/+ASM4_ora_266562.trc  (incident=38253):
ORA-00445: background process "LMD0" did not start after 120 seconds
Incident details in: /grid/app/diag/asm/+asm/+ASM4/incident/incdir_38253/+ASM4_ora_266562_i38253.trc
Tue Jul 12 17:11:26 2022
Dumping diagnostic data in directory=[cdmp_20220712171126], requested by (instance=4, osid=266562), summary=[incident=38253].
USER (ospid: 266562): terminating the instance due to error 445
Instance terminated by USER, pid = 266562

Note:
看到2个报错,1个是socket receive buffer 另1个是ora-445

WARNING: Failed to set buffer limit on IPC interconnect socket
原因是当前OS 内核参数的net socket receive buffer配置过低, db 软件把OS报错提示那么明显,还是比较可贵

The commands required to do this depends on the Operating System.
For example, as root:
      On Linux execute   # sysctl -w net.core.rmem_max=2097152
      On Solaris execute # ndd -set /dev/udp udp_max_buf 2097152
      On AIX execute     # no -o sb_max=4194304   (note: AIX only permits sizes of 1048576, 4194304 or 8388608)

但这并非问题关键,调整后重启ASM依旧提示ora-445. LMD超时检查集群间网络,发现在ping ip可以秒回,但是Ping 主机名时要等几分钟才会返回。刚才的sqlplus / as sysasm同样存在等待几分钟才有输出的现象,看来并非是机器慢问题。开启truss跟踪

Linux strace sample
/usr/bin/strace -ftT -o /tmp/strace.out  “command”

AIX truss sample
truss -failed -o /tmp/truss.out -p pid

或开启sqlnet client event trace

在客户端$ORACLE_HOME/network/admin/sqlnet.ora配置文件中加入以下记录:

trace_level_client=16
trace_directory_client=/u01/app/product/11.2.0/client_1/network/trace
trace_unique_client=on
trace_timestamp_client=on
diag_adr_enabled=off

再次尝试登陆就会触发Oracle Network Trace文件在$trace_directory_client指定的目录下产生。从trace中会发现“snlinGetAddrInfo:getaddrinfo() failed with error -2” 错误。

 

sqlplus的truss日志未保留,下面是ping的truss日志, 不过能看到相同的时间消耗调用

258372: _getpid()					= 258372
258372: kopen("/etc/resolv.conf", O_RDONLY)		= 3
258372: kioctl(3, 22528, 0x00000000, 0x00000000)	Err#25 ENOTTY
258372: kioctl(3, 22528, 0x00000000, 0x00000000)	Err#25 ENOTTY
258372: kread(3, " n a m e s e r v e r   1".., 4096)	= 48
258372: kread(3, " n a m e s e r v e r   1".., 4096)	= 0
258372: statx("/etc/resolv.conf", 0x2FF216D8, 76, 0)	= 0
258372: close(3)					= 0
258372: gethostname(0x2FF2174C, 1024)			= 0
258372: access("/usr/lib/nls/msg/en_US/libcnet.cat", 0)	= 0
258372: _getpid()					= 258372
258372: socket(1, 1, 0)					= 3
258372: kfcntl(3, F_SETFD, 0x00000001)			= 0
258372: connext(3, 0x2FF211E0, 1025)			Err#2  ENOENT
258372: close(3)					= 0
258372: kopen("/etc/netsvc.conf", O_RDONLY)		= 3
258372: kioctl(3, 22528, 0x00000000, 0x00000000)	Err#25 ENOTTY
258372: kioctl(3, 22528, 0x00000000, 0x00000000)	Err#25 ENOTTY
258372: kread(3, " #   @ ( # ) 4 3        ".., 4096)	= 4096
258372: kread(3, " o n   a n d   r e s o l".., 4096)	= 620
258372: kread(3, " o n   a n d   r e s o l".., 4096)	= 0
258372: close(3)					= 0
258372: kopen("/etc/irs.conf", O_RDONLY)		Err#2  ENOENT
258372: _thread_self()					= 1048949
258372: getdomainname(0xF06C99D8, 1024)			= 0
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: __libc_sbrk(0x00000000)				= 0x20031EE0
258372: _thread_self()					= 1048949
258372: getdomainname(0xF06C99D8, 1024)			= 0
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: kopen("/etc/hesiod.conf", O_RDONLY)		Err#2  ENOENT
258372: _thread_self()					= 1048949
258372: getdomainname(0xF06C99D8, 1024)			= 0
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: getdomainname(0xF06C99D8, 1024)			= 0
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: getdomainname(0xF06C99D8, 1024)			= 0
258372: _thread_self()					= 1048949
258372: _thread_self()					= 1048949
258372: __libc_sbrk(0x00000000)				= 0x20041EF0
258372: socket(2, 2, 0)					= 3
258372: getsockopt(3, 65535, 4104, 0x2FF203C4, 0x2FF203C0) = 0
258372: connext(3, 0xF0637910, 16)			= 0
258372: _esend(3, 0x2FF211C0, 22, 0, 0x00000000)	= 22
258372: _poll(0x2FF20450, 1, 5000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 5000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637920, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 5000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 5000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637910, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 5000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 5000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637920, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 5000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 5000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637910, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 10000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 10000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637920, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 10000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 10000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637910, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 20000)	(sleeping...)
258372: _poll(0x2FF20450, 1, 20000)			= 0
258372: close(3)					= 0
258372: socket(2, 2, 0)					= 3
258372: _esendto(3, 0x2FF211C0, 22, 0, 0xF0637920, 16, 0x00000000) = 22
258372: _poll(0x2FF20450, 1, 20000)	(sleeping...)

现在相信都猜到接下来排查DNS. 发现确实只有问题节点配置了DNS服务器地址(/etc/resolv.conf), 解析顺序文件/etc/netsvc.conf无内容。

AIX中域名解析通常使用
1.BIND/DNS (domain name server, named)
2.Network Information Service (NIS)
3. The local /etc/hosts file
默认情况下,这些解析进程首先尝试使用 BIND/DNS 进行解析。如果 /etc/resolv.conf 文件不存在或 BIND/DNS 找不到条目,​​则查询 NIS 是否正在运行。如果 NIS 未运行,则搜索本地 /etc/hosts 文件。可以通过创建配置文件 /etc/netsvc.conf 并指定所需的顺序来覆盖默认顺序, 如hosts=local,bind4 也可以使用NSORDER=local,bind4 方式OS 环境变量更高的优先级。

原因
可能因为域名解析存在一些延迟,影响了db 登录和机器名到IP的解析时间。

解决方法
修改/etc/netsvc.conf 文件中的解析顺序
hosts = local, bind4
或去掉/etc/resolv.conf的DNS地址

为了保证实例间配置一致,把问题节点/etc/resolv.conf的DNS服务器地址注释, ping和sqlplus 恢复了快速响应, CRS启动正常。

打赏

对不起,这篇文章暂时关闭评论。