Troubleshooting ORA-00600: internal error code, arguments: [kcbbxsv_nwp] and opatch fail one-off patch
几年前在某个客户查看Oracle数据库 RAC补丁时发现节点间存在不一致的现象,当时在这篇《Using ‘opatch lsinventory’ show patched is real? (看到的补丁信息真的靠谱么?) 》记录过, 最近又遇到了一片“沼泽地”环境,运维人的窘境是无法要求建设阶段如何规范化,什么样的环境都要接,即使到处是坑。这里再分享一例有个ora-600错误引起的一系列问题。
–环境ORACLE RAC 19.11 ON RHEL 7.4.(没看错7.4,虽然官方要求是7.5+).
这环境据说是几日前出现过logfile switch checkpoint incomplete问题,当时发现数据库配置每个thread只有2组redo log. ok 太“节俭”了,后来增加到了4组, 这次故障发生前是数据库sqlplus / as sysdba夯,但没有记录hanganalye信息,没有ASH, OSW, 从db alert log中平时段在半小时左右会有log file switch而故障实例3小时未switch,并提示过一次checkpoint incompleted。2小时左右后dbwr 抛错crash了实例1,CRS自动拉起后恢复正常,期间node2 gc event也明显增加受到了影响,在实例1crash后,实例2恢复正常。猜测实例1crash前应该是长时间未checkpoint,所有online redo 都active无法复用。
DB ALERT LOG
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_dbw0_70127.trc (incident=921953) (PDBNAME=CDB$ROOT): ORA-00600: internal error code, arguments: [kcbbxsv_nwp], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_921953/anbob1_dbw0_70127_i921953.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. 2022-11-17T15:06:56.316462+08:00 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_dbw0_70127.trc: ORA-00600: internal error code, arguments: [kcbbxsv_nwp], [], [], [], [], [], [], [], [], [], [], [] Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_dbw0_70127.trc (incident=921954) (PDBNAME=CDB$ROOT): ORA-471 [] [] [] [] [] [] [] [] [] [] [] [] Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_921954/anbob1_dbw0_70127_i921954.trc 2022-11-17T15:06:56.372478+08:00 Dumping diagnostic data in directory=[cdmp_20221117150656], requested by (instance=1, osid=70127 (DBW0)), summary=[incident=921953]. 2022-11-17T15:06:57.575063+08:00 USER (ospid: 70127): terminating the instance due to ORA error 471 2022-11-17T15:06:57.601250+08:00 TYJB(4):opiodr aborting process unknown ospid (122215) as a result of ORA-1092 2022-11-17T15:06:57.601832+08:00 TYJB(4):Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_122215.trc: 2022-11-17T15:07:03.672394+08:00 Instance terminated by USER, pid = 70127 2022-11-17T15:07:04.132714+08:00 Warning: 2 processes are still attacheded to shmid 3964937: (size: 49152 bytes, creator pid: 69713, last attach/detach pid: 70076) 2022-11-17T15:07:04.880460+08:00 USER(prelim) (ospid: 114832): terminating the instance
anbob1_dbw0_70127_i921954.trc trace file
----- Abridged Call Stack Trace -----
ksedsts()+426<-kjzduptcctx()+724<-kjzdpcrshnfy()+353<-dbkedKstDump()+27<-dbgdaExecuteAction()+354<-dbgerRunAction()+83
<-dbgerRunActions()+1352<-dbgexPhaseII()+549<-dbgexExplicitEndInc()+285<-dbgeEndDDEInvocationImpl()+314<-kcbbxsv()+17643
<-kcbb_coalesce_int()+326<-kcbb_coalesce()+438<-kcbbwthc()+820<-kcbbdrv()+8918<-ksb_act_run_int()+117<-ksb_act_run()+130
<-ksbcti()+1871<-ksbabs()+2363<-ksbrdp()+1167
----- End of Abridged Call Stack Trace -----
根据CALL stack在MOS可以匹配Bug 31711720,直到19.13 RU修复。
Description
DBWR process might encounter ORA-600 [kcbbxsv_nwp] which will
lead to instance crash.
Call stack might include:
... kcbbxsv kcbb_coalesce_int kcbb_coalesce kcbbwdl ...
Rediscovery Notes
If DBW process fails with ORA-600 [kcbbxsv_nwp], this bug might
be hit.
其于这个问题建议
1,安装one-off patch 31711720
2, 优化checkpoint参数,如MTTR
安装31711720
安装oneoff patch比较简单的操作,有停机时间,选择了offline安装,那就很简单opatch apply就可以,但这里又一个“坑”。前期检查opatch lsinv输出正常。但是在sqlplus登陆两个节点发现SQL*Plus banner输出不同。node2提示是19.3.
安装补丁会发生什么?
节点1安装patch 顺利,但节点2应用报错,如下
[oracle@rac2 31711720]$ /u01/app/oracle/product/19.0.0/dbhome_1/OPatch/opatch apply
Oracle Interim Patch Installer version 12.2.0.1.25
Copyright (c) 2022, Oracle Corporation. All rights reserved.
Oracle Home : /u01/app/oracle/product/19.0.0/dbhome_1
Central Inventory : /u01/app/oraInventory
from : /u01/app/oracle/product/19.0.0/dbhome_1/oraInst.loc
OPatch version : 12.2.0.1.25
OUI version : 12.2.0.7.0
Log file location : /u01/app/oracle/product/19.0.0/dbhome_1/cfgtoollogs/opatch/opatch2022-11-18_19-41-01PM_1.log
...
Is the local system ready for patching? [y|n]
y
User Responded with: Y
Backing up files...
Applying interim patch '31711720' to OH '/u01/app/oracle/product/19.0.0/dbhome_1'
Patching component oracle.rdbms, 19.0.0.0.0...
Make failed to invoke "/usr/bin/make -f ins_rdbms.mk ioracle ORACLE_HOME=/u01/app/oracle/product/19.0.0/dbhome_1 OPATCH_SESSION=apply"....'/u01/app/oracle/product/19.0.0/dbhome_1/lib//libserver19.a(ksl.o): In function `kslges':
ksl.c:(.text+0x1fdfd): undefined reference to `sslss_async_call_cnt_'
ksl.c:(.text+0x1fe0f): undefined reference to `sslss_sig_pending_'
ksl.c:(.text+0x1fecb): undefined reference to `sslss_async_call_cnt_'
ksl.c:(.text+0x1fed7): undefined reference to `sslss_sigset_raise_signals'
ksl.c:(.text+0x1feeb): undefined reference to `sslss_async_call_cnt_'
/u01/app/oracle/product/19.0.0/dbhome_1/lib//libserver19.a(ksmmga.o): In function `ksm_mga_sctx_rcv_do':
ksmmga.c:(.text+0x361f): undefined reference to `sslss_async_call_cnt_'
ksmmga.c:(.text+0x3672): undefined reference to `sslss_async_call_cnt_'
ksmmga.c:(.text+0x3680): undefined reference to `sslss_async_call_cnt_'
ksmmga.c:(.text+0x368c): undefined reference to `sslss_sig_pending_'
ksmmga.c:(.text+0x375b): undefined reference to `sslss_sigset_raise_signals'
/u01/app/oracle/product/19.0.0/dbhome_1/lib//libserver19.a(ksmmga.o): In function `ksm_mga_detach':
...
Note:
对比了libserver19.a文件一致,即使从node1复制过来依旧有问题。
lib 节点间不同
$ cd /u01/app/oracle/product/19.0.0/dbhome_1/ $ ls -l lib/*|sort -k 9|awk '{print $1" "$3" "$5" "$9}'>node1lib $ ls -l lib/*|sort -k 9|awk '{print $1" "$3" "$5" "$9}'>node2lib [root@oel7db1 ~]# diff node1lib node2lib 1d0 < [oracle@rac1 34,36c33,35 < -rw-r--r-- oracle 39603 classgen.jar < -rw-r--r-- oracle 6567902 clntsh.map < -rw-r--r-- oracle 1818540 clntshcore.map --- > -rw-r----- oracle 39293 classgen.jar > -rw-r--r-- oracle 6519050 clntsh.map > -rw-r--r-- oracle 1818066 clntshcore.map 52c51 < -rwxr-xr-x oracle 1676112 libagtsh.so.1.0 --- > -rwxr-xr-x oracle 1674632 libagtsh.so.1.0 54c53 < -rw-r--r-- oracle 8677152 libanllexer19.so --- > -rw-r--r-- oracle 8676584 libanllexer19.so 57c56 ...
Note:
如果可以把整合lib目录全从好的节点复制一份到报错节点,但前面注意到了sqlplus banner,有必要考虑下这套软件环境是不是之前安装的就有问题,也对比了OS 的rpm lib安装一致。
检查补丁信息
\u01\app\oracle\product\19.0.0\dbhome_1\.opatchauto_storage\system_patches
-- node1
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<system-patch-information>
<system-patch id="29547676">
<sub-patch>
<id>29517242</id>
<time>2019/04/18 07:21:20</time>
</sub-patch>
<sub-patch>
<id>29585399</id>
<time>2019/04/18 07:21:37</time>
</sub-patch>
</system-patch>
<system-patch id="32545008">
<sub-patch>
<id>32545013</id> ------19.11 RU
<time>2022/10/24 14:45:53</time>
</sub-patch>
<sub-patch>
<id>32579761</id>
<time>2022/10/24 14:46:38</time>
</sub-patch>
</system-patch>
</system-patch-information>
-- node2
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<system-patch-information>
<system-patch id="29547676">
<sub-patch>
<id>29517242</id>
<time>2019/04/18 07:21:20</time>
</sub-patch>
<sub-patch>
<id>29585399</id>
<time>2019/04/18 07:21:37</time>
</sub-patch>
</system-patch>
<system-patch id="32545008">
<sub-patch>
<id>32579761</id>
<time>2022/10/24 15:35:07</time>
</sub-patch>
</system-patch>
</system-patch-information>
Note:
node2缺少32545013 RU信息, 开始我提到之前的blog提到过,如果怀疑软件是否一致,可以查看一下oracle binary执行文件是不是大小一致, 节点2确认是不一致的(注意要和节点1apply补丁前对比),所以后期打补丁时不防检查一下节点间oracle执行文件是否大小一致。 如果软件有bug,这里可以尝试rollback 节点的RU 重新安装试试,节点1软件是正常的,这里使用另一方案,delete node /add node重新初始化软件。
-- delete node2 # dbca -silent -deleteInstance -nodeList "rac2" -gdbName "anbob" -instanceName "anbob2" -sysDBAUserName "sys" -sysDBAPassword xxxxxxx [grid@rac1 ~]$ srvctl disable listener -listener LISTENER -node rac2 [grid@rac1 ~]$ srvctl stop listener -listener LISTENER -node rac2 [oracle@rac2 bin]$ ./runInstaller -updateNodeList ORACLE_HOME =/u01/app/oracle/product/19.0.0/dbhome_1 "CLUSTER_NODES={rac2}" -local [oracle@rac2 bin]$ cd $ORACLE_HOME/deinstall [oracle@rac2 deinstall]$ ./deinstall -local [oracle@rac1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME =/u01/app/oracle/product/19.0.0/dbhome_1 "CLUSTER_NODES={rac1}" [grid@rac2 bin]$ ./runInstaller -updateNodeList ORACLE_HOME =/u01/app/19.0.0/grid "CLUSTER_NODES={rac2}" CRS=TRUE -silent -local [grid@rac2 ContentsXML]$ cd $ORACLE_HOME/deinstall [grid@rac2 deinstall]$ ./deinstall -local [root@rac2 ~]# /u01/app/19.0.0/grid/crs/install/rootcrs.sh -force -deconfig -paramfile "/tmp/deinstall2022-11-18_11-33-32PM/response/deinstall_OraGI19Home1.rsp" [grid@rac1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/19.0.0/grid "CLUSTER_NODES={rac1}" CRS=TRUE -silent -local [root@rac1 bin]# ./crsctl delete node -n rac2 CRS-4661: Node rac2 successfully deleted. [root@rac1 bin]# ./olsnodes -s -t rac1 Active Unpinned -- add node2 [grid@rac1 ~]$ cluvfy stage -pre nodeadd -n rac2 -verbose -fixup [grid@rac1 ~]$ cd /u01/app/19.0.0/grid/addnode/ [grid@rac1 ~]$export IGNORE_PREADDNODE_CHECKS=Y [grid@rac1 addnode]$ ./addnode.sh -silent -ignorePrereq "CLUSTER_NEW_NODES={rac2}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={rac2-vip}" "CLUSTER_NEW_NODE_ROLES={hub}" [root@rac2 ~]# /u01/app/19.0.0/grid/root.sh [oracle@rac1 ~]$ cd /u01/app/oracle/product/19.0.0/dbhome_1/addnode/ [oracle@rac1 addnode]$ ./addnode.sh -silent -ignorePrereq "CLUSTER_NEW_NODES={rac2}" [root@rac2 ~]# /u01/app/oracle/product/19.0.0/dbhome_1/root.sh [oracle@rac1 ~]$ dbca -silent -addInstance -gdbName "anbob" -nodeName "rac2" -instanceName "anbob2" -sysDBAUserName "sys" -sysDBAPassword xxxxxxx Prepare for db operation 40% complete Adding instance 48% complete 52% complete 56% complete 60% complete 64% complete 68% complete 72% complete 80% complete Completing instance management. 86% complete 100% complete [FATAL] PRCR-1013 : Failed to start resource ora.tyjbcdb.db PRCR-1064 : Failed to start resource ora.tyjbcdb.db on node rac2 CRS-5017: The resource action "ora.anbob.db start" encountered the following error: ORA-12547: TNS:lost contact . For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/rac2/crs/trace/crsd_oraagent_oracle.trc". ORA-12547: TNS:lost contact CRS-2674: Start of 'ora.anbob.db' on 'rac2' failed Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/anbob/anbob.log" for further details.
NOTE:
在启实例阶段失败了。ora-12547
检查oracle binary权限
[oracle@rac1 bin]$ ll oracle -rwsr-s--x 1 oracle asmadmin 448711384 Nov 18 23:00 oracle [oracle@rac2 bin]$ ll oracle -rwsr-s--x 1 oracle asmadmin 448711384 Nov 19 00:12 oracle
strace跟踪
[oracle@rac2 ~]$ strace -f -o /tmp/trace.1.log sqlplus / as sysdba SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 19 00:41:19 2022 Version 19.11.0.0.0 Copyright (c) 1982, 2020, Oracle. All rights reserved. ERROR: ORA-12547: TNS:lost contact
分析 trace文件
检查“exited”
close(6) = 0 <0.000006> close(4) = 0 <0.000006> close(5) = 0 <0.000011> close(3) = 0 <0.000006> close(5) = -1 EBADF (Bad file descriptor) <0.000005> sigaltstack(NULL, {ss_sp=0x7f6df2186000, ss_flags=0, ss_size=1048576}) = 0 <0.000005> sigaltstack({ss_sp=0x7f6df2186000, ss_flags=SS_DISABLE, ss_size=1048576}, NULL) = 0 <0.000005> exit_group(0) = ? +++ exited with 0 +++
继续向上找file句柄 5
open("/u01/app/oracle/product/19.0.0/dbhome_2/rdbms/log/tyjbcdb2_ora_439527.trc", O_WRONLY|O_CREAT|O_EXCL, 0660) = 5 <0.000047> lstat("/u01/app/oracle/product/19.0.0/dbhome_2/rdbms/log/tyjbcdb2_ora_439527.trc", {st_mode=S_IFREG|0640, st_size=0, ...}) = 0 <0.000006> fcntl(5, F_SETFD, FD_CLOEXEC) = 0 <0.000005> write(5, "Dump file ", 10) = 10 <0.000014> write(5, "/u01/app/oracle/product/19.0.0/d"..., 73) = 73 <0.000007> write(5, "\n", 1) = 1 <0.000007> write(5, "Build label: RDBMS_19.11.0.0."..., 54) = 54 <0.000007> write(5, "Oracle Database 19c Enterprise E"..., 90) = 90 <0.000007> write(5, "\n", 1) = 1 <0.000007> write(5, "Build label: RDBMS_19.11.0.0."..., 54) = 54 <0.000007> write(5, "\n", 1) = 1 <0.000014> access("/u01/app/oracle/common/oracle/sql/ORACLE_HOME: /u01/app/oracle/product/19.0.0/dbhome_2\n", F_OK) = -1 ENOENT (No such file or directory) <0.000007> access("./ORACLE_HOME: /u01/app/oracle/product/19.0.0/dbhome_2\n", F_OK) = -1 ENOENT (No such file or directory) <0.000007> access("/u01/app/oracle/product/19.0.0/dbhome_1/rdbms/admin/ORACLE_HOME: /u01/app/oracle/product/19.0.0/dbhome_2\n", F_OK) = -1 ENOENT (No such file or directory) <0.000007> write(5, "ORACLE_HOME: /u01/app/oracle/"..., 56) = 56 <0.000007> uname({sysname="Linux", nodename="rac2", ...}) = 0 <0.000006> write(5, "System name:\tLinux\n", 19) = 19 <0.000007> write(5, "Node name:\trac2\n", 16) = 16 <0.000007> write(5, "Release:\t3.10.0-693.el7.x86_64\n", 31) = 31 <0.000006> write(5, "Version:\t#1 SMP Thu Jul 6 19:56:"..., 44) = 44 <0.000007> write(5, "Machine:\tx86_64\n", 16) = 16 <0.000008> write(5, "Storage:\t?\n", 11) = 11 <0.000007> write(5, "Instance name: tyjbcdb2\n", 24) = 24 <0.000007> write(5, "Redo thread mounted by this inst"..., 47) = 47 <0.000006> write(5, "Oracle process number: 0\n", 25) = 25 <0.000007> write(5, "Unix process pid: 439527, image:"..., 33) = 33 <0.000006> write(5, "\n", 1) = 1 <0.000006> write(5, "\n", 1) = 1 <0.000006> write(5, "\n*** 2022-11-19T12:45:43.868985+"..., 38) = 38 <0.000007> write(5, "ORA-12753: The Oracle home user "..., 131) = 131 <0.000006> write(5, "\n", 1) = 1 <0.000006> write(5, "ORA-27300: OS system dependent o"..., 91) = 91 <0.000006> write(5, "\n", 1) = 1 <0.000007> write(5, "ORA-27301: OS failure message: E"..., 38) = 38 <0.000007> write(5, "\n", 1) = 1 <0.000007> write(5, "ORA-27302: failure occurred at: "..., 39) = 39 <0.000007> write(5, "\n", 1) = 1 <0.000006> write(5, "ORA-27303: additional informatio"..., 94) = 94 <0.000007> write(5, "\n", 1) = 1 <0.000007>
查看trace file
Dump file /u01/app/oracle/product/19.0.0/dbhome_2/rdbms/log/tyjbcdb2_ora_439527.trc *** 2022-11-19T12:45:43.868985+08:00 ORA-12753: The Oracle home user ID does not match the effective user ID of the local client which connected to the Oracle instance. ORA-27300: OS system dependent operation:check $ORACLE_HOME ownership failed with status: 0 ORA-27301: OS failure message: Error 0 ORA-27302: failure occurred at: spstp:4 ORA-27303: additional information: ORACLE_HOME uid does not match euid. uid: 1002 euid: 1001
检查ORACLE HOME owner uid
[oracle@rac2 19.0.0]$ ls -l total 10510948 drwxrwxr-x 73 grid oinstall 4096 Nov 19 00:13 dbhome_1 [oracle@rac2 19.0.0]$ id oracle uid=1001(oracle) gid=501(oinstall) groups=501(oinstall),502(dba),503(oper),506(asmdba) [oracle@rac2 19.0.0]$ id grid uid=1002(grid) gid=501(oinstall) groups=501(oinstall),502(dba),503(oper),504(asmadmin),505(asmoper),506(asmdba)
Note:
应该是1001 oracle,当前是1002 grid。
修正OH owner
[root@rac2 19.0.0]# chown oracle:oinstall dbhome_1/ [oracle@rac2 ~]$ sqlplus / as sysdba SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 19 13:28:35 2022 Version 19.11.0.0.0 Copyright (c) 1982, 2020, Oracle. All rights reserved. Connected to an idle instance.
Summary:
一个错误引起的一系列“闯关”。源于一个软件最开始的安装与配置, 以后再安装补丁时,记的增加一个检查吧。
— over —
对不起,这篇文章暂时关闭评论。