Troubleshooting Active Dataguard Hangs waiting for library cache lock on DBINSTANCE namespace
Oracle 11.2.0.3 2nodes RAC on hpux环境, standby dataguard端突然产生较大GAP,日志停止应用,大量前台查询进程等待”library cache load lock”, hanganalyze trace显示’rdbms ipc message'<=’library cache lock’ 等待链, library cache lock是一个parse lock,当在ADG环境遇到library cache lock时通常发现在以下情况:primary site上做了DDL; 系统权限做了revoke. 而对于这个问题我们尝试重启日志应用进程都没有解决,无奈当时重启standby实例恢复,后分析确认在ADG环境中存在一个相关bug, 安装patch后就可以彻底解决,当然如果是primary 就是频繁的DDL,而导致standby只是争用这个patch可能无用。这里记录一下这个问题。
这类问题通常建议收集:
hanganalyze, systemstate dump, v$ash;
— hanganalye
Chain 1: ------------------------------------------------------------------------------- Oracle session identified by: { instance: 1 (stdanbob.anbob1) os id: 22931 process id: 135, oracle@weejar1 session id: 2479 session serial #: 1571 } is waiting for 'library cache lock' with wait info: { p1: 'handle address'=0xc0000015bff7f678 p2: 'lock address'=0xc000001334cffad8 p3: '100*mode+namespace'=0x1004a0002 time in wait: 21.353805 sec timeout after: 14 min 38 sec wait id: 2 blocking: 0 sessions wait history: * time between current wait and wait #1: 0.143108 sec 1. event: 'SQL*Net message from client' time waited: 0.001206 sec wait id: 1 p1: 'driver id'=0x54435000 p2: '#bytes'=0x1 * time between wait #1 and #2: 0.000013 sec 2. event: 'SQL*Net message to client' time waited: 0.000001 sec wait id: 0 p1: 'driver id'=0x54435000 p2: '#bytes'=0x1 } and is blocked by => Oracle session identified by: { instance: 1 (stdanbob.anbob1) os id: 12817 process id: 32, oracle@weejar1 (LGWR) session id: 11265 session serial #: 1 } which is waiting for 'rdbms ipc message' with wait info: { p1: 'timeout'=0x12c time in wait: 2.488548 sec heur. time in wait: 8.508734 sec timeout after: 0.511452 sec wait id: 586406071 blocking: 7 sessions wait history: * time between current wait and wait #1: 0.000089 sec 1. event: 'rdbms ipc message' time waited: 3.010034 sec wait id: 586406070 p1: 'timeout'=0x12c * time between wait #1 and #2: 0.000065 sec 2. event: 'rdbms ipc message' time waited: 3.009998 sec wait id: 586406069 p1: 'timeout'=0x12c * time between wait #2 and #3: 0.000025 sec 3. event: 'rdbms ipc message' time waited: 0.687558 sec wait id: 586406068 p1: 'timeout'=0x44 } Chain 1 Signature: 'rdbms ipc message'<='library cache lock'
# SSD trace
Blockers ~~~~~~~~ Above is a list of all the processes. If they are waiting for a resource then it will be given in square brackets. Below is a summary of the waited upon resources, together with the holder of that resource. Notes: ~~~~~ o A process id of '???' implies that the holder was not found in the systemstate. (The holder may have released the resource before we dumped the state object tree of the blocking process). o Lines with 'Enqueue conversion' below can be ignored *unless* other sessions are waiting on that resource too. For more, see http://gbr30026.uk.oracle.com:81/Public/TOOLS/Ass.html#enqcnv) o You might see a process blocked on a mutex wait event but also reported as holding the same mutex. You will need to check the processstate dump as we might have been waiting at the start of the process dump but have acquired it before the dump completed. Resource Holder State Mutex 13e4d 10: Blocker Mutex 13e4d 39: Blocker LOCK: Handle=c0000015bff7f678 32: waiting for 'rdbms ipc message' LOAD: c0000013bcdaf710 ??? Blocker LOAD: c0000013bcdab630 361: 361: is waiting for 32: LOAD: c0000013bcd70ac8 667: 667: is waiting for 32: LOAD: c0000013e2cfa070 918: 918: is waiting for 32: LOAD: c0000013beff2170 943: 943: is waiting for 32: LOAD: c0000013e2d60cd8 954: 954: is waiting for 32: LOAD: c0000013e2d60ef8 950: 950: is waiting for 32: LOAD: c0000013e2cfa290 704: 704: is waiting for 32: LOAD: c0000013e2e5e690 910: 910: is waiting for 32: LOAD: c0000013bde32768 650: 650: is waiting for 32: LOAD: c0000012fb399e90 988: 988: is waiting for 32: LOAD: c0000012fb6ef678 ??? Blocker LOAD: c0000013bcdf7880 1008: 1008: is waiting for 32: LOAD: c0000013bcdf7660 1014: 1014: is waiting for 32: PID to SID Mapping ~~~~~~~~~~~~~~~~~~ Pid 10 maps to Sid(s): 3521 Pid 39 maps to Sid(s): 13729 13730 Pid 32 maps to Sid(s): 11265 Pid 361 maps to Sid(s): 14434 Pid 667 maps to Sid(s): 9522 9525 Pid 918 maps to Sid(s): 7770 7746 Pid 943 maps to Sid(s): 16554 Pid 954 maps to Sid(s): 20422 Pid 950 maps to Sid(s): 19019 Pid 704 maps to Sid(s): 27 1 Pid 910 maps to Sid(s): 4934 4953 Pid 650 maps to Sid(s): 3531 3532 Pid 988 maps to Sid(s): 9884 Pid 1008 maps to Sid(s): 16909 Pid 1014 maps to Sid(s): 19037 Warning: The following processes have multiple session state objects and may not be properly represented above : 42: 52: 273: Object Names ~~~~~~~~~~~~ Mutex 13e4d LOCK: Handle=c0000015bff7f678 CURSOR(00):SYS.anbob LOAD: c0000013bcdaf710 LOAD: c0000013bcdab630 LOAD: c0000013bcd70ac8 LOAD: c0000013e2cfa070 LOAD: c0000013beff2170 LOAD: c0000013e2d60cd8 LOAD: c0000013e2d60ef8 LOAD: c0000013e2cfa290 LOAD: c0000013e2e5e690 LOAD: c0000013bde32768 LOAD: c0000012fb399e90 LOAD: c0000012fb6ef678 LOAD: c0000013bcdf7880 LOAD: c0000013bcdf7660 Summary of Wait Events Seen (count>10) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ No wait events seen more than 10 times ---------------------------------------- SO: 0xc0000016536b1cc8, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc0000016536b1cc8, name=process, file=ksu.h LINE:12616 ID:, pg=0 (process) Oracle pid:32, ser:1, calls cur/top: 0xc0000016164719f8/0xc0000016164719f8 flags : (0x6) SYSTEM flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 35 last post received-location: ksr2.h LINE:627 ID:ksrpublish last process to post me: c0000016236326c0 1 6 last post sent: 0 0 36 last post sent-location: ksr2.h LINE:631 ID:ksrmdone last process posted by me: c0000016236326c0 1 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 0xc000001653f68b68 O/S info: user: oracle, term: UNKNOWN, ospid: 12817 OSD pid info: Unix process pid: 12817, image: oracle@kdyya1 (LGWR) Short stack dump: ksedsts()+544<-ksdxfstk()+48<-ksdxcb()+3216<-sspuser()+688<-<-_pw_wait()+48<-pw_wait()+112<-sskgpwwait()+432<-skgpwwait()+320<-ksliwat()+3328<-kslwaitctx()+304<-kslwait()+192<-ksarcv()+640<-ks babs()+752<-ksbrdp()+2736<-opirip()+1296<-opidrv()+1152<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+576<-main()+336<-main_opd_entry()+80 ---------------------------------------- ---------------------------------------- SO: 0xc0000016552ca578, type: 4, owner: 0xc0000016536b1cc8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc0000016536b1cc8, name=session, file=ksu.h LINE:12624 ID:, pg=0 (session) sid: 11265 ser: 1 trans: 0x0000000000000000, creator: 0xc0000016536b1cc8 flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x409) -/-/INC DID: , short-term DID: txn branch: 0x0000000000000000 oct: 0, prv: 0, sql: 0x0000000000000000, psql: 0x0000000000000000, user: 0/SYS ksuxds FALSE at location: 0 service name: SYS$BACKGROUND Current Wait Stack: 0: waiting for 'rdbms ipc message' timeout=0xe7, =0x0, =0x0 wait_id=586406182 seq_num=64592 snap_id=1 wait times: snap=0.027546 sec, exc=0.027546 sec, total=0.027546 sec wait times: max=2.310000 sec, heur=0.027546 sec wait counts: calls=1 os=1 in_wait=1 iflags=0x5a8 There are 1 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 20067, ser: 55 wait event: 'library cache lock' p1: 'handle address'=0xc0000015bff7f678 p2: 'lock address'=0xc0000013dad14ef8 p3: '100*mode+namespace'=0x1004a0002 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 12 secs, waiter_cache_ver: 16092 Wait State: fixed_waits=0 flags=0x22 boundary=0x0000000000000000/-1 Session Wait History: ---------------------------------------- SO: 0xc0000016164719f8, type: 3, owner: 0xc0000016536b1cc8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc0000016536b1cc8, name=call, file=ksu.h LINE:12620 ID:, pg=0 (call) sess: cur c0000016552ca578, rec 0, usr c0000016552ca578; flg:20 fl2:1; depth:0 svpt(xcb:0x0000000000000000 sptn:0x2 uba: 0x00000000.0000.00) ksudlc FALSE at location: 0 ---------------------------------------- SO: 0xc0000015cff543b8, type: 78, owner: 0xc0000016164719f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc0000016536b1cc8, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8547 ID:, pg=0 LibraryObjectLock: Address=c0000015cff543b8 Handle=c0000015bff7f678 Mode=X CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0 User=c0000016552ca578 Session=c0000016552ca578 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=2 LibraryHandle: Address=c0000015bff7f678 Hash=d68d5950 LockMode=X PinMode=0 LoadLockMode=0 Status=0 ObjectName: Name=SYS.anbob FullHashValue=93b41a89a2321a9b51ff1c32d68d5950 Namespace=DBINSTANCE(74) Type=CURSOR(00) Identifier=1 OwnerIdn=0 Statistics: InvalidationCount=0 ExecutionCount=0 LoadCount=0 ActiveLocks=1 TotalLockCount=3424094 TotalPinCount=0 Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=0 BucketInUse=7020 HandleInUse=7020 HandleReferenceCount=0 Concurrency: DependencyMutex=c0000015bff7f728(0, 0, 0, 0) Mutex=c0000015bff7f7a8(0, 7048705, 1949, 0) Flags=RON/PIN/KEP/BSO/[00810003] WaitersLists: Lock=c0000015bff7f708[c0000013dad14f68,c0000013dad14f68] Pin=c0000015bff7f6e8[c0000015bff7f6e8,c0000015bff7f6e8] LoadLock=c0000015bff7f760[c0000015bff7f760,c0000015bff7f760] Timestamp: HandleReference: Address=c0000015bff7f818 Handle=0000000000000000 Flags=[00]
Note:
在当前实例Lgwr 持有DBINSTANCE namespace library cache lock并且是 X mode。LGWR本身是空闲状态event为rdbms ipc message
,等信号调用, 但是它会堵塞其它会话,在”library cache lock” for the DBINSTANCE namespace lock。
在11.2.0.3版本仅当 bug 11664426 的修复到位时才会发生这种情况,更多参考MOS Bug 17292239 Active Dataguard Hangs waiting for library cache lock on DBINSTANCE. Tracking bug to deliver backports to remove fix for 11664426。
触发场景通常是, 当Primary数据库上存在大量 DDL 时,Standby Active Dataguard 数据库可能会遇到严重的library cache lock争用,并可能导致死锁,从而导致数据库挂起。
因为11.2.0.3 中安装bug 11664426.才引起了该问题,所在Oracle紧急出了Bug 17292239补丁删除了11664426的修复,出了新的Bug 16717701 ,如果修复了错误 16717701,ADG 解析锁获取超时可能会失败,并出现 ORA-600 [krdrsb_end_qscn_2] 或 ORA-4021存在并且实例可能会崩溃。引出新的问题Bug:17018214 ,所以Bug 16717701修复再次叫停,被Bug:17018214取代。
修复不易呀 11664426》17292239》16717701》17018214
从SSD中可以看到LGWR holding “DBINSTANCE” namespace library cache lock in X-mode, MOS给出Cancel the media recovery and restart临时解决方法在我们这个例子是不可行的,但是有测试cancel后,flush 所有实例的shared pool再次应用日志有时可以解决。 彻底的解决方法是在standby db安装patch 17018214.
注意安装16717701 patch后所有版本默认fix是Disable的,需要使用event 16717701使修复生效,level 是一个参数值,在readme中也有介绍该值的计算方法, 如果在没有安装oneoff patch 启用event 16717701不报错,但是同样不会有任何作用。
For backports/one-off patches, the fix must be enabled by setting event 16717701 at level 104887600:
alter system set EVENT='16717701 trace name context forever, level 104887600' scope=spfile;
The 104887600 value of the event encodes two things:
1. Timeout – this is the amount of time that LGWR will wait for an X lock before signaling a timeout error and retrying.
2. Sleep duration – this is the amount of time that LGWR will sleep for after having timed out.
Both of the above values are denoted in MILLISECONDS. The event value encodes the sleep duration in its 12high-order bits and the timeout in the 20 lower order bits. The value can be calculated using the following formula:
value = (S * 1048576) + T
where S = sleep duration in milliseconds
T = timeout in milliseconds
当然这个值可以进行微调,根据oracle的建议是开始配置30 second timeout 和 100 ms sleep duration,也就是说LGWR会等待30秒获取X-lock, 如时失败会在sleeping 100秒秒后重试,使用上面的计算方法,event的level值计算就是:
value = (100 * 1048576) + 30000 = 104887600
在 RAC ADG 环境中这个event必须在所有使用使用相同的值,而且不支持滚动修改。
目前来看安装16717701再配置event启用已经没有必要,安装17018214即可。这个bug在11.2.0.4 和 12c已修复,同时引入了隐藏参数控制 “_adg_parselock_timeout“和”_adg_parselock_timeout_sleep “.
SQL> @pd parselock Show all parameters and session values from x$ksppi/x$ksppcv... NUM N_HEX NAME VALUE DESCRIPTION ---------- ----- ---------------------------------- ------------ ------------------------------------------------------------------ 2345 929 _adg_parselock_timeout 0 timeout for parselock get on ADG in centiseconds 2347 92B _adg_parselock_timeout_sleep 100 sleep duration after a parselock timeout on ADG in milliseconds
The value of 550 centiseconds has been typically enough (550*0.01=5.5 seconds): “_adg_parselock_timeout”=550;
it can be adjusted to the recommended value of 30 seconds; “_adg_parselock_timeout”= 3000.
There should be a statistic “ADG parselock X get attempts” If it gets set too small that value would likely increase a lot due to keep timing out and retrying.
— 2024-1-17 update —
注意上面的案列前台进程被LGWR 堵塞,LGWR持有dbinstance X mode的library cache lock, 还有一种情况是LGWR请求library cache lock 被前台进程会话堵塞,现象就是LGWR 挂起DG的scn无法增加,产生同步延迟。 可以做hanganalyze和systemstate dump。
PROCESS 59: LGWR ---------------------------------------- SO: 0x1200008528, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x1200008528, name=process, file=ksu.h LINE:12721, pg=1 (process) Oracle pid:59, ser:1, calls cur/top: 0x11d36ee730/0x11d36ee730 OSD pid info: Unix process pid: 293835, image: xxxxxxxxxxxxx (LGWR) SO: 0x1200548dd8, type: 4, owner: 0x1200008528, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x1200008528, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 827 ser: 1 trans: (nil), creator: 0x1200008528 service name: SYS$BACKGROUND Current Wait Stack: 0: waiting for 'library cache lock' handle address=0x18bfd0cb80, lock address=0x191ff3f030, 100*mode+namespace=0x1004a0003 wait_id=991 seq_num=1014 snap_id=1 wait times: snap=1.829312 sec, exc=1.829312 sec, total=1.829312 sec wait times: max=5.500000 sec, heur=1.829312 sec wait counts: calls=2 os=2 in_wait=1 iflags=0x15a2 There is at least one session blocking this session. Dumping 1 direct blocker(s): inst: 1, sid: 996, ser: 5 LibraryObjectLock: Address=0x191ff3f030 Handle=0x18bfd0cb80 RequestMode=X CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0 User=0x1200548dd8 Session=0x1200548dd8 ReferenceCount=0 Flags=[0000] SavepointNum=2 LibraryHandle: Address=0x18bfd0cb80 Hash=548dac5f LockMode=S PinMode=0 LoadLockMode=0 Status=0 ObjectName: Name=SYS.xxxxxxx FullHashValue=2499dcf083f8e14276cd0eaf548dac5f Namespace=DBINSTANCE(74) Type=CURSOR(00) Identifier=1 OwnerIdn=0 PROCESS 107: ---------------------------------------- SO: 0x120000e978, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x120000e978, name=process, file=ksu.h LINE:12721, pg=1 (process) Oracle pid:107, ser:3, calls cur/top: 0x11d36f2680/0x11d36f20c0 OSD pid info: Unix process pid: 318247, image: oracle@xxxxxxxxxxxxxxx (TNS V1-V3) Short stack dump: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-kgh_heap_sizes()+267<-kkoDumpHeapSize()+92<-kkoqbc()+32778<-apakkoqb()+171<-apaqbdDescendents()+452<-apaqbdList()+76<-apaqbdDescendents()+772<-apaqbdList()+76<-apaqbdDescendents()+267<-apaqbdList()+76<-apaqbdDescendents()+772<-apaqbdList()+76<-apaqbdDescendents()+267<-apaqbdList()+76<-apaqbdDescendents()+772<-apaqbdList()+76<-apaqbdDescendents()+267<-apaqbdList()+76<-apaqbdDescendents()+772<-apaqbdList()+76<-apaqbdDescendents()+267<-apaqbdList()+76<-apaqbdDescendents()+267<-apaqbdList()+76<-apaqbd()+14<-kkoggvc()+628<-kkogvcd()+582<-kkqljppuvs()+2612<-kkqljppur()+1766<-kkqfppDrvDescendents()+1123<-kkqfppDrv()+184<-kkqfppPsh()+480<-kkqfppPsh()+5020<-kkqfppDrv1()+384<-kkqfppDrvDescendents()+1206<-kkqfppDrv()+184<-kkqfppPsh()+480<-kkqfppPsh()+5020<-kkqfppDrv1()+384<-kkqfppDrvDescendents()+1206<-kkqfppDrv()+184<-kkqfppPsh()+480<-kkqfppPsh()+5020<-kkqfppDrv1()+384<-kkqfppDrvDescendents()+1206<-kkqfppDrv()+184<-kkqfppPsh()+480<-kkqfppDrv1()+384<-kkqfppDrvDescendents()+1206<-kkqfppDrv()+184<-opitca()+1163<-kksFullTypeCheck()+69<-rpiswu2()+1776<-kksLoadChild()+11468<-kxsGetRuntimeLock()+2209<-kksfbc()+15108<-kkspsc0()+1171<-kksParseCursor()+116<-opiosq0()+2019<-kpooprx()+274<-kpoal8()+842<-opiodr()+915<-ttcpip()+2183<-opitsk()+1705<-opiino()+969<-opiodr()+915<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253 SO: 0x1200746ff8, type: 4, owner: 0x120000e978, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x120000e978, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 996 ser: 5 trans: (nil), creator: 0x120000e978 Current Wait Stack: Not in wait; last wait ended 7 min 22 sec ago There are 1 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 827, ser: 1 wait event: 'library cache lock' p1: 'handle address'=0x18bfd0cb80 p2: 'lock address'=0x191ff3f030 p3: '100*mode+namespace'=0x1004a0003 The user process can have a different stack as it depends on the parsing issue; e.g. Short stack dump: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-qksqbCheckNameConflict()+51<-qksqbCreateName()+1459<-qksqbGenerateName()+1109<-kkqvmTrMrg()+2678<-kkqvmTrMrg()+15502<-kkqvmTrMrg()+10338<-kk
当时是因为前端进程一直在sql的parsing 阶段未成功, 取前端进程call stack, 根据parse特性尝试禁用解析器部分特性。
对不起,这篇文章暂时关闭评论。