How to release still “killed“ status session in v$session?(释放killed的session) (一)
How to release still “killed“ status session in v$session? (释放killed的session) (二)
昨天数据库有3个session记录在v$session一直是KILLED状态,且已经好几个小时,当前的等待事件为”db file sequential read”, 其中有一个session是DML SQL,所以最后在事务结束前的行级锁一直未能释放,影响了操作相同记录的业务, 从数据库尝试kill [immeidate]没报错但是无法清理,手动weekup PMON检查PMON trace无报错, 通过操作系统层KILL -9 无报错仍无法清理,环境是11.2.0.3 2-NODES RAC ON AIX, 记录一下分析思路。
SQL>@ase
USERNAME SID SEQ# EVENT MACHINE MODULE STATUS LAST_CALL_ET SQL_ID WAI_SECINW
---------- ------ ---------- -------------------- ---------- ----------- -------- ------------ --------------- ----------
...
ANBOB 3510 28110 gc buffer busy acqui kdbxxxx SQL*xxxxxxx ACTIVE 3458 3dpbdtsg82768 0:0
ANBOB 6024 24828 enq: TX - row lock c kycxxxx evenxxxxxxx ACTIVE 9761 drdanx3hhzwc6 0:9760
ANBOB 6121 1224 read by other sessio kycxxxx evenxxxxxxx ACTIVE 9767 drdanx3hhzwc6 0:9762
ANBOB 7180 1768 read by other sessio qmzxxxx AccSxxxxxxx ACTIVE 29170 bazvmr4hmyprb 0:28806
ANBOB 5276 19993 db file sequential r kycxxxx evenxxxxxxx KILLED 29183 drdanx3hhzwc6 0:29183
ANBOB 2284 38595 db file sequential r kycxxxx evenxxxxxxx KILLED 29183 3yv2fann81bq7 0:29183
ANBOB 6959 26745 db file sequential r qmzxxxx AccSxxxxxxx KILLED 29184 cntmwhgmv5wuw 0:29183
...
[oracle@weejar1:/home/oracle]$ kill -9 39518840
[oracle@weejar1:/home/oracle]$ ps -ef|grep 39518840
oracle 44892446 59769246 0 15:33:54 pts/9 0:00 grep 39518840
grid 39518840 1 0 Jun 07 - 199:30 oracleanbob1 (LOCAL=NO)
[oracle@weejar1:/home/oracle]$
truss -p pid
-- hang
procestack pid
-- hang
oradebug setorapid xx
oradebug errorstack
-- hang
oracebug short_stack
-- hang
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug dump systemstate 266;
Statement processed.
SQL> oradebug tracefile_name
/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc
[oracle@weejar1:/home/oracle]$ grep -n 39518840 /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc
5891344: O/S info: user: grid, term: UNKNOWN, ospid: 39518840
5891345: OSD pid info: Unix process pid: 39518840, image: oracle@weejar1
5891348: Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 39518840, image: oracle@weejar1'; prior command execution time exceeds 30000 ms
5899983: owner: 700000db18e19c0 - ospid: 39518840
5900005: pid : 39518840
[oracle@weejar1:/home/oracle]$ sed -n 5891330,5891364p /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc
(process) Oracle pid:2474, ser:178, calls cur/top: 0x700000d43d97f58/0x700000be07a5dd0
flags : (0x0) -
flags2: (0x0), flags3: (0x10)
intr error: 24761, call error: 0, sess error: 0, txn error 0
intr queue: 24761 2147483676 2147483676 2147483676 2147483676 2147483676 2147483676
ksudlp FALSE at location: 0
(post info) last post received: 0 0 178
last post received-location: kcl2.h LINE:4340 ID:kclcget
last process to post me: 700000dc96f2df8 1 6
last post sent: 0 0 159
last post sent-location: kcb2.h LINE:4139 ID:kcbzww
last process posted by me: 700000db9866cc8 93 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 0x700000dd9a86230
O/S info: user: grid, term: UNKNOWN, ospid: 39518840
OSD pid info: Unix process pid: 39518840, image: oracle@weejar1
*** 2017-06-13 17:08:33.300
Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 39518840, image: oracle@weejar1'; prior command execution time exceeds 30000 ms
----------------------------------------
SO: 0x700000d7912cdc8, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0
(FOB) 700000d7912cdc8 flags=2050 fib=700000da3a2c340 incno=0 pending i/o cnt=0
fname=/dev/rzwb_lv15_104
fno=377 lblksz=16384 fsiz=983039
----------------------------------------
SO: 0x700000d7b9b2ea0, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0
(FOB) 700000d7b9b2ea0 flags=2050 fib=700000da39b38b0 incno=0 pending i/o cnt=0
fname=/dev/rzwb_lv15_106
fno=61 lblksz=16384 fsiz=983039
----------------------------------------
SO: 0x700000d7c53d2b8, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0
(FOB) 700000d7c53d2b8 flags=2050 fib=700000da39dfd60 incno=0 pending i/o cnt=0
Note:
尝试trace进程调用均hang. 下一步使用PS检查进程的STAT列,如果进程状态为D,并且如strace,truss, tuss等工具无输出时,我们是无法kill或恢复进程,表未进程在wait内核调用,KILL 命令无权终止,只能重启操作系统恢复。通常数据库出现该问题时,如果系统资源充足时可能是I/O请求挂起引起, 会一直等待下去。
$ ps aux | grep ” D”
SQL> @sw "select sid from v$session where status='KILLED'"
SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3
------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- -----------------
6959 WAITING db file sequential read 26745 39163 file#= 621 block#= 266395 blocks= 1
2284 WAITING db file sequential read 38595 39163 file#= 655 block#= 569108 blocks= 1
5276 WAITING db file sequential read 19993 39163 file#= 764 block#= 1948676 blocks= 1
SQL> @sw "select sid from v$session where status='KILLED'"
SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3
------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- -----------------
6959 WAITING db file sequential read 26745 39165 file#= 621 block#= 266395 blocks= 1
2284 WAITING db file sequential read 38595 39165 file#= 655 block#= 569108 blocks= 1
5276 WAITING db file sequential read 19993 39165 file#= 764 block#= 1948676 blocks= 1
SQL> @sw "select sid from v$session where status='KILLED'"
SID STATE EVENT SEQ# SEC_IN_WAIT P1 P2 P3
------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- -----------------
6959 WAITING db file sequential read 26745 39166 file#= 621 block#= 266395 blocks= 1
2284 WAITING db file sequential read 38595 39166 file#= 655 block#= 569108 blocks= 1
5276 WAITING db file sequential read 19993 39166 file#= 764 block#= 1948676 blocks= 1
SQL> select name from v$datafile where file#=621;
NAME
--------------------------------------------------------------
/dev/rzwb_lv15_330
SQL> select name from v$datafile where file#=655;
NAME
--------------------------------------------------------------
/dev/rzwb_lv30_411
SQL> select name from v$datafile where file#=764;
NAME
--------------------------------------------------------------
/dev/rzwb_lv30_525
[oracle@weejar1:/home/oracle]$ lslv zwb_lv15_330
0516-1201 lslv: Warning: Volume group zwb_datavg09 is locked. This command
will continue retries until lock is free. If lock is inadvertent
and needs to be removed, execute 'chvg -u zwb_datavg09'.
[oracle@weejar1:/home/oracle]$ errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
4B436A3D 0613172517 T H fscsi1 LINK ERROR
4B436A3D 0613172517 T H fscsi1 LINK ERROR
4B436A3D 0613172417 T H fscsi1 LINK ERROR
4B436A3D 0613172417 T H fscsi1 LINK ERROR
4B436A3D 0613172417 T H fscsi1 LINK ERROR
4B436A3D 0613172317 T H fscsi1 LINK ERROR
Note:
从数据库等待事件的数据文件查起,确认了该LV及所在的VG已被锁定,同步有4个VG受到影响, 第二个节点存储(11个VG)正常。 从节点一主机日志及系统确认该主机存储目前有4条存储链路, 其中1条存储链路故障(sicsi1 hba卡,且使用POWER存储命令查看该链路有3个读写I/O请求未完成)引起读写异常进而引发VG被锁,进程处于等待I/O资源的核心态而不是用户态,不能接受任何外部信号(包括KILL)。
当然这种情况下最安全的方法是重启主机, 当然这个情况数据库需要ABORT才能停下来,且DB INSTANCE和CRS LISTENER停掉后,KILLED状态的OS进程(LOCAL=NO)还一直存在,重启操作系统后,链路和数据库恢复正常。