How to release still “killed“ status session in v$session? (释放killed的session) (三)

How to release still “killed“ status session in v$session?(释放killed的session) (一)
How to release still “killed“ status session in v$session? (释放killed的session) (二)
昨天数据库有3个session记录在v$session一直是KILLED状态，且已经好几个小时，当前的等待事件为”db file sequential read”, 其中有一个session是DML SQL,所以最后在事务结束前的行级锁一直未能释放，影响了操作相同记录的业务，从数据库尝试kill [immeidate]没报错但是无法清理，手动weekup PMON检查PMON trace无报错，通过操作系统层KILL -9 无报错仍无法清理,环境是11.2.0.3 2-NODES RAC ON AIX, 记录一下分析思路。

SQL>@ase

USERNAME      SID       SEQ# EVENT                MACHINE    MODULE      STATUS   LAST_CALL_ET SQL_ID          WAI_SECINW 
---------- ------ ---------- -------------------- ---------- ----------- -------- ------------ --------------- ---------- 
...                                                                    
ANBOB      3510      28110 gc buffer busy acqui kdbxxxx     SQL*xxxxxxx ACTIVE           3458 3dpbdtsg82768   0:0        
ANBOB      6024      24828 enq: TX - row lock c kycxxxx     evenxxxxxxx ACTIVE           9761 drdanx3hhzwc6   0:9760     
ANBOB      6121       1224 read by other sessio kycxxxx     evenxxxxxxx ACTIVE           9767 drdanx3hhzwc6   0:9762     
ANBOB      7180       1768 read by other sessio qmzxxxx     AccSxxxxxxx ACTIVE          29170 bazvmr4hmyprb   0:28806    
ANBOB      5276      19993 db file sequential r kycxxxx     evenxxxxxxx KILLED          29183 drdanx3hhzwc6   0:29183    
ANBOB      2284      38595 db file sequential r kycxxxx     evenxxxxxxx KILLED          29183 3yv2fann81bq7   0:29183    
ANBOB      6959      26745 db file sequential r qmzxxxx     AccSxxxxxxx KILLED          29184 cntmwhgmv5wuw   0:29183    
...


[oracle@weejar1:/home/oracle]$ kill -9 39518840
[oracle@weejar1:/home/oracle]$ ps -ef|grep 39518840
  oracle 44892446 59769246   0 15:33:54  pts/9  0:00 grep 39518840
    grid 39518840        1   0   Jun 07      - 199:30 oracleanbob1 (LOCAL=NO)
[oracle@weejar1:/home/oracle]$


truss -p  pid 
-- hang
procestack  pid
-- hang

oradebug setorapid xx
oradebug errorstack 
-- hang

oracebug short_stack
-- hang

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug dump systemstate 266;
Statement processed.

SQL> oradebug tracefile_name
/oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc

[oracle@weejar1:/home/oracle]$ grep -n 39518840 /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc
5891344:    O/S info: user: grid, term: UNKNOWN, ospid: 39518840 
5891345:    OSD pid info: Unix process pid: 39518840, image: oracle@weejar1
5891348:    Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 39518840, image: oracle@weejar1'; prior command execution time exceeds 30000 ms
5899983:                       owner: 700000db18e19c0 - ospid: 39518840
5900005:    pid               : 39518840


[oracle@weejar1:/home/oracle]$  sed -n 5891330,5891364p  /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_65995280.trc
  (process) Oracle pid:2474, ser:178, calls cur/top: 0x700000d43d97f58/0x700000be07a5dd0
            flags : (0x0) -
            flags2: (0x0),  flags3: (0x10) 
            intr error: 24761, call error: 0, sess error: 0, txn error 0
            intr queue: 24761 2147483676 2147483676 2147483676 2147483676 2147483676 2147483676 
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 178
              last post received-location: kcl2.h LINE:4340 ID:kclcget
              last process to post me: 700000dc96f2df8 1 6
              last post sent: 0 0 159
              last post sent-location: kcb2.h LINE:4139 ID:kcbzww
              last process posted by me: 700000db9866cc8 93 0
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 0x700000dd9a86230
    O/S info: user: grid, term: UNKNOWN, ospid: 39518840 
    OSD pid info: Unix process pid: 39518840, image: oracle@weejar1

*** 2017-06-13 17:08:33.300
    Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 39518840, image: oracle@weejar1'; prior command execution time exceeds 30000 ms
    ----------------------------------------
    SO: 0x700000d7912cdc8, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0
    (FOB) 700000d7912cdc8 flags=2050 fib=700000da3a2c340 incno=0 pending i/o cnt=0
     fname=/dev/rzwb_lv15_104
     fno=377 lblksz=16384 fsiz=983039
    ----------------------------------------
    SO: 0x700000d7b9b2ea0, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0
    (FOB) 700000d7b9b2ea0 flags=2050 fib=700000da39b38b0 incno=0 pending i/o cnt=0
     fname=/dev/rzwb_lv15_106
     fno=61 lblksz=16384 fsiz=983039
    ----------------------------------------
    SO: 0x700000d7c53d2b8, type: 10, owner: 0x700000db18e19c0, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0x700000db18e19c0, name=FileOpenBlock, file=ksfd.h LINE:6448 ID:, pg=0
    (FOB) 700000d7c53d2b8 flags=2050 fib=700000da39dfd60 incno=0 pending i/o cnt=0

Note:
尝试trace进程调用均hang. 下一步使用PS检查进程的STAT列，如果进程状态为D，并且如strace,truss, tuss等工具无输出时，我们是无法kill或恢复进程，表未进程在wait内核调用，KILL 命令无权终止，只能重启操作系统恢复。通常数据库出现该问题时，如果系统资源充足时可能是I/O请求挂起引起，会一直等待下去。

$ ps aux | grep ” D”

SQL> @sw "select sid from v$session where status='KILLED'"

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT P1                  P2                  P3               
------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- -----------------
   6959 WAITING db file sequential read                       26745       39163 file#= 621          block#= 266395      blocks= 1
   2284 WAITING db file sequential read                       38595       39163 file#= 655          block#= 569108      blocks= 1
   5276 WAITING db file sequential read                       19993       39163 file#= 764          block#= 1948676     blocks= 1

SQL> @sw "select sid from v$session where status='KILLED'"

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT P1                  P2                  P3               
------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- -----------------
   6959 WAITING db file sequential read                       26745       39165 file#= 621          block#= 266395      blocks= 1
   2284 WAITING db file sequential read                       38595       39165 file#= 655          block#= 569108      blocks= 1
   5276 WAITING db file sequential read                       19993       39165 file#= 764          block#= 1948676     blocks= 1

SQL> @sw "select sid from v$session where status='KILLED'"

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT P1                  P2                  P3               
------- ------- ---------------------------------------- ---------- ----------- ------------------- ------------------- -----------------
   6959 WAITING db file sequential read                       26745       39166 file#= 621          block#= 266395      blocks= 1
   2284 WAITING db file sequential read                       38595       39166 file#= 655          block#= 569108      blocks= 1
   5276 WAITING db file sequential read                       19993       39166 file#= 764          block#= 1948676     blocks= 1

SQL> select name from v$datafile where file#=621;
NAME
--------------------------------------------------------------
/dev/rzwb_lv15_330

SQL> select name from v$datafile where file#=655;
NAME
--------------------------------------------------------------
/dev/rzwb_lv30_411

SQL> select name from v$datafile where file#=764;
NAME
--------------------------------------------------------------
/dev/rzwb_lv30_525

[oracle@weejar1:/home/oracle]$ lslv zwb_lv15_330
0516-1201 lslv: Warning: Volume group zwb_datavg09 is locked. This command
        will continue retries until lock is free.  If lock is inadvertent
        and needs to be removed, execute 'chvg -u zwb_datavg09'.

[oracle@weejar1:/home/oracle]$ errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
4B436A3D   0613172517 T H fscsi1         LINK ERROR
4B436A3D   0613172517 T H fscsi1         LINK ERROR
4B436A3D   0613172417 T H fscsi1         LINK ERROR
4B436A3D   0613172417 T H fscsi1         LINK ERROR
4B436A3D   0613172417 T H fscsi1         LINK ERROR
4B436A3D   0613172317 T H fscsi1         LINK ERROR

Note:
从数据库等待事件的数据文件查起，确认了该LV及所在的VG已被锁定，同步有4个VG受到影响，第二个节点存储（11个VG）正常。从节点一主机日志及系统确认该主机存储目前有4条存储链路，其中1条存储链路故障（sicsi1 hba卡，且使用POWER存储命令查看该链路有3个读写I/O请求未完成）引起读写异常进而引发VG被锁，进程处于等待I/O资源的核心态而不是用户态，不能接受任何外部信号(包括KILL)。

当然这种情况下最安全的方法是重启主机，当然这个情况数据库需要ABORT才能停下来，且DB INSTANCE和CRS LISTENER停掉后，KILLED状态的OS进程（LOCAL=NO）还一直存在，重启操作系统后，链路和数据库恢复正常。