Troubleshooting TNS-12547 & TNS-12560, AIX error 32: Broken pipe, caused by a TCP socket leak
A strange case came up last week; I have been busy lately, so here is a brief write-up today. The middleware team reported database connection failures. Running lsnrctl status on the database server showed the listener alternating between working and failing, yet whenever it was healthy the listener's start date showed that it had not been restarted. The error was:
[oracle@anbob1:/home/oracle]# /oracle/app/11.2.0.3/grid/bin/lsnrctl status

LSNRCTL for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production on 04-JUL-2015 19:37:45

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
TNS-12547: TNS:lost contact
 TNS-12560: TNS:protocol adapter error
  TNS-00517: Lost contact
   IBM/AIX RISC System/6000 Error: 32: Broken pipe
Note:
This is a 2-node RAC on AIX 6 with 11.2.0.3 Grid Infrastructure and a 10.2.0.5 database; nothing had been changed on the database or on the host. The message "IBM/AIX RISC System/6000 Error: 32: Broken pipe" points more toward an OS resource limit, rather than the causes seen in some other cases, such as the listener.log size limit or wrong permissions on the oracle executable.
The error 32 indicates the communication has been broken while the listener is trying to hand off the client connection to the server process or dispatcher process.
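Before digging further, it can help to confirm how often the hand-off failure is actually happening. A minimal sketch, assuming the 11.2 default ADR text-log location under the grid owner's ORACLE_BASE (the path is my assumption; adjust LSNR_LOG to your own listener.log):

# Assumed path: the 11.2 ADR text log under the grid owner's ORACLE_BASE; adjust to your listener.
LSNR_LOG=/oracle/app/grid/diag/tnslsnr/anbob1/listener/trace/listener.log

# total number of hand-off failures so far, and the most recent ones
egrep -c "TNS-12547|Broken pipe" $LSNR_LOG
grep -n "Broken pipe" $LSNR_LOG | tail -5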
[oracle@anbob1:/oracle/app/oracle/product/10.2.0/db/OPatch]# ./opatch lsinventory
Invoking OPatch 10.2.0.5.1

Oracle Interim Patch Installer version 10.2.0.5.1
Copyright (c) 2010, Oracle Corporation.  All rights reserved.

Oracle Home       : /oracle/app/oracle/product/10.2.0/db
Central Inventory : /oracle/app/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 10.2.0.5.1
OUI version       : 10.2.0.5.0
OUI location      : /oracle/app/oracle/product/10.2.0/db/oui
Log file location : /oracle/app/oracle/product/10.2.0/db/cfgtoollogs/opatch/opatch2015-07-06_14-42-44PM.log

Patch history file: /oracle/app/oracle/product/10.2.0/db/cfgtoollogs/opatch/opatch_history.txt

Lsinventory Output file location : /oracle/app/oracle/product/10.2.0/db/cfgtoollogs/opatch/lsinv/lsinventory2015-07-06_14-42-44PM.txt

--------------------------------------------------------------------------------
Installed Top-level Products (2):

Oracle Database 10g                                        10.2.0.1.0
Oracle Database 10g Release 2 Patch Set 4                  10.2.0.5.0
There are 2 products installed in this Oracle Home.

Interim patches (2) :

Patch  9952245  : applied on Thu Jan 02 21:23:40 GMT+08:00 2014
Unique Patch ID:  13348225
   Created on 17 Jan 2011, 04:22:06 hrs UTC
   Bugs fixed:
     9952245, 9879181, 9610672, 9872299, 9577921, 8308495, 9103966, 8996060
     6899183, 9261837, 8487143, 9646836

Patch  13632743 : applied on Thu Jan 02 19:58:43 GMT+08:00 2014

[grid@anbob1:/home/grid]# crsctl stat res ora.LISTENER.lsnr -n anbob1 -p
NAME=ora.LISTENER.lsnr
TYPE=ora.listener.type
ACL=owner:grid:rwx,pgrp:oinstall:rwx,other::r--
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=%CRS_HOME%/bin/racgwrap%CRS_SCRIPT_SUFFIX%
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
ALIAS_NAME=ora.%CRS_CSS_NODENAME_LOWER_CASE%.LISTENER_%CRS_CSS_NODENAME_UPPER_CASE%.lsnr
AUTO_START=restore
CHECK_INTERVAL=60
CHECK_TIMEOUT=30
DEFAULT_TEMPLATE=PROPERTY(RESOURCE_CLASS=listener) PROPERTY(LISTENER_NAME=PARSE(%NAME%, ., 2))
DEGREE=1
DESCRIPTION=Oracle Listener resource
ENABLED=1
ENDPOINTS=TCP:1521
LOAD=1
LOGGING_LEVEL=1
NLS_LANG=
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
ORACLE_HOME=%CRS_HOME%
PORT=1521
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
START_DEPENDENCIES=hard(type:ora.cluster_vip_net1.type) pullup(type:ora.cluster_vip_net1.type)
START_TIMEOUT=180
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(intermediate:type:ora.cluster_vip_net1.type)
STOP_TIMEOUT=0
TYPE_VERSION=1.2
UPTIME_THRESHOLD=1d
USR_ORA_ENV=ORACLE_BASE=/oracle/app/grid
USR_ORA_OPI=false
VERSION=11.2.0.3.0
NOTE:
My first step was to restart the listener. Neither srvctl stop listener nor lsnrctl stop shut it down completely; the listener process kept running, so I killed it at the OS level. On startup it threw the same error as above, while the listener on the other node remained fine. After several attempts the listener finally started. Checking the listener again:
[oracle@anbob1:/home/oracle]# /oracle/app/11.2.0.3/grid/bin/lsnrctl status

LSNRCTL for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production on 04-JUL-2015 19:37:36

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Start Date                04-JUL-2015 19:11:19
Uptime                    0 days 0 hr. 26 min. 17 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      ON
Listener Parameter File   /oracle/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /oracle/app/grid/diag/tnslsnr/anbob1/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=133.96.60.19)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=133.96.60.119)(PORT=1521)))
Services Summary...
Service "anbob.com" has 1 instance(s).
  Instance "anbob1", status READY, has 1 handler(s) for this service...
Service "anbob_XPT.anbob.com" has 1 instance(s).
  Instance "anbob1", status READY, has 1 handler(s) for this service...
The command completed successfully

[oracle@anbob1:/home/oracle]# /oracle/app/11.2.0.3/grid/bin/lsnrctl status

LSNRCTL for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production on 04-JUL-2015 19:38:08

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
TNS-12547: TNS:lost contact
 TNS-12560: TNS:protocol adapter error
  TNS-00517: Lost contact
   IBM/AIX RISC System/6000 Error: 32: Broken pipe

[oracle@anbob1:/home/oracle]# /oracle/app/11.2.0.3/grid/bin/lsnrctl status

LSNRCTL for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production on 04-JUL-2015 19:38:10

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Start Date                04-JUL-2015 19:11:19
Uptime                    0 days 0 hr. 26 min. 51 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      ON
Listener Parameter File   /oracle/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /oracle/app/grid/diag/tnslsnr/anbob1/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=133.96.60.19)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=133.96.60.119)(PORT=1521)))
Services Summary...
Service "anbob.com" has 1 instance(s).
  Instance "anbob1", status READY, has 1 handler(s) for this service...
Service "anbob_XPT.anbob.com" has 1 instance(s).
  Instance "anbob1", status READY, has 1 handler(s) for this service...
The command completed successfully

+-topas_nmon--k=Kernel-stats-----Host=anbob1---------Refresh=2 secs---15:53.33------------------+
| Memory ----------------------------------------------------------------------------------------|
|           Physical  PageSpace |        pages/sec  In     Out | FileSystemCache |
|% Used        72.9%      0.9%  | to Paging Space   0.0    0.0 | (numperm)  7.7% |
|% Free        27.1%     99.1%  | to File System    2.0    0.0 | Process   53.5% |
|GB Used      87.5GB     0.3GB  | Page Scans        0.0        | System    11.7% |
|GB Free      32.5GB    31.7GB  | Page Cycles       0.0        | Free      27.1% |
|Total(GB)   120.0GB    32.0GB  | Page Steals       0.0        |          ------ |
|                               | Page Faults    2009.9        | Total    100.0% |
|-----------------------------------------------------------   | numclient  7.7% |
|Min/Maxperm     3574MB(  3%)  9532MB(  8%) <--% of RAM        | maxclient  8.0% |
|Min/Maxfree     960   1088     Total Virtual   152.0GB        | User      58.3% |
|Min/Maxpgahead    2      8     Accessed Virtual  77.6GB 51.0% | Pinned    14.4% |
|                               | lruable pages  30502384.0    |                 |
|------------------------------------------------------------------------------------------------|

sys@anbob> select * from v$resource_limit;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION      INITIAL_AL LIMIT_VALU
------------------------------ -------------------- -------------------- ---------- ----------
processes                                      1058                 1968 4000       4000
sessions                                       1081                 1993 4405       4405
enqueue_locks                                   507                 1244 53190      53190
enqueue_resources                               520                  885 20424      UNLIMITED
ges_procs                                      1057                 1966 4001       4001
ges_ress                                          0                    0 82503      UNLIMITED
ges_locks                                         0                    0 128193     UNLIMITED
ges_cache_ress                                 6730                63814 0          UNLIMITED
ges_reg_msgs                                    942                 3388 8750       UNLIMITED
ges_big_msgs                                    283                 2190 1934       UNLIMITED
ges_rsv_msgs                                      0                    0 1000       1000
gcs_resources                               1309030              2762250 2851766    2851766
gcs_shadows                                  787290               961880 2851766    2851766
dml_locks                                         7                  780 19380      UNLIMITED
temporary_table_locks                             0                  361 UNLIMITED  UNLIMITED
transactions                                      6               702674 4845       UNLIMITED
branches                                       1556                 3125 4845       UNLIMITED
cmtcallbk                                         0                   14 4845       UNLIMITED
sort_segment_locks                                1                  136 UNLIMITED  UNLIMITED
max_rollback_segments                          3314                 3314 4845       65535
max_shared_servers                                0                    0 UNLIMITED  UNLIMITED
parallel_max_servers                             17                  246 40         3600

22 rows selected.
Note:
The listener problem was still there. Forgive me for pasting so much output above; it is only to show how odd the behaviour was. CPU and memory were plentiful and the database was nowhere near its connection limits. On MOS, note 1084443.1 describes a symptom that is very close: AIX 6 introduced a new feature, "TCP Traffic Regulation", which restricts TCP socket usage over a range of ports; when a client tries to open a new socket and the configured maximum has been reached, the error above is raised. Use the following command to check whether the feature is enabled:
tcptr -show
In my environment, however, no TCP Traffic Regulation policy was enabled. Still, it at least pointed the investigation in a direction, so the next step was to look at the current number of TCP connections.
[oracle@anbob1:/home/oracle]# ps -ef|wc -l
    1207

anbob1:/home/grid> netstat -an|wc -l
   62954                                    <<<<<<<

anbob1:/home/grid> netstat -an|grep 61020|wc -l
   56361                                    <<<<<<<

anbob1:/> lsattr -E -l sys0 | grep maxuproc
maxuproc  16384  Maximum number of PROCESSES allowed per user  True

[oracle@anbob1:/home/oracle]# /usr/sbin/no -a | fgrep ephemeral
       tcp_ephemeral_high = 65500
        tcp_ephemeral_low = 9000
       udp_ephemeral_high = 65500
        udp_ephemeral_low = 9000
Note:
The connection count shown here is very high; on the other node the network connection count is only around 5,000. Looking at the details, the screen was full of TCP connections from the local private interconnect IP (see below), all of them going to port 61020. For a moment I breathed a sigh of relief: surely this was some program on the host, so the ball could be kicked over to the OS team. Of course that is not good enough; a responsible DBA cannot stop there, so the next step is to work out which program opened the sockets on this port.
...
tcp4       0      0  192.168.60.11.48817   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.24318   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.24319   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.48818   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.24320   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.48819   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.24321   192.168.60.11.61020   CLOSE_WAIT
tcp4       0      0  192.168.60.11.48820   192.168.60.11.61020   CLOSE_WAIT
...
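Rather than eyeballing pages of netstat output, the socket table can first be summarized by state and by foreign address. A small sketch against the same netstat -an format shown above:

# sockets per TCP state
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn

# CLOSE_WAIT sockets per foreign address (the trailing component of the address is the port on AIX)
netstat -an | awk '$6=="CLOSE_WAIT" {print $5}' | sort | uniq -c | sort -rn | head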
Below is a demonstration of how to locate the owning process from a socket port.
anbob1:/> netstat -aAn|grep CLOS*|head
f1000e003b08dbb8 tcp4       0      0  192.168.60.11.60433   192.168.60.11.61020   CLOSE_WAIT
f1000e00354a8bb8 tcp4       0      0  192.168.60.11.35934   192.168.60.11.61020   CLOSE_WAIT
f1000e0006fd5bb8 tcp4       0      0  192.168.60.11.11435   192.168.60.11.61020   CLOSE_WAIT
f1000e000546dbb8 tcp        0      0  *.*                   *.*                   CLOSED
f1000e0005527bb8 tcp4       0      0  *.*                   *.*                   CLOSED
f1000e0004bf93b8 tcp        0      0  *.*                   *.*                   CLOSED
f1000e0004becbb8 tcp4       0      0  *.*                   *.*                   CLOSED
f1000e0004b2ebb8 tcp        0      0  *.*                   *.*                   CLOSED
f1000e0004b34bb8 tcp4       0      0  *.*                   *.*                   CLOSED
f1000e00139d03b8 tcp4       0      0  192.168.60.11.35935   192.168.60.11.61020   CLOSE_WAIT

anbob1:/> kdb
           START              END <name>
0000000000001000 0000000004160000 start+000FD8
F00000002FF47600 F00000002FFDF9C8 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F1000F0A00000000 F1000F0A10000000 pvproc+000000
F1000F0A10000000 F1000F0A18000000 pvthread+000000
read vscsi_scsi_ptrs OK, ptr = 0x0
(0)> sockinfo f1000e003b08dbb8 tcpcb

---- TCPCB ----(@ F1000E003B08DBB8)----
    seg_next......@F1000E003B08DBB8  seg_prev......@F1000E003B08DBB8
    t_softerror... 00000000  t_state....... 00000005 (CLOSE_WAIT)
    t_timer....... 00000000 (TCPT_REXMT)   t_timer....... 00000000 (TCPT_PERSIST)
    t_timer....... 00000000 (TCPT_KEEP)    t_timer....... 00000000 (TCPT_2MSL)
    t_rxtshift.... 00000000  t_rxtcur...... 00000004  t_dupacks..... 00000000
    t_maxseg...... 000041CC  t_force....... 00000000
    t_flags....... 000003E0 (RFC1323|SENT_WS|RCVD_WS|SENT_TS|RCVD_TS|COPYFLAGS)
    t_oobflags.... 00000000 ()
    t_template....@0000000000000000  t_inpcb.......@F1000E003B08DAA0
    t_iobc........ 00000000  t_timestamp... FAC3B601
    snd_una....... 47F28D30  snd_nxt....... 47F28D30  snd_up........ 47F28D30
    snd_wl1....... 866BAB50  snd_wl2....... 47F28D30  iss........... 47F28B50
    snd_wnd....... 0000000000020E04  rcv_wnd....... 0000000000020E60
    rcv_nxt....... 866BAB51  rcv_up........ 866BAB50  irs........... 866BAB4F
    snd_wnd_scale. 00000002  rcv_wnd_scale. 00000002
    req_scale_sent 00000002  req_scale_rcvd 00000002
    last_ack_sent. 866BAB51  timestamp_rec. 5607037C  timestamp_age. 000E42DC
    rcv_adv....... 866DB9B1  snd_max....... 47F28D30
    snd_cwnd...... 0000000000010730  snd_ssthresh.. 000000003FFFC000
    t_idle........ 000042DC  t_rtt......... 00000000  t_rtseq....... 47F28CD7
    t_srtt........ 00000007  t_rttvar...... 00000004  t_rttmin...... 00000002
    max_rcvd...... 0000000000000000  max_sndwnd.... 0000000000020E60
    t_peermaxseg.. 000041CC  snd_in_pipe... 00000000  sack_data.....@0000000000000000
    snd_recover... 00000000  snd_high...... 47F28D30
    snd_ecn_max... 47F28D30  snd_ecn_clear. 47F28D31
    t_splice_with.@0000000000000000  t_splice_flags 00000000
    ...

---- SOCKET INFO ----(@ F1000E003B08D808)----
    type........ 0001 (STREAM)  opts........ 0000 ()  linger...... 0000
    state....... FFFF81A2 (ISCONNECTED|CANTRCVMORE|PRIV|NBIO)
    pcb.....@F1000E003B08DAA0  proto...@0000000004875628  lock....@F1000E003B959080
    head....@0000000000000000  q0......@0000000000000000  q.......@0000000000000000
    q0len....... 0000  qlen........ 0000  qlimit...... 0000  timeo....... 0000
    error....... 0000  special..... 0A08
    pgid.... 0000000000000000  oobmark. 0000000000000000
    snd:cc...... 0000000000000000  hiwat... 0000000000020E60
    mbcnt... 0000000000000000  mbmax... 0000000000083980  lowat... 0000000000003FFF
    mb......@0000000000000000  sel.....@0000000000000000  events...... 0000
    iodone.. 00000000  ioargs..@0000000000000000  lastpkt.@F1000E0019F10C00
    wakeone. FFFFFFFFFFFFFFFF  timer...@0000000000000000  timeo... 00000000
    flags....... 0000 ()  wakeup.. 00000000  wakearg.@0000000000000000
    lockwtg. FFFFFFFFFFFFFFFF
    ....  more

proc/fd: 3250/54028
proc/fd: fd: 54028
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS
pvproc+32C800 3250*osysmond ACTIVE 0B2037C 0000001 0000000C2DA42590   0 000C   ######<<<<<<<<<<

(0)> hcal 0B2037C
Value hexa: 00B2037C          Value decimal: 11666300

(0)> quit

anbob1:/> ps -ef|grep 11666300
    root 11666300        1   3   Apr 15      - 12970:57 /oracle/app/11.2.0.3/grid/bin/osysmond.bin
    root  1639464 19661020   0 14:10:21  pts/5  0:00 grep 11666300
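The kdb lookup above can also be scripted instead of being typed per socket. This is only a rough sketch, under the assumption that kdb accepts subcommands on stdin the same way it does interactively (run as root; field positions follow the netstat -aAn output above, and the PID it prints is hexadecimal, to be converted with hcal as shown earlier):

# walk the first few CLOSE_WAIT sockets to port 61020 and ask kdb which process owns each PCB
netstat -aAn | awk '$7=="CLOSE_WAIT" && $6 ~ /\.61020$/ {print $1}' | head -5 |
while read PCB
do
    echo "=== $PCB ==="
    echo "sockinfo $PCB tcpcb" | kdb | egrep "proc/fd:|ACTIVE"
done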
Note:
So here the ball comes back to us: osysmond.bin is an Oracle program. On AIX the process can also be located with lsof, but it is not shipped with this host and would have to be downloaded; it is just a single executable, so I tried to ftp it over:
[oracle@anbob1:/home/oracle]# ftp 133.96.60.19
ftp: connect: Resource temporarily unavailable
Note:
At that point even ftp was reporting that connect resources were unavailable, which shows the OS really had exhausted its TCP connections.
anbob1:/> lsof -i:61020
osysmond.  11666300  root  *738u  IPv4 0xf1000e0041ff4bb8  0t0  TCP anbob1-prv:34166->anbob1-prv:61020 (CLOSE_WAIT)
osysmond.  11666300  root  *739u  IPv4 0xf1000e0007091bb8  0t0  TCP anbob1-prv:35703->anbob1-prv:61020 (CLOSE_WAIT)
osysmond.  11666300  root  *740u  IPv4 0xf1000e004283bbb8  0t0  TCP anbob1-prv:32682->anbob1-prv:61020 (CLOSE_WAIT)
...
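Once lsof is available, confirming that a single process owns essentially all of the leaked sockets is a one-liner, for example:

# count CLOSE_WAIT sockets to port 61020 per owning command and PID
lsof -n -i tcp:61020 | awk '/CLOSE_WAIT/ {print $1, $2}' | sort | uniq -c | sort -rn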
TIP:
Next, searching MOS for "osysmond.bin netstat", the first hit is exactly what we are after: "Bug 15934834 osysmond.bin has lot of sockets open in close_wait state". The symptom matches completely: osysmond.bin leaks sockets. Unfortunately the MOS note lists Workaround: None, and for this release the bug is only fixed in 11.2.0.4.
“Which socket states from netstat are candidates for the rmsock command?
FIN_WAIT_1,FIN_WAIT_2, TIME_WAIT, CLOSING are all states where the application performed an active close of the socket.
How do we know a socket is no longer associated with a process?
You can use lsof to check. It is highly recommended that you use netstat in conjunction with lsof to validate the situation prior to doing anything drastic.
Which socket states are not candidates for the rmsock command?
CLOSE_WAIT, CLOSED
”
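For the states that do qualify, an orphaned socket can be dropped with rmsock instead of killing anything. A hedged example (run as root; the PCB address below is a placeholder, taken from the first column of netstat -aAn on your own system):

# list candidate sockets and their protocol control block addresses (first column)
netstat -aAn | egrep "FIN_WAIT_2|TIME_WAIT"

# remove one by its PCB address; "tcpcb" tells rmsock what kind of address it is
rmsock f1000e0012345bb8 tcpcb      # placeholder address, not from this system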
Stopping the osysmond.bin process releases those sockets. My suggestion here would be to restart the instance and CRS on this node and then keep observing, but osysmond.bin is a monitoring process guarded by an agent, so that day we tried the least invasive change and killed osysmond.bin at the OS level. The agent restarted it automatically and the instance was not affected; once it was killed the sockets were released and the listener returned to normal. Afterwards, however, we noticed the number of sockets opened by osysmond.bin was trending upward again, so the next time we have a maintenance window we will try restarting CRS.
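Since the count started climbing again after the kill, it is worth trending the socket usage before committing to a CRS restart. A minimal sketch (the port follows this case; the log file location is my own choice):

# sample the number of sockets to port 61020 every 5 minutes
while true
do
    CNT=`netstat -an | grep 61020 | wc -l`
    echo "`date '+%Y-%m-%d %H:%M:%S'` sockets_to_61020=$CNT" >> /tmp/osysmond_socket_trend.log
    sleep 300
done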
Summary:
This problem was caused by an Oracle bug that makes osysmond.bin leak TCP sockets until the OS runs out of TCP socket resources, at which point FTP and the listener start misbehaving. Because sockets were occasionally released, the listener appeared to work intermittently, which is exactly the behaviour we saw.
–2015-7-29 updated
Restarting CRS does not avoid the problem either; on 11.2.0.3, the 11.2.0.3.7 patch fixes this bug.
Patch description: “Grid Infrastructure Patch Set Update : 11.2.0.3.7 (16742216)”
Created on 24 Jun 2013, 22:52:45 hrs PST8PDT
Bugs fixed:
16619898, 16315641, 15876003, 14275572, 13919095, 13696251, 13348650
12659561, 14305980, 14277586, 13987807, 14625969, 13825231, 12794268
13000491, 13498267, 15934834, 11675721, 14082976, 12771830, 14515980
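After applying the PSU, it can be double-checked that the fix for bug 15934834 is actually present in the Grid home, since opatch lsinventory already lists the bugs fixed by each installed patch:

# run as the grid software owner
/oracle/app/11.2.0.3/grid/OPatch/opatch lsinventory | grep 15934834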
Until the patch is applied, a shell script scheduled with crontab can be used to check periodically:
#!/bin/sh
###############################################################
## file: kill_bug15934834.sh
##
## this scripts to clear connect tcp socket leak by osysmond.bin process
## Temporary workaround fixed bug 15934834
## author: weejar (anbob.com)
## date: 2015-7-29
## note: run as root, after kill osysmond process, it will automatically restart
##
################################################################

# logfile
LOGFILE=/oracle/lw/kill_bug15934834.log

# number of connect socket
NUM=`netstat -an|grep 61020|wc -l`

PSINFO=`ps -ef|grep "grid/bin/osysmond.bin"|grep -v grep`

# pid of osysmond.bin
PID=`echo $PSINFO|awk '{print $2}'`

echo "zzz ***"`date '+%a %b %e %T %Z %Y'` >> $LOGFILE
echo "Number of socket osysmond:$NUM" >> $LOGFILE

if [ $NUM -ge 50000 ]
then
    echo `ps -ef|head -n 1` >> $LOGFILE
    echo $PSINFO >> $LOGFILE

    if [[ "" != "$PID" ]]; then
        echo "Warning: osysmond.bin pid $PID will be kill..." >> $LOGFILE
        kill -9 $PID
        if [ $? -ne 0 ]; then
            echo "Error: osysmond.bin pid $PID kill failed!" >> $LOGFILE
        fi
    fi
fi
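A root crontab entry then runs the check on a schedule, for example every ten minutes (the script location is an assumption; place it wherever you keep such scripts):

# crontab -e as root; AIX cron does not understand */10, so list the minutes explicitly
0,10,20,30,40,50 * * * * /oracle/lw/kill_bug15934834.sh >/dev/null 2>&1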