首页 » Cloud, ORACLE 9i-23ai » Alert: Oracle RAC最大进程数限制受UDP port range影响

Alert: Oracle RAC最大进程数限制受UDP port range影响

几年前测试oracle RAC的节点间UDP通信《The FG(server process) and remote node LMSn process communication over the interconnect?(用户进程会和另一节点的LMS进程直接通信么?)》测试过节点间存在Server进程与LMS的udp连接,使用的是HAIP(169.254.*.*), 而Linux操作系统的网络端口可用范围net.ipv4.ip_local_port_range 参数控制,适用于TCP和UDP,最大值是65535. 如果RAC中就一个private network 网卡,假设不排除所有进程都和某一个LMS进程通信如LMS1,LMS1分配1个IP addr+UDP port, 那FG进程的上限就是net.ipv4.ip_local_port_range /单个FG进程打开的UDP个数。

参数net.ipv4.ip_local_port_range 

net.ipv4.ip_local_port_range该参数定义了系统自动分配的本地端口的范围。一般用于客户端应用程序在发起连接时操作系统自动选择的源端口。对于同一个协议,如UDP网络链接个数是source IP,PORT 与target IP, PORT四元组的组合, net.ipv4.ip_local_port_range 定义网络连接可用作其源(本地)端口的最小和最大端口的限制, Oracle建议值范围是9000   65500,减去一些系统服务保留或使用给Oracle RAC用户进程可能不足50000。 net.ipv4.tcp_tw_reuse参数用于控制 TCP 连接处于 TIME_WAIT 状态时的重用与回收,减少系统资源占用,提升端口的复用效率。

BTW, 在 AIX 系统中,网络端口的可用范围也可以通过系统参数进行控制,尽管这些参数和 Linux 系统的配置稍有不同。tcp_ephemeral_lowtcp_ephemeral_high这两个参数控制 AIX 系统中 TCP 协议的临时端口(也称为短暂端口、随机端口)的范围。类似于 Linux 中的 ip_local_port_range 参数.udp_ephemeral_lowudp_ephemeral_high这两个参数与 tcp_ephemeral_lowtcp_ephemeral_high 类似,但用于控制 UDP 协议的临时端口范围。

在Oracle RAC 多节点的环境下,可能存在用户进程跨节点GC 访问, 一个FG server进程可能和多个LMS和LMD进程通信,1个LMS也会和多个FG server进程通信,测试新创建一个用户连接,初始化似乎是HAIP的2个UDP port, 后期还会增加(猜测和LMS Server数量有关)。那样算很可能单个节点的最大进程数在20000就有出现UDP port不足的风险。看来多个HAIP还可以提升这个限制。 那是不是真的这样呢?

 

案例ORA-27530: IPC Ephemeral ports on IP address 169.254.*.*  are exhausted.
环境Oracle 19C 2节点环境,进程数近2w. LMS进程数12。 下面的诊断使用了tannelpoder的脚本包, 最初数据库现象为library cache lock。

SQL> @ash/ashtop sql_id,top_level_sql_id  "event='library cache lock'" sysdate-1/24 sysdate

    Total                                                                                         Distinct Distinct
  Seconds     AAS %This   SQL_ID          TOP_LEVEL_SQL FIRST_SEEN          LAST_SEEN           Execs Seen  Tstamps
--------- ------- ------- --------------- ------------- ------------------- ------------------- ---------- --------
    26646     7.4  100% |                               2024-03-27 08:41:02 2024-03-27 09:36:14          1      396
        5      .0    0% |                 865qwpcdyggkk 2024-03-27 08:55:02 2024-03-27 09:30:02          1        5
        1      .0    0% | 2jsd20a6n9jzc   2jsd20a6n9jzc 2024-03-27 09:10:54 2024-03-27 09:10:54          1        1

SQL> @ash/ashtop p3 "event='library cache lock'" sysdate-1/24 sysdate


    Total                                                                                Distinct Distinct
  Seconds     AAS %This                     P3 FIRST_SEEN          LAST_SEEN           Execs Seen  Tstamps
--------- ------- ------- -------------------- ------------------- ------------------- ---------- --------
    26646     7.4  100% |              8323074 2024-03-27 08:56:03 2024-03-27 09:54:02          1      397
        7      .0    0% |              8978434 2024-03-27 09:02:04 2024-03-27 09:45:02          1        7
        6      .0    0% |              8323075 2024-03-27 08:55:02 2024-03-27 09:50:15          1        6
        6      .0    0% |              9043971 2024-03-27 08:59:00 2024-03-27 09:44:38          1        6
        2      .0    0% | 18446744069423628291 2024-03-27 08:57:58 2024-03-27 08:59:07          1        2
        1      .0    0% | 18446744069421400066 2024-03-27 09:10:54 2024-03-27 09:10:54          1        1

SQL> select to_char('8323074','xxxxxxxxxxxx') from dual;
----------------
7f0002


SQL> @hex 7F

                                DEC                  HEX
----------------------------------- --------------------
                         127.000000                   7F

SQL> SELECT indx,kglstdsc FROM x$kglst  WHERE kglsttyp='NAMESPACE' and indx=127;

                INDX KGLSTDSC
-------------------- ----------------------------------------------------------------
                 127 Last_Successful_Logon_Time

对于这个LSLT特性我写过好几篇就不再详说了,有兴趣可以去看:

对于登录相关的 LAST SUCCESSFUL LOGIN TIME的Library cache lock又出新参数

Oracle 12c New Feature: Last Login Time for Non-Sys Users

Oracle19c 建议的 One-off patch之disable LAST SUCCESSFUL LOGIN TIME

分析堵塞者会话

SQL> @ash/ashtop p3,BLOCKING_INST_ID,blocking_session "event='library cache lock'" sysdate-1/24 sysdate

    Total                                                                                                                          Distinct Distinct
  Seconds     AAS %This                     P3     BLOCKING_INST_ID     BLOCKING_SESSION FIRST_SEEN          LAST_SEEN           Execs Seen  Tstamps
--------- ------- ------- -------------------- -------------------- -------------------- ------------------- ------------------- ---------- --------
    24794     6.9   93% |              8323074                    1                20903 2024-03-27 09:23:02 2024-03-27 09:26:25          1      201
     1817      .5    7% |              8323074                                           2024-03-27 09:07:53 2024-03-27 10:00:46          1      167
        6      .0    0% |              8978434                                           2024-03-27 09:08:36 2024-03-27 09:45:02          1        6

SQL> @ash/ashtop sql_id,event "inst_id=1 and session_id=20903"  "to_date('2024-03-27 09:23:02','yyyy-mm-dd hh24:mi:ss')"  "to_date('2024-03-27 09:26:25','yyyy-mm-dd hh24:mi:ss')"

    Total                                                                                                                      Distinct Distinct
  Seconds     AAS %This   SQL_ID          EVENT                                      FIRST_SEEN          LAST_SEEN           Execs Seen  Tstamps
--------- ------- ------- --------------- ------------------------------------------ ------------------- ------------------- ---------- --------
      195     1.0   98% | 9zg9qd9bm4spu                                              2024-03-27 09:23:02 2024-03-27 09:26:24          1      195
        3      .0    2% | 9zg9qd9bm4spu   log file sequential read                   2024-03-27 09:23:57 2024-03-27 09:26:05          1        3
        1      .0    1% | 9zg9qd9bm4spu   KSV master wait                            2024-03-27 09:23:04 2024-03-27 09:23:04          1        1
        1      .0    1% | 9zg9qd9bm4spu   control file sequential read               2024-03-27 09:23:05 2024-03-27 09:23:05          1        1

SQL> select sql_fulltext from v$sqlarea where sql_id='&sql_id'; 
Enter value for sql_id: 9zg9qd9bm4spu

SQL_FULLTEXT
---------------------------------------------------------------------------------------------------------------- 
update user$ set spare6=DECODE(to_char(:2, 'YYYY-MM-DD'), '0000-00-00', to_date(NULL), :2) where user#=:1

note:
堵塞会话20903 ,最后两列说是只执行了1次 持续了201秒, 其中这个SQL ID 195秒 on cpu, 3秒的读redo, 1秒等待后台进程通信和1秒的控制文件读。 此时可以查看进程的后台trace 。

DB ALERT LOG 文件

2024-03-27T09:23:02.786954+08:00
Errors in file /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_105683.trc  (incident=4576016):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27530: IPC Ephemeral ports on IP address 169.254.10.xx are exhausted.  
ORA-27300: OS system dependent operation:bind failed with status: 98
ORA-27301: OS failure message: Address already in use
ORA-27302: failure occurred at: sskgxpsock
Incident details in: /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_4576016/anbob1_ora_105683_i4576016.trc
2024-03-27T09:23:04.461321+08:00
*****************************************************************
An internal routine has requested a dump of selected redo.
This usually happens following a specific internal error, when
analysis of the redo logs will help Oracle Support with the
diagnosis.
It is recommended that you retain all the redo logs generated (by
all the instances) during the past 12 hours, in case additional
redo dumps are required to help with the diagnosis.
*****************************************************************

与前同的wait event对应。

进程TRACE

Dump file /oracle/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_4576016/anbob1_ora_105683_i4576016.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.9.0.0.0
Build label:    RDBMS_19.9.0.0.0DBRU_LINUX.X64_200930
ORACLE_HOME:    /oracle/app/oracle/product/19c/db_1
System name:	Linux
Node name:	ldyya1
Release:	3.10.0-957.el7.x86_64
Version:	#1 SMP Thu Oct 4 20:48:51 UTC 2018
Machine:	x86_64
Instance name: anbob1
Redo thread mounted by this instance: 1
Oracle process number: 10022
Unix process pid: 105683, image: oracle@
*** 2024-03-27T09:23:02.789465+08:00
*** SESSION ID:(20903.55558) 2024-03-27T09:23:02.789490+08:00 
*** CLIENT ID:() 2024-03-27T09:23:02.789502+08:00
*** SERVICE NAME:(anbob.XXXXXX.com) 2024-03-27T09:23:02.789512+08:00
*** MODULE NAME:(sqlplus@XXX-4 (TNS V1-V3)) 2024-03-27T09:23:02.789523+08:00
*** ACTION NAME:() 2024-03-27T09:23:02.789536+08:00
*** CLIENT DRIVER:(SQL*PLUS ) 2024-03-27T09:23:02.789547+08:00
[TOC00000]
Jump to table of contents
Dump continued from file: /oracle/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_105683.trc
[TOC00001]
ORA-00603: ORACLE server session terminated by fatal error
ORA-27530: IPC Ephemeral ports on IP address 169.254.10.76 are exhausted.
ORA-27300: OS system dependent operation:bind failed with status: 98
ORA-27301: OS failure message: Address already in use
ORA-27302: failure occurred at: sskgxpsock
*** 2024-03-27T09:23:02.790567+08:00
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
[TOC00003]
----- Current SQL Statement for this session (sql_id=9zg9qd9bm4spu) -----
update user$ set spare6=DECODE(to_char(:2, 'YYYY-MM-DD'), '0000-00-00', to_date(NULL), :2) where user#=:1
[TOC00003-END]

[TOC00004]
----- Call Stack Trace -----
calling              call     entry               
location             type     point               
-------------------- -------- --------------------
ksedst1()+95         call     kgdsdst()           
ksedst()+58          call     ksedst1()           
dbkedDefDump()+1754  call     ksedst()            
ksedmp()+244         call     dbkedDefDump()      
dbgexPhaseII()+2092  call     ksedmp()            
dbgexProcessError()  call     dbgexPhaseII()      
+1867                                             
dbgePostErrorKGE()+  call     dbgexProcessError() 
1853                                              
dbkePostKGE_kgsf()+  call     dbgePostErrorKGE()  
71                                                
kgeade()+393         call     dbkePostKGE_kgsf()  
kgefecl()+184        call     kgeade()            
ksmsq_handle_oseerr  call     kgefecl()           
()+748                                            
ksmsq_create()+1056  call     ksmsq_handle_oseerr 
                              ()                  
ipc_recreate_que()+  call     ksmsq_create()      
635                                               
kcbgcur()+71153      call     ipc_recreate_que()  
ktbgcur()+112        call     kcbgcur()           
kdddgb()+788         call     ktbgcur()           
kdusru()+458         call     kdddgb()            
kauupd()+356         call     kdusru()            
updrow()+1699        call     kauupd()            
qerupUpdRow()+671    call     updrow()            
qerupFetch()+595     call     qerupUpdRow()       
updaul()+1416        call     qerupFetch()        
updThreePhaseExe()+  call     updaul()            
340                                               
updexe()+443         call     updThreePhaseExe()  
opiexe()+11927       call     updexe()            
kpoal8()+2387        call     opiexe()            
opiodr()+1202        call     kpoal8()            
kpoodrc()+40         call     opiodr()            
rpiswu2()+2004       call     kpoodrc()           
kpoodr()+660         call     rpiswu2()           
upirtrc()+2760       call     kpoodr()            
kpurcsc()+100        call     upirtrc()           
kpuexec()+10017      call     kpurcsc()           
OCIStmtExecute()+41  call     kpuexec()           
kzulsltUpdate()+820  call     OCIStmtExecute()    
kglHandleFlush()+25  call     kzulsltUpdate()     
3                                                 
kglfls1()+310        call     kglHandleFlush()    
ktcCommitTxn_new()+  call     kglfls1()           
5246                                              
ktcCommitTxn()+94    call     ktcCommitTxn_new()  
kziaUpdateLSLT()+72  call     ktcCommitTxn()      
0                                                 
kziaulslt()+199      call     kziaUpdateLSLT()    
kpoauth()+3911       call     kziaulslt()         
opiodr()+1202        call     kpoauth()           
ttcpip()+1222        call     opiodr()            
opitsk()+1900        call     ttcpip()            
opiino()+936         call     opitsk()            
opiodr()+1202        call     opiino()            
opidrv()+1094        call     opiodr()            
sou2o()+165          call     opidrv()            
opimai_real()+422    call     sou2o()             
ssthrdmain()+417     call     opimai_real()       
main()+256           call     ssthrdmain()        
__libc_start_main()  call     main()              
+245                                              
_start()+41          call     __libc_start_main()

分析OS 使用

$ oerr ora  27530
27530, 00000, "IPC Ephemeral ports on IP address %s are exhausted."
// *Cause:   The ephemeral ports on the IP address assigned to the instance 
//           were exhausted.
// *Action:  Increase the ephemeral port range.
// 27531, 00000, "<>"
// *Cause:
// *Action:

# netstat -nlup|grep 169.254.10.xx|wc -l
36997

# netstat -nlup|grep 169.254.10.xx|awk '{print $6}'|sort|uniq -c |sort -rn|grep oracle
      5 97351/oracleanbob1
      5 69254/oracleanbob1
      4 99503/oracleanbob1
      4 99004/oracleanbob1
      4 98295/oracleanbob1
      4 98161/oracleanbob1
      4 96358/oracleanbob1
      4 96279/oracleanbob1
      4 96234/oracleanbob1
      4 95442/oracleanbob1
      4 94697/oracleanbob1
      4 93919/oracleanbob1
      4 93516/oracleanbob1
      4 92649/oracleanbob1
      4 92586/oracleanbob1
...

Note:
有使用4个udp port的情况。

Summary:
可见UDP可用Port范围也限制了数据库的进程数量,通常进程数控制在2*CPU core是个健康的值, 如果一味的给应用服务器放行,配置不合里的连接池,当真有问题时那就是瞬间雪崩,不会给处理的时间,进程数起不到保护数据库的作用。目前看除了内存、操作系统最大进程数(maxuproc)、最大PID(pid_max)、CPU资源外,UDP网络端口范围也是一个参考指标, 当然进程增加还可能需要增加最大打开文件个数参数(e.g.nofiles )、信号量 等级联影响,Oracle可能都没想到我们会有这样不合理的的用户场景吧?

打赏

, ,

对不起,这篇文章暂时关闭评论。