Troubleshooting errors caused by OS resource limits on AIX, HP-UX, Solaris and Linux
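OS resource limits can prevent applications from forking new processes or opening files. This causes connection creation to fail, or can even crash the instance. It is most likely after the database processes parameter has been raised substantially while the OS kernel resource limits configured at installation time were not raised to match. On Linux, shell commands may report "fork: retry: No child processes", and new database connections may fail with ORA-12518 or ORA-12537.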
CASE 1: HP-UX 11.31, creating new connections through the listener fails
TNS-12518: TNS:listener could not hand off client connection
TNS-12536: TNS:operation would block
TNS-12560: TNS:protocol adapter error
TNS-00506: Operation would block
HPUX Error: 246: Operation would block
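The errors above show that an OS process limit has been reached. Check the nproc and maxuprc kernel tunables. The values recommended for an Oracle environment are: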
Parameter | NODE1 | Oracle recommended value
aio_max_ops | 8192 | >= 2048
executable_stack | 0 | 0
filecache_min | 3% | 5%
filecache_max | 5% | 10%
ksi_alloc_max | 131072 | >= nproc*8
max_async_ports | 16384 | >= nproc
max_thread_proc | 1200 | >= 1024
maxdsiz | 1073741824 | >= 1073741824
maxdsiz_64bit | 137438953472 | >= 2147483648
maxssiz | 134217728 | >= 134217728
maxssiz_64bit | 2147483648 | >= 1073741824
maxuprc | 20000 | >= ((nproc*9)/10)+1
msgmni | 16384 | >= nproc
msgtql | 16384 | >= nproc
ncsize | 134144 | 8*nproc+3072
nflocks | 16384 | >= nproc
Check current resource usage with kcusage:
oracle@anbob:/home/oracle> kcusage
Tunable                 Usage / Setting
=============================================
filecache_max     33312395264 / 39194697728
maxdsiz             352256000 / 1073741824
maxdsiz_64bit       239075328 / 137438953472
maxfiles_lim            23564 / 65535
maxssiz                131072 / 134217728
maxssiz_64bit         2097152 / 2147483648
maxtsiz              13484032 / 100663296
maxtsiz_64bit       771751936 / 1073741824
maxuprc                 15551 / 16384
max_thread_proc           385 / 1200
msgmbs                      0 / 8
msgmni                      2 / 16384
msgtql                      0 / 16384
nflocks                   101 / 16384
ninode                  10403 / 1157120
nkthread                18576 / 28688
nproc                   16382 / 21000
npty                        2 / 60
nstrpty                    12 / 60
nstrtel                     0 / 60
nswapdev                    2 / 32
nswapfs                     0 / 32
semmni                    116 / 1024
semmns                  22095 / 307200
shmmax           161061273600 / 274877906944
shmmni                     39 / 4096
shmseg                      4 / 512
CASE 2: AIX, errors while the application is running
ORA-04030: (TCHK^9d12ad4,eavp:kkestRCHistgrm)
The call stack contains "kghnospc", i.e. kernel generic heap manager: no space available in the heap, signal an error.
Trace dump:
=======================================
PRIVATE MEMORY SUMMARY FOR THIS PROCESS
---------------------------------------
******************************************************
PRIVATE HEAP SUMMARY DUMP
111 MB total:                                # PGA used by this process: 111 MB
   111 MB commented, 605 KB permanent
    47 KB free (0 KB in empty extents),
      103 MB,   1 heap:    "session heap   "
------------------------------------------------------
Summary of subheaps at depth 1
110 MB total:
    35 MB commented, 109 KB permanent
    75 MB free (30 MB in empty extents),
      45 MB,   1 heap:    "kolr heap ds i "            44 MB free held
      28 MB,   3 heaps:   "koh dur heap d "            1056 KB free held
-------------------------
Top 10 processes:
-------------------------
(percentage is of 1697 MB total allocated memory)
 7% pid 201: 111 MB used of 112 MB allocated   # CURRENT PROC: the current process is the top consumer, 111 MB
 4% pid 204: 56 MB used of 63 MB allocated (5696 KB freeable)
 4% pid 13: 60 MB used of 62 MB allocated
 4% pid 14: 59 MB used of 62 MB allocated
 3% pid 202: 53 MB used of 59 MB allocated (6016 KB freeable)
 3% pid 200: 52 MB used of 58 MB allocated (5824 KB freeable)
 3% pid 40: 52 MB used of 56 MB allocated (832 KB freeable)
 3% pid 37: 41 MB used of 55 MB allocated
 3% pid 12: 50 MB used of 52 MB allocated
 3% pid 173: 42 MB used of 49 MB allocated (5888 KB freeable)
================
SWAP INFORMATION
----------------
swap info: free_mem = 22096.49M rsv = 192.00M alloc = 112.52M avail = 49152.00M swap_free = 49039.48M
----- End of Customized Incident Dump(s) -----
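This is similar to a case reported on ITPUB. Check the PGA settings (_pga_max_size and _smm_max_size), the OS limits shown by "ulimit -a", and the limits of the running process itself.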
On AIX, dbx can show the limits of a running process. Attach to a LOCAL=NO server process or to the listener process:
# dbx -a [pid]
Type 'help' for help.
reading symbolic information ...
stopped in read at 0x90000000003c260 ($t1)
0x90000000003c260 (read+0x260) e8410028 ld r2,0x28(r1)
(dbx) proc rlimit
rlimit name:          rlimit_cur               rlimit_max       (units)
 RLIMIT_CPU:         (unlimited)             (unlimited)          sec
 RLIMIT_FSIZE:       (unlimited)             (unlimited)        bytes
 RLIMIT_DATA:          134217728             (unlimited)        bytes
 RLIMIT_STACK:          33554432              4294967296        bytes
 RLIMIT_CORE:        (unlimited)             (unlimited)        bytes
 RLIMIT_RSS:            33554432             (unlimited)        bytes
 RLIMIT_AS:          (unlimited)             (unlimited)        bytes
 RLIMIT_NOFILE:           100000             (unlimited)  descriptors
 RLIMIT_THREADS:     (unlimited)             (unlimited)  per process
 RLIMIT_NPROC:       (unlimited)             (unlimited)     per user
(dbx)
Note:
In the dbx rlimit output, rlimit_cur is the soft limit and rlimit_max is the hard limit. rlimit_cur is the limit that is actually enforced, so the problem may persist if rlimit_cur is not unlimited, even though rlimit_max is. In that case the instance may need to be restarted for rlimit_cur to take on the new value.
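Processes that are not created through the listener are not subject to this limit, because they inherit the oracle user's limits. Server processes created through the listener depend on the listener's limits, and the listener in turn inherits the limits of whoever started it. A listener started by OHASD (CRS) inherits root's limits. A listener started manually as grid inherits the grid user's limits, so check those. Also note that after OS limits are changed, processes that have not been restarted keep the old limits.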
CASE 3: Solaris (SunOS), ORA-4030 caused by insufficient swap
Use plimit to view a process's limits:
# plimit [PID]

Sort the OSWatcher vmstat samples by the swap column (printing timestamp, swap, free; sorted in descending order, so the lowest values are at the bottom):
$ awk '/^zzz/{t=$5;next}/^\s*[0-9]/{print t,$4,$5}' xxxxxx_vmstat_16.10.31.1500.dat | sort -k2,2rn

How to Configure Swap Space (Doc ID 286388.1) recommends swap space = 75% of OS memory.

How does the Solaris Operating System Calculate Available Swap? (Doc ID 1010585.1):
When a process calls the malloc()/sbrk() commands, only virtual swap is allocated. The operating system allocates the memory from physical disk-based swap first. If disk-based swap is exhausted or unconfigured, the reservation is allocated from physical memory. If both resources are exhausted then the malloc() call fails. To ensure malloc() won't fail due to lack of virtual swap, configure a large physical disk-based swap facility in the form of a device or swapfile. You can monitor swap reservation via "swap -s" and "vmstat:swap", as described above. Follow the guidelines below to calculate amount of virtual swap usage:
Virtual swap = Physical Memory + Fixed Disk swap
CASE 4: Linux, running out of processes, for example after installing the OEM agent
On Linux there are several ways to check the limits of a running process, for example through the /proc filesystem:
$ cat /proc/PID/limits
nproc is defined at the OS level to limit the number of processes per user. The Oracle 11.2.0.4 documentation recommends the following:
oracle soft nproc 2047
oracle hard nproc 16384
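This may be a bit low if the OEM agent is running. To check whether you are exceeding the limit, you can use ps. Note, however, that by default ps does not show every process. On Linux, each thread of a multithreaded program is implemented as a lightweight process (LWP), and you need "-L" to see them all. For example, to count them per user: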
$ ps h -Led -o user | sort | uniq -c | sort -n
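Without "-L", you can also use "ps -o nlwp,pid,lwp,args -u oracle | sort -n", where the nlwp column shows the number of threads in each process. In some environments the Oracle 12c EM agent starts more than 1,000 threads. Once the nproc limit is reached, the user can no longer create new processes. The clone() call returns EAGAIN, which Oracle reports as: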
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
Below is a short fork test program by Franck, slightly modified, used to simulate the problem:
[root@anbob ~]# ps h -Led -o user | sort | uniq -c | sort -n
      1 chrony
      1 dbus
      1 oracle
      1 rpc
      7 polkitd
    133 root

[oracle@oel7db1 ~]$ ulimit -u
500
[oracle@oel7db1 ~]$ cat fockp.c
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/resource.h>

int main( int argc, char *argv[] )
{
  int i;
  pid_t p[3001];   /* indexes 1..3000 are used */
  // get nproc limit
  struct rlimit rl;
  if ( getrlimit( RLIMIT_NPROC , &rl) != 0 ) {
    printf("getrlimit() failed with errno=%d\n", errno);
    return 255;
  }
  // fork 3000 times
  for( i=1 ; i<= 3000 ; i++ ) {
    p[i] = fork();
    if ( p[i] >= 0 ) {
      if ( p[i] == 0 ) {
        /* fork() returned 0: this is the new child; it keeps looping and
           forks the next process, so the processes form a chain */
        printf("parent says fork number %d sucessful \n" , i );
      } else {
        /* fork() returned the child's pid: this process sleeps, then stops */
        printf(" child says fork number %d pid %d \n" , i , (int) p[i] );
        sleep(100);
        break;
      }
    } else {
      printf("parent says fork number %d failed (nproc: soft=%llu hard=%llu) with errno=%d\n",
             i, (unsigned long long) rl.rlim_cur, (unsigned long long) rl.rlim_max, errno);
      return 255;
    }
  }
  return 0;
}

Compile it (for example with gcc -o fockp fockp.c) and run it:
[oracle@anbob ~]$ ./fockp
child says fork number 1 pid 2442
parent says fork number 1 sucessful
child says fork number 2 pid 2443
parent says fork number 2 sucessful
child says fork number 3 pid 2444
parent says fork number 3 sucessful
child says fork number 4 pid 2445
parent says fork number 4 sucessful
...
parent says fork number 497 sucessful
child says fork number 498 pid 2941
parent says fork number 498 sucessful
parent says fork number 499 failed (nproc: soft=500 hard=500) with errno=11

Check as root:
[root@anbob ~]# ps h -Led -o user | sort | uniq -c | sort -n
      1 chrony
      1 dbus
      1 rpc
      7 polkitd
    133 root
    500 oracle
On early Linux releases this limit was set in /etc/limits.conf and checked with 'ulimit -u'. According to the Red Hat documentation, on RHEL 5 through 8 it is set in /etc/security/limits.conf:
How to set ulimit values
Environment: Red Hat Enterprise Linux (RHEL) 5, 6, 7, 8
Issue: How to set ulimit values
Resolution: Settings in /etc/security/limits.conf take the following form:

# vi /etc/security/limits.conf
#<domain>      <type>  <item>         <value>
#
*               -       core             <value>
*               -       data             <value>
*               -       priority         <value>
*               -       fsize            <value>
*               soft    sigpending       <value> eg:57344
*               hard    sigpending       <value> eg:57444
*               -       memlock          <value>
*               -       nofile           <value> eg:1024
*               -       msgqueue         <value> eg:819200
*               -       locks            <value>
*               soft    core             <value>
*               hard    nofile           <value>
@<group>        hard    nproc            <value>
<user>          soft    nproc            <value>
%<group>        hard    nproc            <value>
<user>          hard    nproc            <value>
@<group>        -       maxlogins        <value>
<user>          hard    cpu              <value>
<user>          soft    cpu              <value>
<user>          hard    locks            <value>

<domain> can be:
- a user name
- a group name, with @group syntax
- the wildcard *, for default entry
- the wildcard %, can be also used with %group syntax, for maxlogin limit

<type> can have two values:
- soft for enforcing the soft limits
- hard for enforcing hard limits

<item> can be one of the following:
- core - limits the core file size (KB)
- data - max data size (KB)
- fsize - maximum filesize (KB)
- memlock - max locked-in-memory address space (KB)
- nofile - max number of open files
- rss - max resident set size (KB)
- stack - max stack size (KB)
- cpu - max CPU time (MIN)
- nproc - max number of processes (see note below)
- as - address space limit (KB)
- maxlogins - max number of logins for this user
- maxsyslogins - max number of logins on the system
- priority - the priority to run user process with
- locks - max number of file locks the user can hold
- sigpending - max number of pending signals
- msgqueue - max memory used by POSIX message queues (bytes)
- nice - max nice priority allowed to raise to values: [-20, 19]
- rtprio - max realtime priority

Exit and re-login from the terminal for the change to take effect.
Another Red Hat article, "Setting nproc in /etc/security/limits.conf has no effect in Red Hat Enterprise Linux", covers the case where an nproc entry configured there does not take effect:
Resolution
Add the desired entry in /etc/security/limits.d/90-nproc.conf instead of /etc/security/limits.conf.
Root Cause
For limits, the PAM stack is moving to a modular configuration. This includes the introduction of /etc/security/limits.d/90-nproc.conf, which sets the maximum number of processes to 1024 for non-root users. This was done in part to prevent fork-bombs.
After reading /etc/security/limits.conf, individual files from the /etc/security/limits.d/ directory are read. Only files with *.conf extension will be read from this directory.
So if you configure the Oracle environment with the Oracle preinstall RPM, you will find that it writes its settings to /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf, which overrides /etc/security/limits.conf.
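One advantage of Linux is that you can control almost everything about it. This gives administrators fine-grained control over their systems and better use of system resources. You can even change the limits of a program that is already running, which is useful for application servers that cannot be restarted. On Linux systems with kernel >= 2.6.36 and util-linux >= 2.21, you can use the prlimit command to set a process's resource limits (somewhat like plimit on Solaris).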
The following demonstrates how to change the limits of a running program, first with gdb and then with prlimit:
[root@anbob ~]# ps -ef|grep lsnr
oracle   15837     1  0 00:02 ?        00:00:00 /u01/app/oracle/product/19.2.0/db_1/bin/tnslsnr LISTENER -inherit
root     16128 16100  0 00:07 pts/1    00:00:00 grep --color=auto lsnr
[root@anbob ~]# cat /proc/15837/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             33554432             bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             16384                16384                processes
Max open files            65536                65536                files
Max locked memory         137438953472         137438953472         bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       14595                14595                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

[oracle@anbob ~]$ gdb
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) attach 15837
Attaching to process 15837
(gdb) set $rlim = &{0ll, 0ll}
(gdb) print getrlimit(7, $rlim)
$1 = 0
(gdb) print *$rlim
$2 = {65536, 65536}

TIP: resource numbers for getrlimit()/setrlimit() (x86_64 Linux, same order as /proc/PID/limits):
 0  Max cpu time
 1  Max file size
 2  Max data size
 3  Max stack size
 4  Max core file size
 5  Max resident set
 6  Max processes
 7  Max open files
 8  Max locked memory
 9  Max address space
10  Max file locks
11  Max pending signals
12  Max msgqueue size
13  Max nice priority
14  Max realtime priority
15  Max realtime timeout

# modify with gdb
(gdb) set *$rlim[0] = 1024*4
(gdb) print *$rlim
$3 = {4096, 65536}
(gdb) print setrlimit(7, $rlim)
$4 = 0

[root@anbob ~]# cat /proc/15837/limits|grep "open files"
Limit                     Soft Limit           Hard Limit           Units
Max open files            4096                 65536                files
[root@anbob ~]# prlimit --nofile --output RESOURCE,SOFT,HARD --pid 15837
RESOURCE SOFT  HARD
NOFILE   4096 65536

# modify with prlimit
[root@anbob ~]# prlimit --nofile=1024:8192 --pid 15837
[root@anbob ~]# cat /proc/15837/limits |grep "open files"
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 8192                 files

(gdb) print getrlimit(7, $rlim)
$5 = 0
(gdb) print *$rlim
$6 = {1024, 8192}
(gdb)
Note:
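Resource limits come in two kinds, soft and hard. The soft limit is the limit that is actually enforced. The hard limit is only the ceiling up to which the soft limit can be raised with the ulimit command.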