Troubleshooting an Out-Of-Memory (OOM) killer database crash when memory is exhausted
In an Oracle database environment, the database instance terminated due to the death of a background process; in this case it was DBWR. The alert log and trace files contained no further information about why the DBWR process died.
# db alert log
Warning: VKTM detected a time drift. Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Tue Apr 23 08:54:27 2019
WARNING: Heavy swapping observed on system in last 5 mins.
pct of memory swapped in [3.68%] pct of memory swapped out [13.12%].
Please make sure there is no memory pressure and the SGA and PGA are configured correctly. Look at DBRM trace file for more details.
Tue Apr 23 08:56:27 2019
Thread 1 cannot allocate new log, sequence 10395
Private strand flush not complete
  Current log# 2 seq# 10394 mem# 0: /hescms/oradata/anbob/redo02a.log
  Current log# 2 seq# 10394 mem# 1: /hescms/oradata/scms/redo02b.log
Thread 1 advanced to log sequence 10395 (LGWR switch)
  Current log# 3 seq# 10395 mem# 0: /hescms/oradata/scms/redo03a.log
  Current log# 3 seq# 10395 mem# 1: /hescms/oradata/scms/redo03b.log
Tue Apr 23 08:56:41 2019
Archived Log entry 10505 added for thread 1 sequence 10394 ID 0xaef43455 dest 1:
Tue Apr 23 09:08:37 2019
System state dump requested by (instance=1, osid=8886 (PMON)), summary=[abnormal instance termination].
Tue Apr 23 09:08:37 2019
PMON (ospid: 8886): terminating the instance due to error 471
System State dumped to trace file /ora/diag/rdbms/scms/scms/trace/scms_diag_8896_20190423090837.trc
Tue Apr 23 09:08:37 2019
opiodr aborting process unknown ospid (22614) as a result of ORA-1092
Tue Apr 23 09:08:38 2019
opiodr aborting process unknown ospid (27627) as a result of ORA-1092
Instance terminated by PMON, pid = 8886
Tue Apr 23 09:18:18 2019
Starting ORACLE instance (normal)
# OS log /var/log/messages
Apr 23 08:52:18 anbobdb kernel: NET: Unregistered protocol family 36
Apr 23 09:07:28 anbobdb kernel: oracle invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0, oom_score_adj=0
Apr 23 09:07:32 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb kernel: oracle cpuset=/ mems_allowed=0-4
Apr 23 09:07:47 anbobdb kernel: Pid: 22753, comm: oracle Not tainted 2.6.32-431.el6.x86_64 #1
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:47 anbobdb kernel: Call Trace:
Apr 23 09:07:47 anbobdb kernel: [] ? dump_header+0x90/0x1b0
Apr 23 09:07:47 anbobdb kernel: [] ? security_real_capable_noaudit+0x3c/0x70
Apr 23 09:07:47 anbobdb kernel: [] ? oom_kill_process+0x82/0x2a0
Apr 23 09:07:47 anbobdb kernel: [] ? select_bad_process+0xe1/0x120
Apr 23 09:07:47 anbobdb kernel: [] ? out_of_memory+0x220/0x3c0
Apr 23 09:07:47 anbobdb kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:48 anbobdb kernel: [] ? alloc_pages_current+0xaa/0x110
Apr 23 09:07:52 anbobdb kernel: [] ? pte_alloc_one+0x1b/0x50
Apr 23 09:07:52 anbobdb kernel: [] ? __pte_alloc+0x32/0x160
Apr 23 09:07:52 anbobdb kernel: [] ? handle_mm_fault+0x1c0/0x300
Apr 23 09:07:52 anbobdb kernel: [] ? down_read_trylock+0x1a/0x30
Note: the OS messages indicate a resource shortage and an OOM killer invocation (TFA will collect this log as part of its diagnostics).
Another case:
Mar 10 17:25:17 anbob kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Mar 10 17:25:17 anbob kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Mar 10 17:25:19 anbob kernel: oracle cpuset=/ mems_allowed=0
Mar 10 17:25:19 anbob kernel: CPU: 3 PID: 10485 Comm: oracle Tainted: GF O-------------- 3.10.0-123.el7.x86_64 #1
Mar 10 17:25:19 anbob kernel: Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1ubuntu1 04/01/2014
Mar 10 17:25:19 anbob kernel: ffff88038ee8b8e0 0000000068e706a6 ffff880025da1938 ffffffff815e19ba
Mar 10 17:25:19 anbob kernel: ffff880025da19c8 ffffffff815dd02d ffffffff810b68f8 ffff8801a4defde0
Mar 10 17:25:19 anbob kernel: 0000000000000202 ffff88038ee8b8e0 ffff880025da19b0 ffffffff81102eff
Mar 10 17:25:19 anbob kernel: Call Trace:
Mar 10 17:25:20 anbob kernel: [<ffffffff815e19ba>] dump_stack+0x19/0x1b
Mar 10 17:25:20 anbob kernel: [<ffffffff815dd02d>] dump_header+0x8e/0x214
Mar 10 17:25:20 anbob kernel: [<ffffffff810b68f8>] ? ktime_get_ts+0x48/0xe0
Mar 10 17:25:20 anbob kernel: [<ffffffff81102eff>] ? delayacct_end+0x8f/0xb0
Mar 10 17:25:20 anbob kernel: [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
Mar 10 17:25:20 anbob kernel: [<ffffffff81144d76>] ? find_lock_task_mm+0x56/0xc0
Mar 10 17:25:20 anbob kernel: [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
Mar 10 17:25:20 anbob kernel: [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
Mar 10 17:25:20 anbob kernel: [<ffffffff8114b579>] __alloc_pages_nodemask+0xa09/0xb10
Mar 10 17:25:20 anbob kernel: [<ffffffff81188779>] alloc_pages_current+0xa9/0x170
Mar 10 17:25:20 anbob kernel: [<ffffffff811419f7>] __page_cache_alloc+0x87/0xb0
Mar 10 17:25:20 anbob kernel: [<ffffffff81143d48>] filemap_fault+0x188/0x430
Mar 10 17:25:20 anbob kernel: [<ffffffff811682ce>] __do_fault+0x7e/0x520
Mar 10 17:25:20 anbob kernel: [<ffffffff8116c615>] handle_mm_fault+0x3e5/0xd90
Mar 10 17:25:20 anbob kernel: [<ffffffff8104606f>] ? kvm_clock_read+0x1f/0x30
Mar 10 17:25:21 anbob kernel: [<ffffffff8101a679>] ? sched_clock+0x9/0x10
Mar 10 17:25:21 anbob kernel: [<ffffffff81099dcd>] ? sched_clock_local+0x1d/0x80
Mar 10 17:25:21 anbob kernel: [<ffffffff815ed186>] __do_page_fault+0x156/0x540
Mar 10 17:25:21 anbob kernel: [<ffffffff8109adfb>] ? thread_group_cputime+0x8b/0xd0
Mar 10 17:25:21 anbob kernel: [<ffffffff8109ae90>] ? thread_group_cputime_adjusted+0x50/0x70
Mar 10 17:25:21 anbob kernel: [<ffffffff815ed58a>] do_page_fault+0x1a/0x70
Mar 10 17:25:21 anbob kernel: [<ffffffff815ecc19>] do_async_page_fault+0x29/0xe0
Mar 10 17:25:21 anbob kernel: [<ffffffff815e97f8>] async_page_fault+0x28/0x30
Mar 10 17:25:23 anbob kernel: Mem-Info:
Mar 10 17:25:23 anbob kernel: Node 0 DMA per-cpu:
Mar 10 17:25:23 anbob kernel: CPU 0: hi: 0, btch: 1 usd: 0
Mar 10 17:25:23 anbob kernel: CPU 1: hi: 0, btch: 1 usd: 0
Mar 10 17:25:23 anbob kernel: CPU 2: hi: 0, btch: 1 usd: 0
Mar 10 17:25:23 anbob kernel: CPU 3: hi: 0, btch: 1 usd: 0
Mar 10 17:25:26 anbob kernel: CPU 14: hi: 186, btch: 31 usd: 0
Mar 10 17:25:26 anbob kernel: CPU 15: hi: 186, btch: 31 usd: 31
Mar 10 17:25:26 anbob kernel: active_anon:1467236 inactive_anon:224984 isolated_anon:0 active_file:2329 inactive_file:2606 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 free:49972 slab_reclaimable:111184 slab_unreclaimable:33656 mapped:738956 shmem:969265 pagetables:6185943 bounce:0 free_cma:0
Mar 10 17:25:26 anbob kernel: Node 0 DMA free:15908kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 10 17:25:26 anbob kernel: lowmem_reserve[]: 0 2806 31987 31987
Mar 10 17:25:26 anbob kernel: Node 0 DMA32 free:122608kB min:5924kB low:7404kB high:8884kB active_anon:71684kB inactive_anon:71720kB active_file:812kB inactive_file:976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129192kB managed:2874324kB mlocked:0kB dirty:0kB writeback:0kB mapped:117900kB shmem:120984kB slab_reclaimable:74880kB slab_unreclaimable:31176kB kernel_stack:4072kB pagetables:2479148kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7790 all_unreclaimable? yes
Mar 10 17:25:26 anbob kernel: lowmem_reserve[]: 0 0 29180 29180
Mar 10 17:25:26 anbob kernel: Node 0 Normal free:61372kB min:61620kB low:77024kB high:92428kB active_anon:5797260kB inactive_anon:828216kB active_file:8504kB inactive_file:9448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29881328kB mlocked:0kB dirty:0kB writeback:0kB mapped:2837924kB shmem:3756076kB slab_reclaimable:369856kB slab_unreclaimable:103448kB kernel_stack:8472kB pagetables:22264624kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:93277 all_unreclaimable? yes
Mar 10 17:25:26 anbob kernel: lowmem_reserve[]: 0 0 0 0
Mar 10 17:25:26 anbob kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
Mar 10 17:25:26 anbob kernel: Node 0 DMA32: 18804*4kB (UEM) 5876*8kB (UM) 14*16kB (MR) 3*32kB (R) 1*64kB (R) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 122608kB
Mar 10 17:25:27 anbob kernel: Node 0 Normal: 12483*4kB (UEMR) 414*8kB (UEM) 168*16kB (UEM) 96*32kB (UEM) 33*64kB (UEM) 2*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 61372kB
Mar 10 17:25:27 anbob kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 10 17:25:27 anbob kernel: 1059304 total pagecache pages
Mar 10 17:25:27 anbob kernel: 85104 pages in swap cache
Mar 10 17:25:27 anbob kernel: Swap cache stats: add 462584273, delete 462499169, find 377223118/426059204
Mar 10 17:25:27 anbob kernel: Free swap = 0kB
Mar 10 17:25:27 anbob kernel: Total swap = 16777212kB
Mar 10 17:25:27 anbob kernel: 8388607 pages RAM
Mar 10 17:25:27 anbob kernel: 193081 pages reserved
Mar 10 17:25:27 anbob kernel: 245495148 pages shared
Note: HugePages is not in use; the system has 32 GB of RAM plus 16 GB of swap, free swap is 0 kB, and the page tables alone consume roughly 20 GB.
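On a live system the same pattern can be confirmed quickly; a minimal sketch using standard /proc files and procps tools:
# Without HugePages, page tables grow with every session that maps the SGA
$ grep -E 'PageTables|HugePages_Total' /proc/meminfo
# Confirm total RAM / swap and how much swap is left
$ free -m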
What is the OOM Killer?
The OOM killer, a feature enabled by default in the Linux kernel, is a self-protection mechanism employed by the kernel under severe memory pressure. If the kernel cannot find memory when an allocation is needed, it puts in-use user data pages on the swap-out queue to be swapped out. If the virtual memory (VM) subsystem can neither allocate memory nor swap out in-use pages, the Out-of-Memory killer may begin killing userspace processes: it sacrifices one or more processes in order to free up memory for the system when all else fails.
In principle, the OOM killer behaves as follows (a quick way to inspect per-process scores is shown after the list):
– Lose the minimum amount of work done
– Recover as much memory as it can
– Do not kill any process that is not itself using a lot of memory
– Kill the minimum number of processes (ideally one)
– Try to kill the process the user would expect to be killed
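To see how the kernel ranks candidates, every process exposes a badness score under /proc. A minimal sketch (PID 1234 is only a placeholder, not from this case):
# Higher oom_score means a more likely victim; oom_score_adj biases the choice
$ cat /proc/1234/oom_score /proc/1234/oom_score_adj
# Rough listing of the highest-scoring processes on the box
$ for p in /proc/[0-9]*; do echo "$(cat $p/oom_score 2>/dev/null) $(cat $p/comm 2>/dev/null)"; done | sort -rn | head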
Probable causes:
1. A spike in memory usage driven by a load event (additional processes are needed for the increased load).
2. A spike in memory usage because additional services were added or migrated to the system (another application was added or a new service was started).
3. A spike in memory usage due to failed hardware, such as a DIMM memory module.
4. A spike in memory usage due to undersized hardware resources for the running application(s).
5. A memory leak in a running application (see the check below).
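For causes 1, 4 and 5 it helps to identify the largest memory consumers at the time of, or shortly before, the incident; a hedged example using standard procps output:
# Resident set size (RSS) per process, largest first; a steadily growing RSS suggests a leak
$ ps aux --sort=-rss | head -20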
If an application uses mlock() or HugeTLB pages (HugePages), its locked pages or HugePages cannot be swapped out, so swap space cannot relieve pressure from that application. When this happens, SwapFree may still show a very large value at the moment the OOM occurs. Overusing locked memory or HugePages, however, can exhaust system memory and leave the system with no other recourse.
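A quick way to check how much memory is pinned is sketched below; the "oracle" OS user name is an assumption for a typical single-instance install:
# HugePages accounting (HugePages are never swapped) and the memlock limit of the oracle user
$ grep -i huge /proc/meminfo
$ su - oracle -c 'ulimit -l'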
Troubleshooting
Check how often the Out-of-Memory (OOM) killer has been invoked.
$ egrep 'Out of memory:' /var/log/messages
Check how much memory the killed processes were consuming.
$ egrep 'total-vm' /var/log/messages
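If /var/log/messages has already rotated, the kernel ring buffer may still hold the event; the exact message text varies slightly between kernel versions, so the pattern below is only a sketch:
$ dmesg | egrep -i 'out of memory|killed process'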
For further analysis, we can check the system activity reporter (sar) data to see what it captured about the OS.
Check swap statistics with the -S flag: a high %swpused indicates swapping and a memory shortage.
$ sar -S -f /var/log/sa/sa2
Check CPU and I/O-wait statistics: a high %user or %system indicates a busy system, and a high %iowait means the system is spending significant time waiting on the underlying storage.
$ sar -f /var/log/sa/sa31
Check memory statistics: high %memused and %commit values tell us the system is using nearly all of its memory; memory committed to processes (a high %commit) is the more concerning of the two.
$ sar -r -f /var/log/sa/sa
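Beyond swap usage, the swapping rate itself is often more telling; a hedged additional check with the standard sysstat -W flag (the file name sa23 is only an example):
# Sustained non-zero pswpin/s and pswpout/s confirm active swapping rather than merely allocated swap
$ sar -W -f /var/log/sa/sa23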
Lastly, check the amount of memory on the system, and how much is free/available:
$ free -m
$ cat /proc/meminfo
$ dmidecode -t memory
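A few /proc/meminfo fields are worth singling out (MemAvailable only exists on newer kernels; the thresholds are judgment calls, not hard rules):
# Low MemAvailable plus Committed_AS approaching CommitLimit means the next large allocation may trigger the OOM killer
$ grep -E 'MemTotal|MemAvailable|SwapFree|CommitLimit|Committed_AS|PageTables' /proc/meminfo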
In an Oracle environment, first check whether the SGA and PGA configuration is reasonable. In this case we later reduced the size of these memory areas, reserved more available memory for the operating system, and configured HugePages. The benefits of HugePages do not need to be repeated here. By the way, if you increase HugePages, check whether the value has reached the kernel.shmall limit, and also check the application processes for memory leaks, including PGA leaks. (Related post: config hugepage linux.)
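As a rough sketch of the sizing arithmetic (the 12 GB SGA in the comments is purely illustrative, not taken from this case): the HugePages pool must be large enough to hold the whole SGA, and kernel.shmall / kernel.shmmax must still cover the shared memory segments.
# Default huge page size on x86_64 is usually 2048 kB
$ grep Hugepagesize /proc/meminfo
# Example only: a 12 GB SGA with 2 MB pages needs at least 12*1024/2 = 6144 huge pages (add a small margin)
# /etc/sysctl.conf entries (illustrative values); apply with sysctl -p, then restart the instance
#   vm.nr_hugepages = 6200
#   kernel.shmall   = 4194304      # in 4 kB pages: 4194304 * 4 kB = 16 GB of shared memory allowed
#   kernel.shmmax   = 68719476736  # bytes; at least the size of the largest SGA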
Solution:
1. Add more RAM / swap to the server to avoid the issue.
2. Configure HugePages for Oracle.
3. Consider pre_page_sga and lock_sga (a sketch of the required settings follows).
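A hedged sketch of item 3 (both parameters are static, so they are set in the spfile and take effect after a restart; lock_sga also needs a sufficient memlock limit for the oracle OS user):
$ sqlplus / as sysdba <<'EOF'
alter system set pre_page_sga=true scope=spfile;
alter system set lock_sga=true scope=spfile;
EOF
# /etc/security/limits.conf entries (illustrative), required for lock_sga / HugePages
#   oracle soft memlock unlimited
#   oracle hard memlock unlimited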
References: Linux: Out-of-Memory (OOM) Killer (Doc ID 452000.1) and the RHEL online documentation.