
Troubleshooting Out-Of-Memory (OOM) killer db crash when memory exhausted

In an Oracle database environment, the instance terminated due to the death of a background process. In this case it was DBWR, and the alert log and trace files contained no further information about why the DBWR process died.

# db alert log

Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Tue Apr 23 08:54:27 2019
WARNING: Heavy swapping observed on system in last 5 mins.
pct of memory swapped in [3.68%] pct of memory swapped out [13.12%].
Please make sure there is no memory pressure and the SGA and PGA 
are configured correctly. Look at DBRM trace file for more details.
Tue Apr 23 08:56:27 2019
Thread 1 cannot allocate new log, sequence 10395
Private strand flush not complete
  Current log# 2 seq# 10394 mem# 0: /hescms/oradata/anbob/redo02a.log
  Current log# 2 seq# 10394 mem# 1: /hescms/oradata/scms/redo02b.log
Thread 1 advanced to log sequence 10395 (LGWR switch)
  Current log# 3 seq# 10395 mem# 0: /hescms/oradata/scms/redo03a.log
  Current log# 3 seq# 10395 mem# 1: /hescms/oradata/scms/redo03b.log
Tue Apr 23 08:56:41 2019
Archived Log entry 10505 added for thread 1 sequence 10394 ID 0xaef43455 dest 1:
Tue Apr 23 09:08:37 2019
System state dump requested by (instance=1, osid=8886 (PMON)), summary=[abnormal instance termination].
Tue Apr 23 09:08:37 2019
PMON (ospid: 8886): terminating the instance due to error 471
System State dumped to trace file /ora/diag/rdbms/scms/scms/trace/scms_diag_8896_20190423090837.trc
Tue Apr 23 09:08:37 2019
opiodr aborting process unknown ospid (22614) as a result of ORA-1092
Tue Apr 23 09:08:38 2019
opiodr aborting process unknown ospid (27627) as a result of ORA-1092
Instance terminated by PMON, pid = 8886
Tue Apr 23 09:18:18 2019
Starting ORACLE instance (normal)

# OS log /var/log/messages

Apr 23 08:52:18 anbobdb kernel: NET: Unregistered protocol family 36
Apr 23 09:07:28 anbobdb kernel: oracle invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0, oom_score_adj=0
Apr 23 09:07:32 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb kernel: oracle cpuset=/ mems_allowed=0-4
Apr 23 09:07:47 anbobdb kernel: Pid: 22753, comm: oracle Not tainted 2.6.32-431.el6.x86_64 #1
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:47 anbobdb kernel: Call Trace:
Apr 23 09:07:47 anbobdb kernel: [] ? dump_header+0x90/0x1b0
Apr 23 09:07:47 anbobdb kernel: [] ? security_real_capable_noaudit+0x3c/0x70
Apr 23 09:07:47 anbobdb kernel: [] ? oom_kill_process+0x82/0x2a0
Apr 23 09:07:47 anbobdb kernel: [] ? select_bad_process+0xe1/0x120
Apr 23 09:07:47 anbobdb kernel: [] ? out_of_memory+0x220/0x3c0
Apr 23 09:07:47 anbobdb kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: The canary thread is apparently starving. Taking action.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoting known real-time threads.
Apr 23 09:07:47 anbobdb rtkit-daemon[3097]: Demoted 0 threads.
Apr 23 09:07:48 anbobdb kernel: [] ? alloc_pages_current+0xaa/0x110
Apr 23 09:07:52 anbobdb kernel: [] ? pte_alloc_one+0x1b/0x50
Apr 23 09:07:52 anbobdb kernel: [] ? __pte_alloc+0x32/0x160
Apr 23 09:07:52 anbobdb kernel: [] ? handle_mm_fault+0x1c0/0x300
Apr 23 09:07:52 anbobdb kernel: [] ? down_read_trylock+0x1a/0x30

Note: the OS messages indicate a resource shortage and the OOM killer being invoked (TFA will collect this log).

Another case:

Mar 10 17:25:17 anbob kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Mar 10 17:25:17 anbob kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Mar 10 17:25:19 anbob kernel: oracle cpuset=/ mems_allowed=0
Mar 10 17:25:19 anbob kernel: CPU: 3 PID: 10485 Comm: oracle Tainted: GF O-------------- 3.10.0-123.el7.x86_64 #1
Mar 10 17:25:19 anbob kernel: Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1ubuntu1 04/01/2014
Mar 10 17:25:19 anbob kernel: ffff88038ee8b8e0 0000000068e706a6 ffff880025da1938 ffffffff815e19ba
Mar 10 17:25:19 anbob kernel: ffff880025da19c8 ffffffff815dd02d ffffffff810b68f8 ffff8801a4defde0
Mar 10 17:25:19 anbob kernel: 0000000000000202 ffff88038ee8b8e0 ffff880025da19b0 ffffffff81102eff
Mar 10 17:25:19 anbob kernel: Call Trace:
Mar 10 17:25:20 anbob kernel: [<ffffffff815e19ba>] dump_stack+0x19/0x1b
Mar 10 17:25:20 anbob kernel: [<ffffffff815dd02d>] dump_header+0x8e/0x214
Mar 10 17:25:20 anbob kernel: [<ffffffff810b68f8>] ? ktime_get_ts+0x48/0xe0
Mar 10 17:25:20 anbob kernel: [<ffffffff81102eff>] ? delayacct_end+0x8f/0xb0
Mar 10 17:25:20 anbob kernel: [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
Mar 10 17:25:20 anbob kernel: [<ffffffff81144d76>] ? find_lock_task_mm+0x56/0xc0
Mar 10 17:25:20 anbob kernel: [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
Mar 10 17:25:20 anbob kernel: [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
Mar 10 17:25:20 anbob kernel: [<ffffffff8114b579>] __alloc_pages_nodemask+0xa09/0xb10
Mar 10 17:25:20 anbob kernel: [<ffffffff81188779>] alloc_pages_current+0xa9/0x170
Mar 10 17:25:20 anbob kernel: [<ffffffff811419f7>] __page_cache_alloc+0x87/0xb0
Mar 10 17:25:20 anbob kernel: [<ffffffff81143d48>] filemap_fault+0x188/0x430
Mar 10 17:25:20 anbob kernel: [<ffffffff811682ce>] __do_fault+0x7e/0x520
Mar 10 17:25:20 anbob kernel: [<ffffffff8116c615>] handle_mm_fault+0x3e5/0xd90
Mar 10 17:25:20 anbob kernel: [<ffffffff8104606f>] ? kvm_clock_read+0x1f/0x30
Mar 10 17:25:21 anbob kernel: [<ffffffff8101a679>] ? sched_clock+0x9/0x10
Mar 10 17:25:21 anbob kernel: [<ffffffff81099dcd>] ? sched_clock_local+0x1d/0x80
Mar 10 17:25:21 anbob kernel: [<ffffffff815ed186>] __do_page_fault+0x156/0x540
Mar 10 17:25:21 anbob kernel: [<ffffffff8109adfb>] ? thread_group_cputime+0x8b/0xd0
Mar 10 17:25:21 anbob kernel: [<ffffffff8109ae90>] ? thread_group_cputime_adjusted+0x50/0x70
Mar 10 17:25:21 anbob kernel: [<ffffffff815ed58a>] do_page_fault+0x1a/0x70
Mar 10 17:25:21 anbob kernel: [<ffffffff815ecc19>] do_async_page_fault+0x29/0xe0
Mar 10 17:25:21 anbob kernel: [<ffffffff815e97f8>] async_page_fault+0x28/0x30
Mar 10 17:25:23 anbob kernel: Mem-Info:
Mar 10 17:25:23 anbob kernel: Node 0 DMA per-cpu:
Mar 10 17:25:23 anbob kernel: CPU 0: hi: 0, btch: 1 usd: 0
Mar 10 17:25:23 anbob kernel: CPU 1: hi: 0, btch: 1 usd: 0
Mar 10 17:25:23 anbob kernel: CPU 2: hi: 0, btch: 1 usd: 0
Mar 10 17:25:23 anbob kernel: CPU 3: hi: 0, btch: 1 usd: 0

Mar 10 17:25:26 anbob kernel: CPU 14: hi: 186, btch: 31 usd: 0
Mar 10 17:25:26 anbob kernel: CPU 15: hi: 186, btch: 31 usd: 31
Mar 10 17:25:26 anbob kernel: active_anon:1467236 inactive_anon:224984 isolated_anon:0
active_file:2329 inactive_file:2606 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:49972 slab_reclaimable:111184 slab_unreclaimable:33656
mapped:738956 shmem:969265 pagetables:6185943 bounce:0
free_cma:0
Mar 10 17:25:26 anbob kernel: Node 0 DMA free:15908kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 10 17:25:26 anbob kernel: lowmem_reserve[]: 0 2806 31987 31987
Mar 10 17:25:26 anbob kernel: Node 0 DMA32 free:122608kB min:5924kB low:7404kB high:8884kB active_anon:71684kB inactive_anon:71720kB active_file:812kB inactive_file:976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129192kB managed:2874324kB mlocked:0kB dirty:0kB writeback:0kB mapped:117900kB shmem:120984kB slab_reclaimable:74880kB slab_unreclaimable:31176kB kernel_stack:4072kB pagetables:2479148kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7790 all_unreclaimable? yes
Mar 10 17:25:26 anbob kernel: lowmem_reserve[]: 0 0 29180 29180
Mar 10 17:25:26 anbob kernel: Node 0 Normal free:61372kB min:61620kB low:77024kB high:92428kB active_anon:5797260kB 
inactive_anon:828216kB active_file:8504kB inactive_file:9448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:30408704kB 
managed:29881328kB mlocked:0kB dirty:0kB writeback:0kB mapped:2837924kB shmem:3756076kB slab_reclaimable:369856kB slab_unreclaimable:103448kB 
kernel_stack:8472kB pagetables:22264624kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:93277 all_unreclaimable? yes
Mar 10 17:25:26 anbob kernel: lowmem_reserve[]: 0 0 0 0
Mar 10 17:25:26 anbob kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
Mar 10 17:25:26 anbob kernel: Node 0 DMA32: 18804*4kB (UEM) 5876*8kB (UM) 14*16kB (MR) 3*32kB (R) 1*64kB (R) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 122608kB
Mar 10 17:25:27 anbob kernel: Node 0 Normal: 12483*4kB (UEMR) 414*8kB (UEM) 168*16kB (UEM) 96*32kB (UEM) 33*64kB (UEM) 2*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 61372kB
Mar 10 17:25:27 anbob kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 10 17:25:27 anbob kernel: 1059304 total pagecache pages
Mar 10 17:25:27 anbob kernel: 85104 pages in swap cache
Mar 10 17:25:27 anbob kernel: Swap cache stats: add 462584273, delete 462499169, find 377223118/426059204
Mar 10 17:25:27 anbob kernel: Free swap = 0kB
Mar 10 17:25:27 anbob kernel: Total swap = 16777212kB
Mar 10 17:25:27 anbob kernel: 8388607 pages RAM
Mar 10 17:25:27 anbob kernel: 193081 pages reserved
Mar 10 17:25:27 anbob kernel: 245495148 pages shared

Note:
HugePages is not in use; the host has 32 GB RAM + 16 GB swap, free swap is 0 kB, and page tables consume about 20 GB.

What is OOM Killer?
The OOM killer, enabled by default in the Linux kernel, is a self-protection mechanism the kernel employs when it is under severe memory pressure. If the kernel cannot find memory to allocate when it is needed, it puts in-use user data pages on the swap-out queue to be swapped out. If the Virtual Memory (VM) subsystem cannot allocate memory and cannot swap out in-use memory, the Out-of-Memory killer may begin killing userspace processes: it sacrifices one or more processes in order to free up memory for the system when all else fails.

In principle the OOM killer behaves as follows (see the sketch after this list for inspecting per-process OOM scores):
– Lose the minimum amount of work done
– Recover as much memory as it can
– Do not kill a process that is not, by itself, using a lot of memory
– Kill the minimum number of processes (ideally one)
– Try to kill the process the user would expect to be killed
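To see which processes the kernel is most likely to target, you can inspect the per-process badness score it maintains. A minimal sketch, assuming the database software runs as the oracle OS user and a 2.6.36+ kernel that exposes /proc/<pid>/oom_score_adj (older kernels, like the 2.6.32 one in the first log, expose oom_adj instead):

# List the OOM score of every process owned by the oracle user;
# higher scores are killed first.
for pid in $(pgrep -u oracle); do
    printf "%-8s %-24s score=%-6s adj=%s\n" "$pid" \
        "$(cat /proc/$pid/comm 2>/dev/null)" \
        "$(cat /proc/$pid/oom_score 2>/dev/null)" \
        "$(cat /proc/$pid/oom_score_adj 2>/dev/null)"
done

# A critical process can be made less attractive to the OOM killer
# (-1000 exempts it entirely) -- use with care, the kernel will then
# pick another victim:
# echo -1000 > /proc/<pid>/oom_score_adj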

Probable causes:

1 Spike in memory usage based on a load event (additional processes are needed for the increased load).
2 Spike in memory usage based on additional services being added or migrated to the system (another application or a new service was started on the system).
3 Spike in memory usage due to failed hardware such as a DIMM memory module.
4 Spike in memory usage due to undersized hardware resources for the running application(s).
5 A memory leak in a running application (see the sketch after this list).
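For cause 5, a simple way to spot a leaking process is to sample resident set sizes over time and look for steady growth under constant load. A minimal sketch (the interval and log path are arbitrary choices, not from the original case):

# Log the top 10 RSS consumers once a minute; a process whose RSS keeps
# climbing under a steady workload is a leak candidate.
while true; do
    date
    ps -eo pid,user,rss,comm --sort=-rss | head -11
    sleep 60
done >> /tmp/rss_watch.log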

If an application uses mlock() or HugeTLB pages (HugePages), its locked pages or HugePages cannot be swapped out, so SwapFree may still show a very large value when the OOM occurs. However, overusing them may exhaust system memory and leave the system with no other recourse.
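To confirm whether HugePages and locked memory are in play, check the HugePages counters and the memlock limit of the database owner (assuming the owner is the oracle OS user):

$ grep -i huge /proc/meminfo
$ su - oracle -c 'ulimit -l'              # memlock limit in kB, should cover the HugePages-backed SGA
$ grep memlock /etc/security/limits.conf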

Troubleshooting
Check how often the Out-of-Memory (OOM) killer has been triggered.

$ egrep 'Out of memory:' /var/log/messages

Check how much memory the killed processes were consuming.

$ egrep 'total-vm' /var/log/messages
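The kernel prints the victim's footprint (total-vm, anon-rss, file-rss) on the "Killed process" line, so a quick summary can be pulled straight from the messages files; for example:

$ grep -h 'Out of memory' /var/log/messages* | awk '{print $1, $2}' | sort | uniq -c   # OOM kills per day
$ grep -h 'Killed process' /var/log/messages* | grep -o 'Killed process.*'             # victim and its total-vm/anon-rss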

For further analysis, check the system activity reporter (SAR) data to see what it captured about the OS.

Check swap statistics with the -S flag: a high %swpused indicates swapping and a memory shortage.

 $ sar -S -f /var/log/sa/sa2

Check CPU and I/O wait statistics: high %user or %system indicates a busy system, and a high %iowait means the system is spending significant time waiting on the underlying storage.

 $ sar -f /var/log/sa/sa31

Check memory statistics: high %memused and %commit values show the system is using nearly all of its memory; a high %commit (memory committed to processes) is the more concerning sign.

 $ sar -r -f /var/log/sa/sa
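sar also accepts -s/-e start and end times, which makes it easy to zoom in on the window around the crash. For the first incident (Apr 23, roughly 08:50 to 09:10, so the sa23 file, assuming sysstat keeps per-day files named saDD):

$ sar -S -f /var/log/sa/sa23 -s 08:50:00 -e 09:10:00   # swap usage
$ sar -r -f /var/log/sa/sa23 -s 08:50:00 -e 09:10:00   # memory usage / %commit
$ sar -B -f /var/log/sa/sa23 -s 08:50:00 -e 09:10:00   # paging: majflt/s, pgscank/s, pgsteal/s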

Lastly, check the amount of memory on the system, and how much is free/available:

$ free -m
or
$ cat /proc/meminfo
or
$ dmidecode -t memory
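In the second case the interesting number is pagetables (about 20 GB): without HugePages, every dedicated server process builds its own page tables for the whole SGA at 4 kB granularity. The same counters can be read live from /proc/meminfo, and the allocated shared memory segments can be summed with ipcs:

$ grep -E 'MemFree|SwapFree|PageTables|HugePages_Total|HugePages_Free|Hugepagesize' /proc/meminfo
$ ipcs -m | awk '/^0x/ {sum += $5} END {printf "shared memory segments: %.1f GB\n", sum/1024/1024/1024}'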

In the Oracle environment, first check whether the SGA and PGA configuration is reasonable. In this case we later reduced the size of these memory areas, reserved more available memory for the operating system, and configured HugePages (the benefits of HugePages are not repeated here; a rough sizing sketch follows below). BTW, if you increase HugePages, check whether the total shared memory has reached the upper limit of kernel.shmall, and also check for application-process memory leaks, even a PGA leak. See also: config hugepage linux
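A rough HugePages sizing sketch, in the spirit of Oracle's hugepages_settings.sh script: with all instances up, sum the allocated shared memory segments, divide by the HugePage size, and verify the kernel shared memory limits. Treat the result as a starting point, not an exact value.

$ HPG_SZ=$(awk '/Hugepagesize/ {print $2}' /proc/meminfo)   # HugePage size in kB
$ ipcs -m | awk -v h=$HPG_SZ '/^0x/ {p += int($5/1024/h)+1} END {print "suggested vm.nr_hugepages =", p}'
$ sysctl kernel.shmmax kernel.shmall vm.nr_hugepages        # kernel.shmall is counted in 4 kB pages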

Solution:
1 Add more RAM / swap to the server to avoid this issue.
2 Configure HugePages for Oracle.
3 Set pre_page_sga and lock_sga (see the sketch after this list).
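A minimal sketch of the related instance parameters, assuming 11.2 or later, an spfile, and that HugePages and the memlock limit are already configured; an instance restart is required for these to take effect:

$ sqlplus -s / as sysdba <<'EOF'
-- refuse to start unless the whole SGA fits in HugePages
ALTER SYSTEM SET use_large_pages='ONLY' SCOPE=SPFILE;
-- or lock / pre-touch the SGA in physical memory so it is never swapped
-- (needs a sufficient memlock limit for the oracle user)
ALTER SYSTEM SET lock_sga=TRUE SCOPE=SPFILE;
ALTER SYSTEM SET pre_page_sga=TRUE SCOPE=SPFILE;
EOF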

 

 

References: Linux: Out-of-Memory (OOM) Killer (MOS Doc ID 452000.1) and the RHEL online documentation.

