Troubleshooting High Slab SUnreclaim (kmalloc-8192) Memory Usage on RHEL7

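Recently I ran into two database hosts running Linux 7 where high kernel memory usage degraded Oracle RAC performance or made it unusable. In both cases the memory was held mainly by the SUnreclaim part of the Slab, and both hosts had one thing in common: they used a distributed file storage system. In this case the production host has 750G of memory, and Slab uses over 200G of it, almost all of it SUnreclaim. This post records the case.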

What is Slab

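In Linux, "slab" is a memory allocation mechanism that belongs to the kernel's memory management subsystem and is dedicated to allocating and freeing small memory objects. The slab allocator divides memory into multiple caches (slab caches); each cache holds many objects of the same size, which can be allocated and freed quickly. This reduces memory fragmentation and makes allocating and freeing small objects more efficient, while keeping memory utilization high. Slab memory falls into two categories: SReclaimable (reclaimable) and SUnreclaim (unreclaimable).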

Slab serves two main purposes:

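  • It allocates small objects without dedicating a whole page to each one, which saves space.
  • Some small kernel objects are created and destroyed very frequently. Slab caches these objects so identical objects can be reused, reducing the number of memory allocations.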
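To make the two purposes above concrete, here is a toy Python model of a single slab cache: one backing "page" is carved into equal-size object slots, and freed objects go onto a free list so the next allocation reuses them instead of requesting more memory. This is only an illustration under simplifying assumptions (a 4096-byte page, one cache, no per-CPU lists or NUMA nodes); the class ToySlabCache is hypothetical and is not how the kernel's SLAB/SLUB code is written.

```python
class ToySlabCache:
    """Toy model of one slab cache holding fixed-size objects (illustration only)."""

    PAGE_SIZE = 4096  # assumed page size in bytes

    def __init__(self, obj_size: int):
        if obj_size <= 0 or obj_size > self.PAGE_SIZE:
            raise ValueError("object size must be between 1 and PAGE_SIZE")
        self.obj_size = obj_size
        self.objs_per_slab = self.PAGE_SIZE // obj_size
        self.pages = 0                   # backing pages obtained so far
        self.free_list: list[int] = []   # offsets of free object slots
        self.active: set[int] = set()    # offsets currently handed out

    def _grow(self) -> None:
        # Obtain one new page and carve it into objs_per_slab equal slots.
        base = self.pages * self.PAGE_SIZE
        self.pages += 1
        for i in range(self.objs_per_slab):
            self.free_list.append(base + i * self.obj_size)

    def alloc(self) -> int:
        # Reuse a freed slot if there is one; only grow when the free list is empty.
        if not self.free_list:
            self._grow()
        addr = self.free_list.pop()
        self.active.add(addr)
        return addr

    def free(self, addr: int) -> None:
        # Put the slot back on the free list so the next alloc() can reuse it.
        self.active.remove(addr)
        self.free_list.append(addr)


if __name__ == "__main__":
    cache = ToySlabCache(64)
    a = cache.alloc()
    cache.free(a)
    print(cache.alloc() == a, cache.pages)  # the freed slot is reused; still one page
```

With 64-byte objects one page holds 64 objects, so the 65th live allocation forces a second page. If callers never call free(), pages only ever grow, which is the toy equivalent of the SUnreclaim growth described in this post.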

Symptoms

OS memory usage was above 90%, mostly taken by Slab SUnreclaim.

oracle@anbob:/home/oracle> cat /etc/os-release 
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"

oracle@anbob:/home/oracle> free -g
              total        used        free      shared  buff/cache   available
Mem:            753         493          36          14         223          30
Swap:            19           6          13

oracle@anbob:/home/oracle> cat /proc/meminfo
MemTotal: 790552132 kB
MemFree: 38262416 kB
MemAvailable: 32045452 kB
Buffers: 177444 kB
Cached: 17232144 kB
SwapCached: 234392 kB
Active: 69777460 kB
Inactive: 15421664 kB
Active(anon): 69205652 kB
Inactive(anon): 14676100 kB
Active(file): 571808 kB
Inactive(file): 745564 kB
Unevictable: 4246792 kB
Mlocked: 4246792 kB
SwapTotal: 20971516 kB
SwapFree: 14217060 kB
Dirty: 2092 kB
Writeback: 0 kB
AnonPages: 72415332 kB
Mapped: 4372664 kB
Shmem: 15343984 kB
Slab: 216883280 kB
SReclaimable: 806496 kB
SUnreclaim: 216076784 kB
KernelStack: 133184 kB
PageTables: 2595304 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 218627868 kB
Committed_AS: 119295936 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 2417732 kB
VmallocChunk: 34357109612 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 192988
HugePages_Free: 39806
HugePages_Rsvd: 419
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 49613824 kB
DirectMap2M: 344141824 kB
DirectMap1G: 412090368 kB

Note:
buff/cache uses over 200G; most of it is Slab, and most of the Slab is SUnreclaim (over 200G).
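The kB figures in /proc/meminfo are easier to judge once converted to GiB and to a share of MemTotal. The Python sketch below does that for the Slab fields; the sample text is copied from the output above, and the function names are hypothetical helpers, not part of any tool.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most values are in kB; HugePages_* counts are plain page counts.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def kb_to_gib(kb: int) -> float:
    return kb / 1024 / 1024


# Lines copied from the /proc/meminfo output above.
SAMPLE = """\
MemTotal: 790552132 kB
Slab: 216883280 kB
SReclaimable: 806496 kB
SUnreclaim: 216076784 kB
"""

if __name__ == "__main__":
    # On a live host, read the real file instead: text = open("/proc/meminfo").read()
    info = parse_meminfo(SAMPLE)
    for name in ("Slab", "SReclaimable", "SUnreclaim"):
        share = 100 * info[name] / info["MemTotal"]
        print(f"{name:13s} {kb_to_gib(info[name]):8.1f} GiB  {share:5.1f}% of MemTotal")
```

For this host it shows Slab at about 206.8 GiB and SUnreclaim at about 206.1 GiB, roughly 27% of MemTotal, while SReclaimable is under 1 GiB.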

Matching output of free -k to /proc/meminfo

Red Hat Enterprise Linux 7.1 or later

free output         Corresponding /proc/meminfo fields
Mem: total          MemTotal
Mem: used           MemTotal - MemFree - Buffers - Cached - Slab
Mem: free           MemFree
Mem: shared         Shmem
Mem: buff/cache     Buffers + Cached + Slab
Mem: available      MemAvailable
Swap: total         SwapTotal
Swap: used          SwapTotal - SwapFree
Swap: free          SwapFree


/proc/meminfo fields (RHEL 6, 7, 8 & 9):
  • Active(anon): Anonymous memory that has been used more recently and usually not swapped out
  • Inactive(anon): Anonymous memory that has not been used recently and can be swapped out
  • Active(file): Pagecache memory that has been used more recently and usually not reclaimed until needed
  • Inactive(file): Pagecache memory that can be reclaimed without huge performance impact
  • Unevictable: Unevictable pages can’t be swapped out for a variety of reasons
  • Mlocked: Pages locked to memory using the mlock() system call. Mlocked pages are also Unevictable.
  • SwapTotal: Total swap space available
  • SwapFree: The remaining swap space available
  • Dirty: Memory waiting to be written back to disk
  • Writeback: Memory which is actively being written back to disk
  • AnonPages: Non-file backed pages mapped into userspace page tables
  • Mapped: Files which have been mmaped, such as libraries
  • Slab: In-kernel data structures cache
  • PageTables: Amount of memory dedicated to the lowest level of page tables. This can increase to a high value if a lot of processes are attached to the same shared memory segment.
  • Shmem: Total used shared memory (shared between several processes, thus including RAM disks, SYS-V-IPC and BSD like SHMEM)
  • SReclaimable: The part of the Slab that might be reclaimed (such as caches)
  • SUnreclaim: The part of the Slab that can’t be reclaimed under memory pressure
  • KernelStack: The memory the kernel stack uses. This is not reclaimable.
  • WritebackTmp: Memory used by FUSE for temporary writeback buffers
  • HardwareCorrupted: The amount of RAM the kernel identified as corrupted / not working
  • AnonHugePages: Non-file backed huge pages mapped into userspace page tables
  • HugePages_Surp: The number of hugepages in the pool above the value in vm.nr_hugepages. The maximum number of surplus hugepages is controlled by vm.nr_overcommit_hugepages.
  • DirectMap4k: The amount of memory being mapped into the kernel space with 4k size pages.
  • DirectMap2M: The amount of memory being mapped into the kernel space with 2MB size pages.
  • DirectMap1G: The amount of memory being mapped into the kernel space with 1GB size pages.

For more details, see "Interpreting /proc/meminfo and free output for Red Hat Enterprise Linux".

The /proc/slabinfo file

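In Slab, an allocatable memory block is called an object. In /proc/slabinfo, kmalloc-8 is the general-purpose cache whose objects are 8 bytes each; likewise each object in kmalloc-16 is 16 bytes, and so on. The goal is to find which caches occupy the most memory.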

Total memory used by a cache = num_objs * objsize

root@anbob:/root> cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
inode_cache       107092 107253    592   55    8 : tunables    0    0    0 : slabdata   1975   1975      0
dentry            424294 432054    192   42    2 : tunables    0    0    0 : slabdata  10287  10287      0
...
kmalloc-8192      29176803 29176803   8192    4    8 : tunables    0    0    0 : slabdata 7320546 7320546      0  --8192*29176803/1024/1024/1024 = 222G
kmalloc-4096        8205   9064   4096    8    8 : tunables    0    0    0 : slabdata   1133   1133      0
kmalloc-2048       35899  36690   2048   16    8 : tunables    0    0    0 : slabdata   2371   2371      0
kmalloc-1024       67641  69952   1024   32    8 : tunables    0    0    0 : slabdata   2186   2186      0
kmalloc-512       689591 709656    512   64    8 : tunables    0    0    0 : slabdata  11140  11140      0
kmalloc-256       1137831 1324864    256   64    4 : tunables    0    0    0 : slabdata  20701  20701      0
kmalloc-192       763850 816186    192   42    2 : tunables    0    0    0 : slabdata  19433  19433      0
kmalloc-128       485959 499008    128   64    2 : tunables    0    0    0 : slabdata   7797   7797      0
...
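Ranking caches by size can be scripted. The Python sketch below parses slabinfo version 2.1 rows and computes two footprints per cache: num_objs * objsize (the formula above) and num_slabs * pagesperslab * page size, which counts whole slabs including unused slots. It assumes 4 KiB pages (as on x86_64); the sample rows are copied from the output above, and the helper name slab_usage is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86_64
GIB = 1024 ** 3


def slab_usage(text: str) -> list[tuple[str, int, int]]:
    """Parse /proc/slabinfo (version 2.1) text.

    Returns (name, num_objs * objsize, num_slabs * pagesperslab * PAGE_SIZE)
    per cache, largest slab footprint first.
    """
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the version line, the '#' column header and elided '...' lines.
        if not tokens or tokens[0] == "#" or "slabdata" not in tokens:
            continue
        name = tokens[0]
        num_objs, objsize, pagesperslab = int(tokens[2]), int(tokens[3]), int(tokens[5])
        # slabdata is followed by <active_slabs> <num_slabs> <sharedavail>.
        num_slabs = int(tokens[tokens.index("slabdata") + 2])
        rows.append((name, num_objs * objsize, num_slabs * pagesperslab * PAGE_SIZE))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Rows copied from the slabinfo output above.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
inode_cache       107092 107253    592   55    8 : tunables    0    0    0 : slabdata   1975   1975      0
dentry            424294 432054    192   42    2 : tunables    0    0    0 : slabdata  10287  10287      0
kmalloc-8192      29176803 29176803   8192    4    8 : tunables    0    0    0 : slabdata 7320546 7320546      0
kmalloc-4096        8205   9064   4096    8    8 : tunables    0    0    0 : slabdata   1133   1133      0
"""

if __name__ == "__main__":
    # On a live system (as root): text = open("/proc/slabinfo").read()
    for name, obj_bytes, slab_bytes in slab_usage(SAMPLE)[:10]:
        print(f"{name:16s} objs {obj_bytes / GIB:8.2f} GiB  slabs {slab_bytes / GIB:8.2f} GiB")
```

On these rows kmalloc-8192 comes out on top at about 222.6 GiB by num_objs * objsize and about 223.4 GiB by whole slabs, consistent with the 222G annotated in the output.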

Alternatively, slabtop lists the top caches (--sort c sorts by cache size; --once prints a single snapshot and exits):

slabtop --sort c --once | head -n12

/bin/slabtop --once

To find the cause of a slab memory leak, you can analyze statically with the crash tool or dynamically with perf:

crash> kmem -S kmalloc-8192 | tail -n 10
crash> rd -S [memory address] 512
-- or --

perf record -a -e kmem:kmalloc --filter 'bytes_alloc == 8192' -e kmem:kfree --filter ' ptr != 0' sleep 200
perf script > testperf.txt
cat testperf.txt
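One way to read testperf.txt is to pair each kmalloc with a later kfree of the same ptr and count the allocations that were never freed, grouped by call_site. The Python sketch below does that under assumptions: each event line contains "kmem:kmalloc:" or "kmem:kfree:" followed by key=value fields including ptr= and call_site=, and the sample lines are hypothetical (exact field formatting varies with kernel and perf versions). Allocations still live when the trace ends are only leak candidates, since long-lived objects also stay allocated.

```python
import re
from collections import Counter

# Matches key=value pairs such as call_site=ffffffffc0a01234 or ptr=0xffff880100000000.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def outstanding_by_call_site(lines) -> Counter:
    """Count kmalloc events whose ptr is not kfree'd later in the trace, by call_site."""
    live: dict[str, str] = {}  # ptr -> call_site of an allocation not yet freed
    for line in lines:
        if "kmem:kmalloc:" in line:
            event = "kmem:kmalloc:"
        elif "kmem:kfree:" in line:
            event = "kmem:kfree:"
        else:
            continue
        # Parse only the text after the event name; the prefix holds comm/pid/cpu/time.
        fields = dict(FIELD_RE.findall(line.split(event, 1)[1]))
        ptr = fields.get("ptr")
        if ptr is None:
            continue
        if event == "kmem:kmalloc:":
            live[ptr] = fields.get("call_site", "?")
        else:
            # Frees of memory allocated before the trace (or of other sizes) are ignored.
            live.pop(ptr, None)
    return Counter(live.values())


# Hypothetical perf script lines, for illustration only.
SAMPLE = [
    "kworker/0:1 123 [000] 100.000001: kmem:kmalloc: call_site=ffffffffc0a01234 ptr=0xffff880100000000 bytes_req=8192 bytes_alloc=8192 gfp_flags=GFP_KERNEL",
    "kworker/0:1 123 [000] 100.000002: kmem:kmalloc: call_site=ffffffffc0a01234 ptr=0xffff880100002000 bytes_req=8192 bytes_alloc=8192 gfp_flags=GFP_KERNEL",
    "kworker/0:1 123 [000] 100.000003: kmem:kmalloc: call_site=ffffffff81234567 ptr=0xffff880100004000 bytes_req=8000 bytes_alloc=8192 gfp_flags=GFP_KERNEL",
    "kworker/0:1 123 [000] 100.000004: kmem:kfree: call_site=ffffffff81345678 ptr=0xffff880100004000",
    "kworker/0:1 123 [000] 100.000005: kmem:kfree: call_site=ffffffff81345678 ptr=0xffff880100000000",
]

if __name__ == "__main__":
    # For a real trace: with open("testperf.txt") as f: result = outstanding_by_call_site(f)
    for site, count in outstanding_by_call_site(SAMPLE).most_common(10):
        print(f"{count:10d}  {site}")
```

A call_site that keeps accumulating unfreed allocations over repeated traces is the one to investigate; raw addresses can be translated to function names with crash's sym command or by looking them up in /proc/kallsyms.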

Solution

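When SUnreclaim exceeds 10% of total system memory, there may be a slab memory leak. Slab memory is memory that kernel components (or drivers) allocate through kmalloc-family interfaces (the slab allocator in turn takes its pages from the buddy system); a leak occurs when the component does not free it properly. Once a system has a slab memory leak, the memory cannot be reclaimed by killing processes; only a reboot frees it. A slab leak reduces the memory available to applications, fragments memory, and can trigger the OOM killer, causing performance jitter.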

The Oracle note High Slab SUnreclaim (Doc ID 2913967.1) documents an issue in Linux OS – Version Oracle Linux 7.9 and later.

Cause

The issue is reported in the internal Bug 34670124. It is caused by the *ksplice* patches below:
(1) CVE-2021-4197: Privilege escalation in Control Groups.
(2) Allow to preserve anonymous memory through exec syscalls.

Solution
Reboot the server as a workaround; the issue is fixed in V4.14.35-2047.516.0 or later.

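There is currently no effective fix (for example, for dentry objects or kmalloc-xxx objects). It is advisable to monitor Slab memory usage and schedule OS reboots. A colleague once released this memory online at a customer site with sync && slabinfo -s (slabinfo here is the utility built from the kernel source tree, tools/vm/slabinfo.c; -s shrinks the slab caches). Alternatively, once crash, perf, or similar tools have identified the call path of the leaking function or the affected kernel data structures, work with kernel developers or experienced operations staff to pin down the exact source of the leak and fix it.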
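Monitoring can be as simple as sampling SUnreclaim periodically (for example from cron) and checking two things: the 10%-of-MemTotal rule of thumb from the Solution section, and whether the value keeps growing. The Python sketch below is a minimal version of that idea; the function names, the sample values and the 1 GiB/hour growth limit are hypothetical choices, not thresholds from any vendor note.

```python
KIB_PER_GIB = 1024 * 1024


def mem_total_and_sunreclaim(meminfo_text: str) -> tuple[int, int]:
    """Return (MemTotal, SUnreclaim) in kB from /proc/meminfo-style text."""
    values: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "SUnreclaim"):
            values[key] = int(rest.split()[0])
    return values["MemTotal"], values["SUnreclaim"]


def assess(samples: list[tuple[float, int]], mem_total_kb: int,
           ratio_limit: float = 0.10,
           growth_limit_kb_per_hour: float = KIB_PER_GIB) -> list[str]:
    """samples: time-ordered (unix_seconds, SUnreclaim_kB). Returns warnings (empty if none)."""
    warnings = []
    (t_first, kb_first), (t_last, kb_last) = samples[0], samples[-1]
    if kb_last > ratio_limit * mem_total_kb:
        warnings.append(f"SUnreclaim is {kb_last / mem_total_kb:.1%} of MemTotal")
    hours = (t_last - t_first) / 3600
    if hours > 0:
        growth = (kb_last - kb_first) / hours  # simple first-to-last slope, kB per hour
        if growth > growth_limit_kb_per_hour:
            warnings.append(f"SUnreclaim grows {growth / KIB_PER_GIB:.2f} GiB/hour")
    return warnings


if __name__ == "__main__":
    # Hypothetical samples one hour apart; on a real host each sample would be
    # mem_total_and_sunreclaim(open("/proc/meminfo").read()) paired with time.time().
    mem_total, _ = mem_total_and_sunreclaim("MemTotal: 790552132 kB\nSUnreclaim: 216076784 kB\n")
    samples = [(0.0, 210_000_000), (3600.0, 212_500_000)]
    for warning in assess(samples, mem_total):
        print(warning)
```

Comparing only the first and last samples keeps the sketch short; a real monitor would more likely fit a trend over many samples so that a single spike does not trigger an alert.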
