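Troubleshooting: High Slab SUnreclaim (kmalloc-8192) Memory Usage on RHEL 7
I recently ran into two incidents on database hosts running RHEL 7. In both, high memory usage by the operating system kernel degraded Oracle RAC performance or made it unavailable, and most of the memory was held in the SUnreclaim portion of Slab. The cases had one thing in common: both hosts used a distributed file storage system. In the case described here, a production host with about 750 GB of RAM had roughly 200 GB in Slab, almost all of it SUnreclaim. This post records the case.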
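What is Slab
In Linux, "slab" is a memory allocation mechanism that belongs to the kernel's memory management subsystem. It is dedicated to allocating and freeing small memory objects. The slab allocator divides memory into multiple slab caches, each holding many objects of the same size that can be allocated and freed quickly. This reduces memory fragmentation and makes small-object allocation and release more efficient while keeping memory utilization high. /proc/meminfo reports Slab in two parts: SReclaimable (reclaimable) and SUnreclaim (unreclaimable).
Slab serves two main purposes:
- It allocates small objects without dedicating a whole page to each one, which saves space.
- Some small kernel objects are created and destroyed very frequently. Slab caches them so that identical objects can be reused, which reduces the number of memory allocations.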
Symptoms
Operating system memory usage exceeded 90%, and most of it was Slab SUnreclaim.
oracle@anbob:/home/oracle> cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"
oracle@anbob:/home/oracle> free -g
              total        used        free      shared  buff/cache   available
Mem:            753         493          36          14         223          30
Swap:            19           6          13
oracle@anbob:/home/oracle> cat /proc/meminfo
MemTotal:       790552132 kB
MemFree:        38262416 kB
MemAvailable:   32045452 kB
Buffers:          177444 kB
Cached:         17232144 kB
SwapCached:       234392 kB
Active:         69777460 kB
Inactive:       15421664 kB
Active(anon):   69205652 kB
Inactive(anon): 14676100 kB
Active(file):     571808 kB
Inactive(file):   745564 kB
Unevictable:     4246792 kB
Mlocked:         4246792 kB
SwapTotal:      20971516 kB
SwapFree:       14217060 kB
Dirty:              2092 kB
Writeback:             0 kB
AnonPages:      72415332 kB
Mapped:          4372664 kB
Shmem:          15343984 kB
Slab:           216883280 kB
SReclaimable:     806496 kB
SUnreclaim:     216076784 kB
KernelStack:      133184 kB
PageTables:      2595304 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    218627868 kB
Committed_AS:   119295936 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     2417732 kB
VmallocChunk:   34357109612 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:  192988
HugePages_Free:    39806
HugePages_Rsvd:      419
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:    49613824 kB
DirectMap2M:    344141824 kB
DirectMap1G:    412090368 kB
Note:
buff/cache is over 200 GB. Most of that is Slab, and nearly all of the Slab (over 200 GB) is SUnreclaim.
Matching output of free -k to /proc/meminfo
Red Hat Enterprise Linux 7.1 or later
free output | corresponding /proc/meminfo fields |
---|---|
Mem: total | MemTotal |
Mem: used | MemTotal - MemFree - Buffers - Cached - Slab |
Mem: free | MemFree |
Mem: shared | Shmem |
Mem: buff/cache | Buffers + Cached + Slab |
Mem: available | MemAvailable |
Swap: total | SwapTotal |
Swap: used | SwapTotal - SwapFree |
Swap: free | SwapFree |
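The mapping in the table can be checked by recomputing the Mem: columns of free directly from /proc/meminfo. The following is a minimal sketch, assuming the RHEL 7.1+ formulas listed above; `free_from_meminfo` is a hypothetical helper name, and its optional argument is a meminfo-style file that defaults to /proc/meminfo.

```shell
#!/bin/sh
# Recompute the "Mem:" columns of `free -g` from a meminfo-style file,
# using the RHEL 7.1+ mapping from the table above.
# free_from_meminfo is a hypothetical helper name, not an existing tool.
free_from_meminfo() {
    awk '
        { sub(":", "", $1); kb[$1] = $2 }       # e.g. kb["MemTotal"] = 790552132
        END {
            cache = kb["Buffers"] + kb["Cached"] + kb["Slab"]
            used  = kb["MemTotal"] - kb["MemFree"] - cache
            gib   = 1024 * 1024                 # number of kB in one GiB
            printf "used=%d buff/cache=%d free=%d\n",
                   int(used / gib), int(cache / gib), int(kb["MemFree"] / gib)
        }
    ' "${1:-/proc/meminfo}"
}
```

Fed the /proc/meminfo values shown earlier, this prints `used=494 buff/cache=223 free=36`. The buff/cache and free figures match the free -g output; used differs by 1 GiB, presumably because free and cat /proc/meminfo were run a moment apart.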
RHEL 6, 7, 8 & 9.
- Active(anon): Anonymous memory that has been used more recently and usually not swapped out
- Inactive(anon): Anonymous memory that has not been used recently and can be swapped out
- Active(file): Pagecache memory that has been used more recently and usually not reclaimed until needed
- Inactive(file): Pagecache memory that can be reclaimed without huge performance impact
- Unevictable: Unevictable pages can't be swapped out for a variety of reasons
- Mlocked: Pages locked to memory using the mlock() system call. Mlocked pages are also Unevictable.
- SwapTotal: Total swap space available
- SwapFree: The remaining swap space available
- Dirty: Memory waiting to be written back to disk
- Writeback: Memory which is actively being written back to disk
- AnonPages: Non-file backed pages mapped into userspace page tables
- Mapped: Files which have been mmaped, such as libraries
- Slab: In-kernel data structures cache
- PageTables: Amount of memory dedicated to the lowest level of page tables. This can increase to a high value if a lot of processes are attached to the same shared memory segment.
- Shmem: Total used shared memory (shared between several processes, thus including RAM disks, SYS-V-IPC and BSD like SHMEM)
- SReclaimable: The part of the Slab that might be reclaimed (such as caches)
- SUnreclaim: The part of the Slab that can't be reclaimed under memory pressure
- KernelStack: The memory the kernel stack uses. This is not reclaimable.
- WritebackTmp: Memory used by FUSE for temporary writeback buffers
- HardwareCorrupted: The amount of RAM the kernel identified as corrupted / not working
- AnonHugePages: Non-file backed huge pages mapped into userspace page tables
- HugePages_Surp: The number of hugepages in the pool above the value in vm.nr_hugepages. The maximum number of surplus hugepages is controlled by vm.nr_overcommit_hugepages.
- DirectMap4k: The amount of memory being mapped into the kernel space with 4k size pages.
- DirectMap2M: The amount of memory being mapped into the kernel space with 2MB size pages.
- DirectMap1G: The amount of memory being mapped into the kernel space with 1GB size pages.
See also: Interpreting /proc/meminfo and free output for Red Hat Enterprise Linux
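The /proc/slabinfo file
In Slab, an allocatable chunk of memory is called an object. In the output below, kmalloc-8 is a general-purpose slab cache in which each object is 8 bytes; likewise, each object in kmalloc-16 is 16 bytes, and so on. The question is which caches account for most of the Slab usage.
Total memory used by each cache = num_objs * objsize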
root@anbob:/root> cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
inode_cache 107092 107253 592 55 8 : tunables 0 0 0 : slabdata 1975 1975 0
dentry 424294 432054 192 42 2 : tunables 0 0 0 : slabdata 10287 10287 0
...
kmalloc-8192 29176803 29176803 8192 4 8 : tunables 0 0 0 : slabdata 7320546 7320546 0 --8192*29176803/1024/1024/1024 = 222G
kmalloc-4096 8205 9064 4096 8 8 : tunables 0 0 0 : slabdata 1133 1133 0
kmalloc-2048 35899 36690 2048 16 8 : tunables 0 0 0 : slabdata 2371 2371 0
kmalloc-1024 67641 69952 1024 32 8 : tunables 0 0 0 : slabdata 2186 2186 0
kmalloc-512 689591 709656 512 64 8 : tunables 0 0 0 : slabdata 11140 11140 0
kmalloc-256 1137831 1324864 256 64 4 : tunables 0 0 0 : slabdata 20701 20701 0
kmalloc-192 763850 816186 192 42 2 : tunables 0 0 0 : slabdata 19433 19433 0
kmalloc-128 485959 499008 128 64 2 : tunables 0 0 0 : slabdata 7797 7797 0
...
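The num_objs * objsize formula can be applied to every cache at once to rank them by size. This is a minimal sketch rather than a replacement for slabtop; `slab_top_caches` is a hypothetical helper name, its first argument is a slabinfo-style file (default /proc/slabinfo, which usually requires root to read), and its second argument is the number of rows to show.

```shell
#!/bin/sh
# Rank slab caches by num_objs * objsize, printed in GiB.
# slab_top_caches is a hypothetical helper name, not an existing tool.
slab_top_caches() {
    awk '
        /^slabinfo/ || /^#/ { next }            # skip the two header lines
        {
            # $1 = cache name, $3 = num_objs, $4 = objsize in bytes
            printf "%.1f %s\n", $3 * $4 / (1024 * 1024 * 1024), $1
        }
    ' "${1:-/proc/slabinfo}" | LC_ALL=C sort -rn | head -n "${2:-5}"
}
```

On the listing above, the first row is `222.6 kmalloc-8192`, which agrees with the 222G noted next to that line.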
You can also use slabtop to list the largest caches:
slabtop --sort c --once | head -n12
/bin/slabtop --once
To find the cause of a slab memory leak, you can use the crash tool for static analysis or the perf tool for dynamic analysis.
crash> kmem -S kmalloc-8192|tail -n 10
crash> rd [memory address] 512 -S
-- or --
perf record -a -e kmem:kmalloc --filter 'bytes_alloc == 8192' -e kmem:kfree --filter ' ptr != 0' sleep 200
perf script > testperf.txt
cat testperf.txt
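The perf script text can get large, so it helps to reduce it to allocations that were never freed during the recording window, grouped by call site. The sketch below assumes each event line carries `call_site=...` and `ptr=...` tokens after `kmem:kmalloc:` or `kmem:kfree:`, which is how these tracepoints are commonly printed; verify that against your own testperf.txt. `unfreed_by_call_site` is a hypothetical helper name.

```shell
#!/bin/sh
# Count, per call_site, kmalloc events whose ptr had no later kfree event
# in the recorded window. Assumes key=value tokens such as
# "call_site=ffffffff81000001 ptr=0xffff880000010000" on each event line.
# unfreed_by_call_site is a hypothetical helper name, not a perf feature.
unfreed_by_call_site() {
    awk '
        function field(name,    i, kv) {        # value of "name=..." on this line
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == name) return kv[2]
            }
            return ""
        }
        /kmem:kmalloc:/ { site[field("ptr")] = field("call_site") }
        /kmem:kfree:/   { delete site[field("ptr")] }
        END {
            for (p in site) count[site[p]]++    # pointers still outstanding
            for (s in count) print count[s], s
        }
    ' "$1" | LC_ALL=C sort -rn
}
```

Usage: `unfreed_by_call_site testperf.txt | head`. Long-lived but legitimate allocations also show up as unfreed within a 200-second window, so treat the result as a list of suspects to resolve against kernel symbols, not as proof of a leak.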
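Resolution
When SUnreclaim exceeds 10% of total system memory, a slab memory leak may be present. Slab memory is requested from the buddy allocator by kernel components (or drivers) through the kmalloc family of interfaces; a leak means that a kernel component (or driver) did not release it properly. Once a slab leak has occurred, the memory cannot be reclaimed by killing processes, and the only way to get it back is to reboot the host. A slab leak reduces the memory available to workloads, fragments memory, and may trigger the OOM killer, causing performance jitter.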
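The 10% rule of thumb is easy to turn into a periodic check. This is a minimal sketch; `sunreclaim_check` is a hypothetical helper name, its first argument is a meminfo-style file (default /proc/meminfo), and its second argument is the warning threshold in percent (default 10, taken from the paragraph above).

```shell
#!/bin/sh
# Report SUnreclaim as a percentage of MemTotal and flag it when it
# exceeds a threshold (default 10%).
# sunreclaim_check is a hypothetical helper name, not an existing tool.
sunreclaim_check() {
    awk -v limit="${2:-10}" '
        $1 == "MemTotal:"   { total = $2 }
        $1 == "SUnreclaim:" { sun = $2 }
        END {
            if (total == 0) { print "ERROR MemTotal not found"; exit 2 }
            pct = 100 * sun / total
            printf "%s SUnreclaim=%.1f%% of MemTotal\n",
                   (pct > limit ? "WARN" : "OK"), pct
        }
    ' "${1:-/proc/meminfo}"
}
```

With the values from this case it prints `WARN SUnreclaim=27.3% of MemTotal`. Running it from cron and alerting on lines that start with WARN is one way to do the monitoring, so that a reboot can be planned before memory runs out.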
The Oracle document High Slab SUnreclaim (Doc ID 2913967.1) records a known issue that applies to Linux OS - Version Oracle Linux 7.9 and later.
Cause
The issue is reported in the internal Bug 34670124. It is caused by the *ksplice* patches below:
(1) CVE-2021-4197: Privilege escalation in Control Groups.
(2) Allow to preserve anonymous memory through exec syscalls.
Solution
Rebooting the server as a workaround and the issue is fixed in V4.14.35-2047.516.0 or later.
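For now there is no effective fix (for example, for dentry objects or kmalloc-xxx objects). My advice is to monitor Slab memory usage and schedule operating system reboots. A colleague previously managed to release the memory online at one customer site with the sync&slabinfo -s command. Alternatively, once crash or perf has identified the function call path or the kernel data structure involved in the leak, work with kernel developers or experienced operations engineers to pin down the exact source and then fix the leak itself.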