Troubleshooting CRS Node Evictions on RHEL7 hang messages show ‘NMI watchdog: BUG: soft lockup’
最近上海某客户一套ORACLE RAC发生1节点驱逐, 问题前CPU利用率并不高, CRS日志有I/O响应和IPC超时错误, 部分进程hang死,操作系统是RHEL7.5, 在操作系统meesage提示如下信息:
“kernel: NMI watchdog: BUG: soft lockup – CPU#25 stuck for 22s!”
OS message log
Sep 19 22:46:04 anbob1 kernel: NMI watchdog: BUG: soft lockup - CPU#29 stuck for 22s! [migration/29:155] Sep 19 22:46:04 anbob1 kernel: Modules linked in: tcp_lp nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache joydev tcp_diag inet_diag fuse bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib sunrpc rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel i40iw kvm ib_core irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vfat fat ipmi_ssif dm_service_time ses i40e enclosure pcspkr sg hpwdt mei_me hpilo mei lpc_ich shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter dm_multipath binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper Sep 19 22:46:04 anbob1 kernel: qla2xxx syscopyarea sysfillrect uas sysimgblt fb_sys_fops ttm drm usb_storage crct10dif_pclmul crct10dif_common crc32c_intel smartpqi tg3 scsi_transport_sas scsi_transport_fc i2c_core ptp scsi_tgt pps_core dm_mirror dm_region_hash dm_log dm_mod Sep 19 22:46:04 anbob1 kernel: CPU: 29 PID: 155 Comm: migration/29 Tainted: G L ------------ 3.10.0-862.el7.x86_64 #1 Sep 19 22:46:04 anbob1 kernel: Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018 Sep 19 22:46:04 anbob1 kernel: task: ffff97ea9336cf10 ti: ffff97ea933a0000 task.ti: ffff97ea933a0000 Sep 19 22:46:04 anbob1 kernel: RIP: 0010:[] [] multi_cpu_stop+0x4a/0x110 Sep 19 22:46:04 anbob1 kernel: RSP: 0000:ffff97ea933a3d98 EFLAGS: 00000246 Sep 19 22:46:04 anbob1 kernel: RAX: 0000000000000001 RBX: ffff97faf7bb2f70 RCX: dead000000000200 Sep 19 22:46:04 anbob1 kernel: RDX: ffff9809becd4030 RSI: 0000000000000282 RDI: ffff97faf47e7ac0 Sep 19 22:46:04 anbob1 kernel: RBP: ffff97ea933a3dc0 R08: ffff97faf47e7a90 R09: 0000000000000001 Sep 19 22:46:04 anbob1 kernel: R10: 000000000000b881 R11: 0000000000000001 R12: ffffffffa4ec8ab7 Sep 19 22:46:04 anbob1 kernel: R13: ffff97ea933a3d20 R14: 0000000000000000 R15: ffff97faf7bb3774 Sep 19 22:46:04 anbob1 kernel: FS: 0000000000000000(0000) GS:ffff9809becc0000(0000) knlGS:0000000000000000 Sep 19 22:46:04 anbob1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 19 22:46:04 anbob1 kernel: CR2: 0000000000845000 CR3: 0000003181e74000 CR4: 00000000007607e0 Sep 19 22:46:04 anbob1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Sep 19 22:46:04 anbob1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Sep 19 22:46:04 anbob1 kernel: PKRU: 00000000 Sep 19 22:46:04 anbob1 kernel: Call Trace: Sep 19 22:46:04 anbob1 kernel: [] ? cpu_stop_should_run+0x50/0x50 Sep 19 22:46:04 anbob1 kernel: [] cpu_stopper_thread+0x99/0x150 Sep 19 22:46:04 anbob1 kernel: [] ? __schedule+0x41c/0xa20 Sep 19 22:46:04 anbob1 kernel: [] smpboot_thread_fn+0x144/0x1a0 Sep 19 22:46:04 anbob1 kernel: [] ? lg_double_unlock+0x40/0x40 Sep 19 22:46:04 anbob1 kernel: [] kthread+0xd1/0xe0 Sep 19 22:46:04 anbob1 kernel: [] ? insert_kthread_work+0x40/0x40 Sep 19 22:46:04 anbob1 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Sep 19 22:46:04 anbob1 kernel: [] ? insert_kthread_work+0x40/0x40 Sep 19 22:46:04 anbob1 kernel: Code: 1f 44 00 00 49 89 c5 48 8b 47 18 48 85 c0 0f 84 b3 00 00 00 0f a3 18 19 db 85 db 41 0f 95 c6 45 31 ff 31 c0 0f 1f 44 00 00 f3 90 <41> 8b 5c 24 20 39 c3 74 5d 83 fb 02 74 68 83 fb 03 75 05 45 84 Sep 19 22:46:16 anbob1 kernel: NMI watchdog: BUG: soft lockup - CPU#26 stuck for 22s! [migration/26:140] Sep 19 22:46:16 anbob1 kernel: Modules linked in: tcp_lp nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache joydev tcp_diag inet_diag fuse bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib sunrpc rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel i40iw kvm ib_core irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vfat fat ipmi_ssif dm_service_time ses i40e enclosure pcspkr sg hpwdt mei_me hpilo mei lpc_ich shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter dm_multipath binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper Sep 19 22:46:16 anbob1 kernel: qla2xxx syscopyarea sysfillrect uas sysimgblt fb_sys_fops ttm drm usb_storage crct10dif_pclmul crct10dif_common crc32c_intel smartpqi tg3 scsi_transport_sas scsi_transport_fc i2c_core ptp scsi_tgt pps_core dm_mirror dm_region_hash dm_log dm_mod Sep 19 22:46:16 anbob1 kernel: CPU: 26 PID: 140 Comm: migration/26 Tainted: G L ------------ 3.10.0-862.el7.x86_64 #1 Sep 19 22:46:16 anbob1 kernel: Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018 Sep 19 22:46:16 anbob1 kernel: task: ffff97ea932cdee0 ti: ffff97ea93308000 task.ti: ffff97ea93308000 Sep 19 22:46:16 anbob1 kernel: RIP: 0010:[] [] multi_cpu_stop+0x4a/0x110 Sep 19 22:46:16 anbob1 kernel: RSP: 0000:ffff97ea9330bd98 EFLAGS: 00000246 Sep 19 22:46:16 anbob1 kernel: RAX: 0000000000000001 RBX: ffff97fabde20000 RCX: dead000000000200 Sep 19 22:46:16 anbob1 kernel: RDX: ffff9809bec14030 RSI: 0000000000000282 RDI: ffff97fafbfdbac0 Sep 19 22:46:16 anbob1 kernel: RBP: ffff97ea9330bdc0 R08: ffff97fafbfdba90 R09: 0000000000000001 Sep 19 22:46:16 anbob1 kernel: R10: 000000000000b81d R11: 0000000000000001 R12: ffffffffa4ec8ab7 Sep 19 22:46:16 anbob1 kernel: R13: ffff97ea9330bd20 R14: 0000000000000000 R15: ffff97fabde20804 Sep 19 22:46:16 anbob1 kernel: FS: 0000000000000000(0000) GS:ffff9809bec00000(0000) knlGS:0000000000000000 Sep 19 22:46:16 anbob1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 19 22:46:16 anbob1 kernel: CR2: 0000000000bfc230 CR3: 00000030fbbba000 CR4: 00000000007607e0 Sep 19 22:46:16 anbob1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Sep 19 22:46:16 anbob1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Sep 19 22:46:16 anbob1 kernel: PKRU: 00000000 Sep 19 22:46:16 anbob1 kernel: Call Trace: Sep 19 22:46:16 anbob1 kernel: [] ? cpu_stop_should_run+0x50/0x50 Sep 19 22:46:16 anbob1 kernel: [] cpu_stopper_thread+0x99/0x150 Sep 19 22:46:16 anbob1 kernel: [] ? __schedule+0x41c/0xa20 Sep 19 22:46:16 anbob1 kernel: [] smpboot_thread_fn+0x144/0x1a0 Sep 19 22:46:16 anbob1 kernel: [] ? lg_double_unlock+0x40/0x40 Sep 19 22:46:16 anbob1 kernel: [] kthread+0xd1/0xe0 Sep 19 22:46:16 anbob1 kernel: [] ? insert_kthread_work+0x40/0x40 Sep 19 22:46:16 anbob1 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Sep 19 22:46:16 anbob1 kernel: [] ? insert_kthread_work+0x40/0x40 Sep 19 22:46:16 anbob1 kernel: Code: 1f 44 00 00 49 89 c5 48 8b 47 18 48 85 c0 0f 84 b3 00 00 00 0f a3 18 19 db 85 db 41 0f 95 c6 45 31 ff 31 c0 0f 1f 44 00 00 f3 90 <41> 8b 5c 24 20 39 c3 74 5d 83 fb 02 74 68 83 fb 03 75 05 45 84 Sep 19 22:46:20 anbob1 kernel: NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [migration/25:135] Sep 19 22:46:20 anbob1 kernel: Modules linked in: tcp_lp nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache joydev tcp_diag inet_diag fuse bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib sunrpc rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel i40iw kvm ib_core irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vfat fat ipmi_ssif dm_service_time ses i40e enclosure pcspkr sg hpwdt mei_me hpilo mei lpc_ich shpchp ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter dm_multipath binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper Sep 19 22:46:20 anbob1 kernel: qla2xxx syscopyarea sysfillrect uas sysimgblt fb_sys_fops ttm drm usb_storage crct10dif_pclmul crct10dif_common crc32c_intel smartpqi tg3 scsi_transport_sas scsi_transport_fc i2c_core ptp scsi_tgt pps_core dm_mirror dm_region_hash dm_log dm_mod Sep 19 22:46:20 anbob1 kernel: CPU: 25 PID: 135 Comm: migration/25 Tainted: G L ------------ 3.10.0-862.el7.x86_64 #1 Sep 19 22:46:20 anbob1 kernel: Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018 Sep 19 22:46:20 anbob1 kernel: task: ffff97ea932c8fd0 ti: ffff97ea932d4000 task.ti: ffff97ea932d4000 Sep 19 22:46:20 anbob1 kernel: RIP: 0010:[] [] multi_cpu_stop+0x4a/0x110 Sep 19 22:46:20 anbob1 kernel: RSP: 0000:ffff97ea932d7d98 EFLAGS: 00000246 Sep 19 22:46:20 anbob1 kernel: RAX: 0000000000000001 RBX: ffffffffa5512b08 RCX: dead000000000200 Sep 19 22:46:20 anbob1 kernel: RDX: ffff9809bebd4030 RSI: 0000000000000282 RDI: ffff97dbcead3ac0 Sep 19 22:46:20 anbob1 kernel: RBP: ffff97ea932d7dc0 R08: ffff97dbcead3a90 R09: 0000000000000001 Sep 19 22:46:20 anbob1 kernel: R10: 000000000000bea5 R11: 0000000000000001 R12: ffffffffa4ec8ab7 Sep 19 22:46:20 anbob1 kernel: R13: ffff97ea932d7d20 R14: 0000000000000000 R15: ffff9809bc5566e4 Sep 19 22:46:20 anbob1 kernel: FS: 0000000000000000(0000) GS:ffff9809bebc0000(0000) knlGS:0000000000000000 Sep 19 22:46:20 anbob1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 19 22:46:20 anbob1 kernel: CR2: 00007ffff1857088 CR3: 00000011d76da000 CR4: 00000000007607e0 Sep 19 22:46:20 anbob1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Sep 19 22:46:20 anbob1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Note:
从multi_cpu_stop cpu_stop_should_run cpu_stopper_thread 函数猜测这是Linux Kernel层关于cpu_stop 相关函数调用。 OS运行时间 CPU STOP 很可能是一些OS调度比如降频、节能, 想起前段时间协助成都客户分析的也是RHEL 频繁重启的案例 《Troubleshooting Oracle RAC node crash frequently on OEL 6.9 (Kernel panic) 》。
查看CPU主频
Note:
可以看到当前CPU 主频在1G-3.4G不等,说明大量CPU出于降频。
查找RHEL 相关BUG.
Soft lockup in multi_cpu_stop+0x7f 文章记录的现象一致,属于linux已知BUG. The issue was tracked on a private bugzilla, RHBZ#1432844.
原因:
This is the potential C-states problem that has long been a known issue. It’s a power saving mode of the CPU chip. Simply put, the CPU chip “turns off” various portions of itself to save/reduce power. The problem however is that in doing so, it can have problems “waking up”. As a result, various problems where a processor appears to “stall” and does nothing because it doesn’t get woken up. Changing the c-states can disable this power sleeping mode
与之前RHEL 6上C STATE的问题相似, 建议RHEL 7 通过tuned profile 切到throughput-performance 模式禁用 C-states.
对不起,这篇文章暂时关闭评论。