Troubleshooting Exadata X8 machine node reboot frequently rds_send_remove_from_sock
最近有个客户的Oracle Exadata x8 数据库主机操作系统总是频繁的重启,重启前在DB,CRS层没有任何错误信息, 当时的OS负载也比较低,仅从exawatcher的mpstat能发现在重启前15s左右部分CPU core sys使用达100%。 操作系统有配置kdump生成了dump信息,发现在CPU在等待Watchdog detected hard LOCKUP on cpu 11, 堆栈调用中包含rds_send_remove_from_sock,简单记录。
查找kdump生成的vmcore
# egrep '^path' /etc/kdump.conf path /var/crash
Note:
注意检查路径文件系统可用空间要大于物理内存大小。也可以配置文件压缩使用 -c 参数。
core_collector makedumpfile -c
可以使用crash工具分析vmcore文件
[2646529.732110] megaraid_sas 0000:65:00.0: Application firmware crash dump mode set success [2646530.281897] megaraid_sas 0000:65:00.0: Application firmware crash dump mode set success [2696963.826326] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11 [2696963.826327] Modules linked in: oracleacfs(PO) oracleadvm(PO) oracleoks(PO) ipmi_poweroff scsi_transport_iscsi ipmi_ssif bonding ib_umad mlx4_en mlx4_ib vfat fat skx_edac intel_powerclamp coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel mlx4_core crypto_simd glue_helper ipmi_si cryptd ipmi_devintf ipmi_msghandler pcc_cpufreq ioatdma i2c_i801 lpc_ich wmi shpchp rds_rdma resilient_rdmaip dm_multipath ib_ipoib rds rdma_ucm rdma_cm iw_cm ib_cm ib_uverbs binfmt_misc sg ib_core fuse ip_tables ext4 mbcache jbd2 fscrypto sd_mod igb ahci libahci megaraid_sas crc32c_intel bnxt_en libata ptp pps_core i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod [2696963.826338] CPU: 11 PID: 218135 Comm: oracle_218135_p Tainted: P O 4.14.35-1902.5.1.2.el7uek.x86_64 #2 [2696963.826339] Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM,MB,X8-2, BIOS 51020500 08/26/2019 [2696963.826339] task: ffff9e6f50968000 task.stack: ffffbd275b750000 [2696963.826339] RIP: 0010:native_queued_spin_lock_slowpath+0x155/0x1f8 [2696963.826339] RSP: 0018:ffff9e66408c3dd0 EFLAGS: 00000002 [2696963.826340] RAX: 0000000000000101 RBX: ffff9e7f9e958d10 RCX: 0000000000000001 [2696963.826340] RDX: 0000000000000101 RSI: 0000000000000101 RDI: ffff9e7f9e958d10 [2696963.826340] RBP: ffff9e66408c3de8 R08: 0000000000000000 R09: ffff9e3aed816e70 [2696963.826341] R10: 0000000000000001 R11: ffff9e2e11a1c800 R12: 0000000000000001 [2696963.826341] R13: 0000000000000000 R14: ffff9e6b55223400 R15: ffff9e6b55223418 [2696963.826341] FS: 00007f9392b652c0(0000) GS:ffff9e66408c0000(0000) knlGS:0000000000000000 [2696963.826341] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2696963.826342] CR2: 000000247bff3118 CR3: 0000004c2a31a001 CR4: 00000000007606e0 [2696963.826342] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [2696963.826342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [2696963.826342] PKRU: 55555554 [2696963.826342] Call Trace: [2696963.826343] [2696963.826343] queued_spin_lock_slowpath+0xb/0xf [2696963.826343] _raw_spin_lock+0x2e/0x32 [2696963.826343] rds_send_remove_from_sock+0xb9/0x420 [rds] [2696963.826344] rds_send_path_drop_acked+0x138/0x160 [rds] [2696963.826344] rds_send_drop_acked+0x1f/0x30 [rds] [2696963.826344] rds_ib_rx+0x1ca/0x220 [rds_rdma] [2696963.826344] rds_ib_tasklet_fn_recv+0x30/0x40 [rds_rdma] [2696963.826344] tasklet_action+0x10e/0x117 [2696963.826345] __do_softirq+0xd9/0x28d [2696963.826345] do_softirq_own_stack+0x2a/0x35 [2696963.826345] [2696963.826345] do_softirq+0x4d/0x6a [2696963.826345] __local_bh_enable_ip+0x57/0x59 [2696963.826346] _raw_spin_unlock_bh+0x23/0x25 [2696963.826346] rds_add_buf_info+0x3ac/0x830 [rds] [2696963.826346] rds_sendmsg+0x3b6/0xda0 [rds] [2696963.826346] ? rw_copy_check_uvector+0x6b/0x138 [2696963.826346] ? copy_msghdr_from_user+0xdb/0x135 [2696963.826347] sock_sendmsg+0x3e/0x4a [2696963.826347] ___sys_sendmsg+0x2b5/0x2c6 [2696963.826347] ? huge_pmd_share+0x2b9/0x4a7 [2696963.826347] ? huge_pte_alloc+0xab/0x145 [2696963.826347] ? mutex_lock+0x12/0x3a [2696963.826348] ? audit_filter_syscall+0x33/0xce [2696963.826348] __sys_sendmsg+0x54/0x8d [2696963.826348] SyS_sendmsg+0x12/0x1c [2696963.826348] do_syscall_64+0x79/0x1b9 [2696963.826348] entry_SYSCALL_64_after_hwframe+0x169/0x0 [2696963.826349] RIP: 0033:0x7f938ed4cca0 [2696963.826349] RSP: 002b:00007fff64b54d18 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [2696963.826349] RAX: ffffffffffffffda RBX: 00007f9392b33690 RCX: 00007f938ed4cca0 [2696963.826350] RDX: 0000000000000000 RSI: 00007fff64b54d20 RDI: 000000000000000f [2696963.826350] RBP: 00007fff64b54db0 R08: 00007fff64b54ed8 R09: 0000000000000003 [2696963.826350] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001 [2696963.826350] R13: 00000000126fdc58 R14: 00007f9392aa9050 R15: 00000000126f31f8 [2696963.826351] Code: c0 74 3f 41 0f 0d 08 eb 39 f3 90 44 8b 23 41 81 fc 00 01 00 00 74 f2 e9 1b ff ff ff 83 fa 01 0f 84 ed fe ff ff 8b 03 84 c0 74 08 90 8b 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 e9 d2 fe ff ff [2696963.826357] Kernel panic - not syncing: Hard LOCKUP [2696963.826358] CPU: 11 PID: 218135 Comm: oracle_218135_p Tainted: P O 4.14.35-1902.5.1.2.el7uek.x86_64 #2 [2696963.826358] Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM,MB,X8-2, BIOS 51020500 08/26/2019 [2696963.826358] Call Trace: [2696963.826358] [2696963.826359] dump_stack+0x63/0x7f [2696963.826359] panic+0xe8/0x24d [2696963.826359] nmi_panic+0x3f/0x3f [2696963.826359] watchdog_overflow_callback+0xed/0x117 [2696963.826359] __perf_event_overflow+0x57/0xe3 [2696963.826360] perf_event_overflow+0x14/0x16 [2696963.826360] intel_pmu_handle_irq+0x216/0x4a5 [2696963.826360] ? __set_pte_vaddr+0x37/0x4b [2696963.826360] ? set_pte_vaddr_p4d+0x25/0x2a [2696963.826360] ? set_pte_vaddr+0x36/0x58 [2696963.826361] ? __native_set_fixmap+0x28/0x33 [2696963.826361] ? native_set_fixmap+0x35/0x5f [2696963.826361] ? ghes_copy_tofrom_phys+0xeb/0x19c [2696963.826361] ? ghes_read_estatus+0x98/0x16b [2696963.826362] perf_event_nmi_handler+0x33/0x4c [2696963.826362] nmi_handle+0x68/0x127 [2696963.826362] default_do_nmi+0x46/0x128 [2696963.826362] do_nmi+0xf6/0x15e [2696963.826363] end_repeat_nmi+0x16/0x0 [2696963.826363] RIP: 0010:native_queued_spin_lock_slowpath+0x155/0x1f8 [2696963.826363] RSP: 0018:ffff9e66408c3dd0 EFLAGS: 00000002 [2696963.826363] RAX: 0000000000000101 RBX: ffff9e7f9e958d10 RCX: 0000000000000001 [2696963.826364] RDX: 0000000000000101 RSI: 0000000000000101 RDI: ffff9e7f9e958d10 [2696963.826364] RBP: ffff9e66408c3de8 R08: 0000000000000000 R09: ffff9e3aed816e70 [2696963.826364] R10: 0000000000000001 R11: ffff9e2e11a1c800 R12: 0000000000000001 [2696963.826365] R13: 0000000000000000 R14: ffff9e6b55223400 R15: ffff9e6b55223418 [2696963.826365] ? native_queued_spin_lock_slowpath+0x155/0x1f8 [2696963.826365] ? native_queued_spin_lock_slowpath+0x155/0x1f8 [2696963.826365] [2696963.826365] [2696963.826366] queued_spin_lock_slowpath+0xb/0xf [2696963.826366] _raw_spin_lock+0x2e/0x32 [2696963.826366] rds_send_remove_from_sock+0xb9/0x420 [rds] [2696963.826366] rds_send_path_drop_acked+0x138/0x160 [rds] [2696963.826366] rds_send_drop_acked+0x1f/0x30 [rds] [2696963.826367] rds_ib_rx+0x1ca/0x220 [rds_rdma] [2696963.826367] rds_ib_tasklet_fn_recv+0x30/0x40 [rds_rdma] [2696963.826367] tasklet_action+0x10e/0x117 [2696963.826367] __do_softirq+0xd9/0x28d [2696963.826367] do_softirq_own_stack+0x2a/0x35 [2696963.826368] [2696963.826368] do_softirq+0x4d/0x6a [2696963.826368] __local_bh_enable_ip+0x57/0x59 [2696963.826368] _raw_spin_unlock_bh+0x23/0x25 [2696963.826368] rds_add_buf_info+0x3ac/0x830 [rds] [2696963.826369] rds_sendmsg+0x3b6/0xda0 [rds] [2696963.826369] ? rw_copy_check_uvector+0x6b/0x138 [2696963.826369] ? copy_msghdr_from_user+0xdb/0x135 [2696963.826369] sock_sendmsg+0x3e/0x4a [2696963.826369] ___sys_sendmsg+0x2b5/0x2c6 [2696963.826370] ? huge_pmd_share+0x2b9/0x4a7 [2696963.826370] ? huge_pte_alloc+0xab/0x145 [2696963.826370] ? mutex_lock+0x12/0x3a [2696963.826370] ? audit_filter_syscall+0x33/0xce [2696963.826370] __sys_sendmsg+0x54/0x8d [2696963.826371] SyS_sendmsg+0x12/0x1c [2696963.826371] do_syscall_64+0x79/0x1b9 [2696963.826371] entry_SYSCALL_64_after_hwframe+0x169/0x0 [2696963.826371] RIP: 0033:0x7f938ed4cca0 [2696963.826371] RSP: 002b:00007fff64b54d18 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [2696963.826372] RAX: ffffffffffffffda RBX: 00007f9392b33690 RCX: 00007f938ed4cca0
Note:
RIP: 0010:native_queued_spin_lock_slowpath+0x155/0x1f8 系统中断在这里,rds访问存储盘的过程中出现的异常,oracle_218135_p进程执行的时候触发的,cpu11被锁死了,造成系统panic.
UEK5: Kernel Panics with CPU Stall at rds_send_remove_from_sock (Doc ID 2637173.1)
called rhashtable_lookup_fast(). This function at the end called local_bh_enable() which caused the softirq() to be called.It went upto rds_send_remove_from_sock() function which tried to lock the same lock:’rds_snd_lock’. This resulted in a deadlock situation.
算是一个未公开unpublished internal Bug 30734590.
升级内核kernel-uek-4.14.35-1902.11.0.el17 or later. Ksplice available in uptrack-updates 20200219 and later
以下参数可以尝试增加,可能改善。
1. In the DB Node, edit the file “/etc/modprobe.d/exadata.conf” and add the line
options ib_ipoib send_queue_size=2048 recv_queue_size=2048 at the end.
2. Reboot the node and make sure the values are getting reflected as shown below.
对不起,这篇文章暂时关闭评论。