首页 » Exadata, ORACLE 9i-23ai » Troubleshooting Exadata X8 machine node reboot frequently rds_send_remove_from_sock

Troubleshooting Exadata X8 machine node reboot frequently rds_send_remove_from_sock

最近有个客户的Oracle Exadata x8 数据库主机操作系统总是频繁的重启,重启前在DB,CRS层没有任何错误信息, 当时的OS负载也比较低,仅从exawatcher的mpstat能发现在重启前15s左右部分CPU core sys使用达100%。 操作系统有配置kdump生成了dump信息,发现在CPU在等待Watchdog detected hard LOCKUP on cpu 11, 堆栈调用中包含rds_send_remove_from_sock,简单记录。

查找kdump生成的vmcore

# egrep '^path' /etc/kdump.conf
path /var/crash

Note:
注意检查路径文件系统可用空间要大于物理内存大小。也可以配置文件压缩使用 -c 参数。

core_collector makedumpfile -c

可以使用crash工具分析vmcore文件

[2646529.732110] megaraid_sas 0000:65:00.0: Application firmware crash dump mode set success
[2646530.281897] megaraid_sas 0000:65:00.0: Application firmware crash dump mode set success
[2696963.826326] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
[2696963.826327] Modules linked in: oracleacfs(PO) oracleadvm(PO) oracleoks(PO) ipmi_poweroff scsi_transport_iscsi ipmi_ssif bonding ib_umad mlx4_en mlx4_ib vfat fat skx_edac intel_powerclamp coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel mlx4_core crypto_simd glue_helper ipmi_si cryptd ipmi_devintf ipmi_msghandler pcc_cpufreq ioatdma i2c_i801 lpc_ich wmi shpchp rds_rdma resilient_rdmaip dm_multipath ib_ipoib rds rdma_ucm rdma_cm iw_cm ib_cm ib_uverbs binfmt_misc sg ib_core fuse ip_tables ext4 mbcache jbd2 fscrypto sd_mod igb ahci libahci megaraid_sas crc32c_intel bnxt_en libata ptp pps_core i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[2696963.826338] CPU: 11 PID: 218135 Comm: oracle_218135_p Tainted: P           O    4.14.35-1902.5.1.2.el7uek.x86_64 #2
[2696963.826339] Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM,MB,X8-2, BIOS 51020500 08/26/2019
[2696963.826339] task: ffff9e6f50968000 task.stack: ffffbd275b750000
[2696963.826339] RIP: 0010:native_queued_spin_lock_slowpath+0x155/0x1f8
[2696963.826339] RSP: 0018:ffff9e66408c3dd0 EFLAGS: 00000002
[2696963.826340] RAX: 0000000000000101 RBX: ffff9e7f9e958d10 RCX: 0000000000000001
[2696963.826340] RDX: 0000000000000101 RSI: 0000000000000101 RDI: ffff9e7f9e958d10
[2696963.826340] RBP: ffff9e66408c3de8 R08: 0000000000000000 R09: ffff9e3aed816e70
[2696963.826341] R10: 0000000000000001 R11: ffff9e2e11a1c800 R12: 0000000000000001
[2696963.826341] R13: 0000000000000000 R14: ffff9e6b55223400 R15: ffff9e6b55223418
[2696963.826341] FS:  00007f9392b652c0(0000) GS:ffff9e66408c0000(0000) knlGS:0000000000000000
[2696963.826341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2696963.826342] CR2: 000000247bff3118 CR3: 0000004c2a31a001 CR4: 00000000007606e0
[2696963.826342] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2696963.826342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2696963.826342] PKRU: 55555554
[2696963.826342] Call Trace:
[2696963.826343]  
[2696963.826343]  queued_spin_lock_slowpath+0xb/0xf
[2696963.826343]  _raw_spin_lock+0x2e/0x32
[2696963.826343]  rds_send_remove_from_sock+0xb9/0x420 [rds]
[2696963.826344]  rds_send_path_drop_acked+0x138/0x160 [rds]
[2696963.826344]  rds_send_drop_acked+0x1f/0x30 [rds]
[2696963.826344]  rds_ib_rx+0x1ca/0x220 [rds_rdma]
[2696963.826344]  rds_ib_tasklet_fn_recv+0x30/0x40 [rds_rdma]
[2696963.826344]  tasklet_action+0x10e/0x117
[2696963.826345]  __do_softirq+0xd9/0x28d
[2696963.826345]  do_softirq_own_stack+0x2a/0x35
[2696963.826345]  
[2696963.826345]  do_softirq+0x4d/0x6a
[2696963.826345]  __local_bh_enable_ip+0x57/0x59
[2696963.826346]  _raw_spin_unlock_bh+0x23/0x25
[2696963.826346]  rds_add_buf_info+0x3ac/0x830 [rds]
[2696963.826346]  rds_sendmsg+0x3b6/0xda0 [rds]
[2696963.826346]  ? rw_copy_check_uvector+0x6b/0x138
[2696963.826346]  ? copy_msghdr_from_user+0xdb/0x135
[2696963.826347]  sock_sendmsg+0x3e/0x4a
[2696963.826347]  ___sys_sendmsg+0x2b5/0x2c6
[2696963.826347]  ? huge_pmd_share+0x2b9/0x4a7
[2696963.826347]  ? huge_pte_alloc+0xab/0x145
[2696963.826347]  ? mutex_lock+0x12/0x3a
[2696963.826348]  ? audit_filter_syscall+0x33/0xce
[2696963.826348]  __sys_sendmsg+0x54/0x8d
[2696963.826348]  SyS_sendmsg+0x12/0x1c
[2696963.826348]  do_syscall_64+0x79/0x1b9
[2696963.826348]  entry_SYSCALL_64_after_hwframe+0x169/0x0
[2696963.826349] RIP: 0033:0x7f938ed4cca0
[2696963.826349] RSP: 002b:00007fff64b54d18 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[2696963.826349] RAX: ffffffffffffffda RBX: 00007f9392b33690 RCX: 00007f938ed4cca0
[2696963.826350] RDX: 0000000000000000 RSI: 00007fff64b54d20 RDI: 000000000000000f
[2696963.826350] RBP: 00007fff64b54db0 R08: 00007fff64b54ed8 R09: 0000000000000003
[2696963.826350] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
[2696963.826350] R13: 00000000126fdc58 R14: 00007f9392aa9050 R15: 00000000126f31f8
[2696963.826351] Code: c0 74 3f 41 0f 0d 08 eb 39 f3 90 44 8b 23 41 81 fc 00 01 00 00 74 f2 e9 1b ff ff ff 83 fa 01 0f 84 ed fe ff ff 8b 03 84 c0 74 08  90 8b 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 e9 d2 fe ff ff 
[2696963.826357] Kernel panic - not syncing: Hard LOCKUP
[2696963.826358] CPU: 11 PID: 218135 Comm: oracle_218135_p Tainted: P           O    4.14.35-1902.5.1.2.el7uek.x86_64 #2
[2696963.826358] Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM,MB,X8-2, BIOS 51020500 08/26/2019
[2696963.826358] Call Trace:
[2696963.826358]  
[2696963.826359]  dump_stack+0x63/0x7f
[2696963.826359]  panic+0xe8/0x24d
[2696963.826359]  nmi_panic+0x3f/0x3f
[2696963.826359]  watchdog_overflow_callback+0xed/0x117
[2696963.826359]  __perf_event_overflow+0x57/0xe3
[2696963.826360]  perf_event_overflow+0x14/0x16
[2696963.826360]  intel_pmu_handle_irq+0x216/0x4a5
[2696963.826360]  ? __set_pte_vaddr+0x37/0x4b
[2696963.826360]  ? set_pte_vaddr_p4d+0x25/0x2a
[2696963.826360]  ? set_pte_vaddr+0x36/0x58
[2696963.826361]  ? __native_set_fixmap+0x28/0x33
[2696963.826361]  ? native_set_fixmap+0x35/0x5f
[2696963.826361]  ? ghes_copy_tofrom_phys+0xeb/0x19c
[2696963.826361]  ? ghes_read_estatus+0x98/0x16b
[2696963.826362]  perf_event_nmi_handler+0x33/0x4c
[2696963.826362]  nmi_handle+0x68/0x127
[2696963.826362]  default_do_nmi+0x46/0x128
[2696963.826362]  do_nmi+0xf6/0x15e
[2696963.826363]  end_repeat_nmi+0x16/0x0
[2696963.826363] RIP: 0010:native_queued_spin_lock_slowpath+0x155/0x1f8
[2696963.826363] RSP: 0018:ffff9e66408c3dd0 EFLAGS: 00000002
[2696963.826363] RAX: 0000000000000101 RBX: ffff9e7f9e958d10 RCX: 0000000000000001
[2696963.826364] RDX: 0000000000000101 RSI: 0000000000000101 RDI: ffff9e7f9e958d10
[2696963.826364] RBP: ffff9e66408c3de8 R08: 0000000000000000 R09: ffff9e3aed816e70
[2696963.826364] R10: 0000000000000001 R11: ffff9e2e11a1c800 R12: 0000000000000001
[2696963.826365] R13: 0000000000000000 R14: ffff9e6b55223400 R15: ffff9e6b55223418
[2696963.826365]  ? native_queued_spin_lock_slowpath+0x155/0x1f8
[2696963.826365]  ? native_queued_spin_lock_slowpath+0x155/0x1f8
[2696963.826365]  
[2696963.826365]  
[2696963.826366]  queued_spin_lock_slowpath+0xb/0xf
[2696963.826366]  _raw_spin_lock+0x2e/0x32
[2696963.826366]  rds_send_remove_from_sock+0xb9/0x420 [rds]
[2696963.826366]  rds_send_path_drop_acked+0x138/0x160 [rds]
[2696963.826366]  rds_send_drop_acked+0x1f/0x30 [rds]
[2696963.826367]  rds_ib_rx+0x1ca/0x220 [rds_rdma]
[2696963.826367]  rds_ib_tasklet_fn_recv+0x30/0x40 [rds_rdma]
[2696963.826367]  tasklet_action+0x10e/0x117
[2696963.826367]  __do_softirq+0xd9/0x28d
[2696963.826367]  do_softirq_own_stack+0x2a/0x35
[2696963.826368]  
[2696963.826368]  do_softirq+0x4d/0x6a
[2696963.826368]  __local_bh_enable_ip+0x57/0x59
[2696963.826368]  _raw_spin_unlock_bh+0x23/0x25
[2696963.826368]  rds_add_buf_info+0x3ac/0x830 [rds]
[2696963.826369]  rds_sendmsg+0x3b6/0xda0 [rds]
[2696963.826369]  ? rw_copy_check_uvector+0x6b/0x138
[2696963.826369]  ? copy_msghdr_from_user+0xdb/0x135
[2696963.826369]  sock_sendmsg+0x3e/0x4a
[2696963.826369]  ___sys_sendmsg+0x2b5/0x2c6
[2696963.826370]  ? huge_pmd_share+0x2b9/0x4a7
[2696963.826370]  ? huge_pte_alloc+0xab/0x145
[2696963.826370]  ? mutex_lock+0x12/0x3a
[2696963.826370]  ? audit_filter_syscall+0x33/0xce
[2696963.826370]  __sys_sendmsg+0x54/0x8d
[2696963.826371]  SyS_sendmsg+0x12/0x1c
[2696963.826371]  do_syscall_64+0x79/0x1b9
[2696963.826371]  entry_SYSCALL_64_after_hwframe+0x169/0x0
[2696963.826371] RIP: 0033:0x7f938ed4cca0
[2696963.826371] RSP: 002b:00007fff64b54d18 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[2696963.826372] RAX: ffffffffffffffda RBX: 00007f9392b33690 RCX: 00007f938ed4cca0

Note:
RIP: 0010:native_queued_spin_lock_slowpath+0x155/0x1f8 系统中断在这里,rds访问存储盘的过程中出现的异常,oracle_218135_p进程执行的时候触发的,cpu11被锁死了,造成系统panic.

UEK5: Kernel Panics with CPU Stall at rds_send_remove_from_sock (Doc ID 2637173.1)
called rhashtable_lookup_fast(). This function at the end called local_bh_enable() which caused the softirq() to be called.It went upto rds_send_remove_from_sock() function which tried to lock the same lock:’rds_snd_lock’. This resulted in a deadlock situation.

算是一个未公开unpublished internal Bug 30734590.

升级内核kernel-uek-4.14.35-1902.11.0.el17 or later. Ksplice available in uptrack-updates 20200219 and later

以下参数可以尝试增加,可能改善。
1. In the DB Node, edit the file “/etc/modprobe.d/exadata.conf” and add the line
options ib_ipoib send_queue_size=2048 recv_queue_size=2048 at the end.
2. Reboot the node and make sure the values are getting reflected as shown below.

打赏

, ,

对不起,这篇文章暂时关闭评论。