The test environment runs a derivative distribution based on openEuler 20.03 LTS. On a physical-machine cluster with many sockets under high concurrency, the machines frequently panic with a null-pointer dereference in the network stack.
The call trace is as follows:
[ 2117.610371] CPU: 61 PID: 12946 Comm: gaussdb Not tainted 4.19.90 #5
[ 2117.616609] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 1.25 01/17/2020
[ 2117.624748] pstate: 80400009 (Nzcv daif +PAN -UAO)
[ 2117.629525] pc : inet_sock_destruct+0x38/0x1c0
[ 2117.633959] lr : __sk_destruct+0x2c/0x210
[ 2117.637952] sp : ffff817ffff9e620
[ 2117.641251] x29: ffff817ffff9e620 x28: ffff817f81326640
[ 2117.646539] x27: ffffa0ffcefdaac0 x26: ffff80ffc95040a0
[ 2117.651827] x25: 0000000000000042 x24: 0000000000000001
[ 2117.657114] x23: ffffa0ffcf4c0678 x22: 0000000000000eb9
[ 2117.662402] x21: ffff80b8cf7dd9e8 x20: ffff80b8cf7ddb00
[ 2117.667689] x19: ffff80b8cf7ddbc8 x18: 0000000000000000
[ 2117.672976] x17: 0000000000000000 x16: 0000000000000000
[ 2117.678263] x15: 0000000000000000 x14: 0000000000000000
[ 2117.683551] x13: 0000000000000000 x12: 0000000000000000
[ 2117.688837] x11: 0000000000000000 x10: 0000000000000040
[ 2117.694124] x9 : 000000001bef64d2 x8 : ffff80ffa00b76a0
[ 2117.699413] x7 : 0000000000000000 x6 : 0000000000000002
[ 2117.704701] x5 : 0000000000000000 x4 : 0000000000000020
[ 2117.709988] x3 : 0000000000000000 x2 : 0000000000000000
[ 2117.715276] x1 : ffffa0b65d26a800 x0 : ffffa0b65d26a800
[ 2117.720568] Call trace:
[ 2117.723008] inet_sock_destruct+0x38/0x1c0
[ 2117.727086] __sk_destruct+0x2c/0x210
[ 2117.730734] sk_destruct+0x48/0x60
[ 2117.734121] __sk_free+0x38/0xd0
[ 2117.737337] __sock_wfree+0x30/0x40
[ 2117.740816] skb_release_head_state+0x5c/0xf8
[ 2117.745155] skb_release_all+0x14/0x30
[ 2117.748889] consume_skb+0x2c/0x58
[ 2117.752280] __dev_kfree_skb_any+0x3c/0x48
[ 2117.756394] hinic_tx_poll+0x1a8/0x2d0 [hinic]
[ 2117.760822] hinic_poll+0x3c/0x180 [hinic]
[ 2117.764902] net_rx_action+0x16c/0x320
[ 2117.768636] __do_softirq+0x114/0x22c
[ 2117.772286] irq_exit+0x120/0x128
[ 2117.775592] __handle_domain_irq+0x60/0xb0
[ 2117.779669] gic_handle_irq+0x94/0x1b8
[ 2117.783401] el0_irq_naked+0x50/0x58
[ 2117.786962] Code: 51000442 b9001262 a9400823 a9007c3f (f9000462)
[ 2117.793082] ---[ end trace 67549940934fb05f ]---
[ 2117.797679] Kernel panic - not syncing: Fatal exception in interrupt
[ 2117.804051] SMP: stopping secondary CPUs
[ 2117.807975] Kernel Offset: disabled
[ 2117.811449] CPU features: 0x12,a2200a18
[ 2117.815267] Memory Limit: none
[ 2117.818355] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Kunpeng 920 + a certain database + openEuler 20.03 LTS
During X's testing with database Y, crashes occurred probabilistically with no clear trigger, while another 10-server cluster ran for tens of days without any problem. Further investigation found that once the CPUs were driven to 100% during the database warm-up phase, the problem reproduced almost every time within 30 minutes.
From the analysis above we suspected the packet had already been freed once, so we applied the debug patch below, which saves the stack of the previous skb-free call into the skb's cb field, and regenerated a vmcore:
--- net/core/skbuff.c.orig 2020-11-21 11:01:44.569912940 +0800
+++ net/core/skbuff.c 2020-11-21 11:14:20.009912940 +0800
@@ -607,6 +607,17 @@ fastpath:
kmem_cache_free(skbuff_fclone_cache, fclones);
}
+static void __save_stack_trace(unsigned long *trace)
+{
+ struct stack_trace stack_trace;
+
+ stack_trace.max_entries = 5;
+ stack_trace.nr_entries = 0;
+ stack_trace.entries = trace;
+ stack_trace.skip = 1;
+ save_stack_trace(&stack_trace);
+}
+
void skb_release_head_state(struct sk_buff *skb)
{
skb_dst_drop(skb);
@@ -615,6 +626,7 @@ void skb_release_head_state(struct sk_bu
WARN_ON(in_irq());
skb->destructor(skb);
}
+ __save_stack_trace((unsigned long *)skb->cb);
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
nf_conntrack_put(skb_nfct(skb));
#endif
Analyze the vmcore generated with the debug patch applied (path: )
In the new vmcore, locate the sk_buff's cb (control block) field
Read the contents of cb
The call path of the stack saved in cb is:
tcp_done->inet_csk_destroy_sock->sk_stream_kill_queues->kfree_skb->skb_release_all
Meanwhile another vmcore was generated; its stack shows a null-pointer dereference inside sk_stream_kill_queues.
Disassemble the sk_stream_kill_queues function
Inspect the skb held in register x2
From the vmcore analysis above, two paths are operating on the same sock's receive queue concurrently:
cpu0
--------------------------------------------------
net_rx_action
-->mlx5e_napi_poll
-->mlx5e_poll_rx_cq
-->mlx5_handle_rx_cqe
-->napi_gro_receive
-->netif_receive_skb_internal
-->__netif_receive_skb
-->__netif_receive_skb_one_core
-->ip_rcv
-->ip_rcv_finish
-->ip_local_deliver
-->ip_local_deliver_finish
-->tcp_v4_rcv
-->tcp_v4_do_rcv
-->tcp_rcv_state_process
-->tcp_data_queue
-->tcp_fin
-->tcp_time_wait
-->tcp_done
-->inet_csk_destroy_sock
-->sk_stream_kill_queues
// free the queued packets
-->__skb_queue_purge(&sk_receive_queue)
cpu1
--------------------------------------------------
net_rx_action
-->mlx5e_napi_poll
-->mlx5e_poll_tx_cq
-->napi_consume_skb
-->skb_release_all
-->skb_release_head_state
-->skb->destructor
-->__sock_wfree
-->if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc))
-->__sk_free
-->sk_destruct
-->inet_sock_destruct
-->__skb_queue_purge(&sk->sk_receive_queue)
// cpu0, handling the FIN, has already emptied sk_receive_queue; cpu1, while freeing the transmitted ACK, tears down the same sock and purges the receive queue again — a thread-synchronization problem
// Root cause: the frees performed on cpu0 were not yet visible to cpu1, so cpu1 read the receive queue's pre-purge data
This was confirmed with the NIC vendor's FAE engineers: when the NIC hashes a single flow, tx and rx processing can land on different CPU cores, and under concurrency the scenario above arises.
When the NIC hashes one TCP flow, rx and tx completions are processed on different CPUs. cpu0, handling the received FIN, calls __skb_queue_purge to empty the sock's receive queue; at the same time cpu1, reclaiming the transmitted ACK, also reaches __skb_queue_purge on the same queue. __sock_wfree on cpu1's path has no memory barrier, so a memory-ordering problem lets cpu1 observe the receive queue's contents from before the purge, leading to a double free.
void __sock_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc)) {
smp_rmb(); // memory barrier added to verify the fix
__sk_free(sk);
}
}
Checking upstream patches
Investigation found that upstream v5.1 added a memory barrier inside refcount_sub_and_test; it was introduced while fixing a different problem. We tested the latest 5.10 mainline for 12+ hours without reproducing the issue.
commit 47b8f3ab9c49daa824af848f9e02889662d8638f
Author: Elena Reshetova <elena.reshetova@intel.com>
Date: Wed Jan 30 13:18:51 2019 +0200
refcount_t: Add ACQUIRE ordering on success for dec(sub)_and_test() variants
This adds an smp_acquire__after_ctrl_dep() barrier on successful
decrease of refcounter value from 1 to 0 for refcount_dec(sub)_and_test
variants and therefore gives stronger memory ordering guarantees than
prior versions of these functions.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: dvyukov@google.com
Cc: keescook@chromium.org
Cc: stern@rowland.harvard.edu
Link: https://lkml.kernel.org/r/1548847131-27854-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff --git a/lib/refcount.c b/lib/refcount.c
index ebcf8cd49e05..6e904af0fb3e 100644
--- a/lib/refcount.c
+++ b/lib/refcount.c
@@ -33,6 +33,9 @@
* Note that the allocator is responsible for ordering things between free()
* and alloc().
*
+ * The decrements dec_and_test() and sub_and_test() also provide acquire
+ * ordering on success.
+ *
*/
#include <linux/mutex.h>
@@ -164,8 +167,8 @@ EXPORT_SYMBOL(refcount_inc_checked);
* at UINT_MAX.
*
* Provides release memory ordering, such that prior loads and stores are done
- * before, and provides a control dependency such that free() must come after.
- * See the comment on top.
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
*
* Use of this function is not recommended for the normal reference counting
* use case in which references are taken and released one at a time. In these
@@ -190,7 +193,12 @@ bool refcount_sub_and_test_checked(unsigned int i, refcount_t *r)
} while (!atomic_try_cmpxchg_release(&r->refs, &val, new));
- return !new;
+ if (!new) {
+ smp_acquire__after_ctrl_dep();
+ return true;
+ }
+ return false;
+
}
EXPORT_SYMBOL(refcount_sub_and_test_checked);
Strictly speaking, this is not a classic race-condition bug; rather, the cached data had not been flushed to memory, so a thread running later read incorrect data.
+1