20.03 LTS: kernel panic with a NULL pointer dereference in the network stack on multi-socket, high-concurrency physical server clusters

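The test environment runs a derivative distribution built on 20.03 LTS. In a multi-socket, high-concurrency physical server cluster, the machines panic frequently, and the backtrace reports a NULL pointer dereference in the network stack.
The call trace is as follows: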
[ 2117.610371] CPU: 61 PID: 12946 Comm: gaussdb Not tainted 4.19.90 #5
[ 2117.616609] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 1.25 01/17/2020
[ 2117.624748] pstate: 80400009 (Nzcv daif +PAN -UAO)
[ 2117.629525] pc : inet_sock_destruct+0x38/0x1c0
[ 2117.633959] lr : __sk_destruct+0x2c/0x210
[ 2117.637952] sp : ffff817ffff9e620
[ 2117.641251] x29: ffff817ffff9e620 x28: ffff817f81326640
[ 2117.646539] x27: ffffa0ffcefdaac0 x26: ffff80ffc95040a0
[ 2117.651827] x25: 0000000000000042 x24: 0000000000000001
[ 2117.657114] x23: ffffa0ffcf4c0678 x22: 0000000000000eb9
[ 2117.662402] x21: ffff80b8cf7dd9e8 x20: ffff80b8cf7ddb00
[ 2117.667689] x19: ffff80b8cf7ddbc8 x18: 0000000000000000
[ 2117.672976] x17: 0000000000000000 x16: 0000000000000000
[ 2117.678263] x15: 0000000000000000 x14: 0000000000000000
[ 2117.683551] x13: 0000000000000000 x12: 0000000000000000
[ 2117.688837] x11: 0000000000000000 x10: 0000000000000040
[ 2117.694124] x9 : 000000001bef64d2 x8 : ffff80ffa00b76a0
[ 2117.699413] x7 : 0000000000000000 x6 : 0000000000000002
[ 2117.704701] x5 : 0000000000000000 x4 : 0000000000000020
[ 2117.709988] x3 : 0000000000000000 x2 : 0000000000000000
[ 2117.715276] x1 : ffffa0b65d26a800 x0 : ffffa0b65d26a800
[ 2117.720568] Call trace:
[ 2117.723008] inet_sock_destruct+0x38/0x1c0
[ 2117.727086] __sk_destruct+0x2c/0x210
[ 2117.730734] sk_destruct+0x48/0x60
[ 2117.734121] __sk_free+0x38/0xd0
[ 2117.737337] __sock_wfree+0x30/0x40
[ 2117.740816] skb_release_head_state+0x5c/0xf8
[ 2117.745155] skb_release_all+0x14/0x30
[ 2117.748889] consume_skb+0x2c/0x58
[ 2117.752280] __dev_kfree_skb_any+0x3c/0x48
[ 2117.756394] hinic_tx_poll+0x1a8/0x2d0 [hinic]
[ 2117.760822] hinic_poll+0x3c/0x180 [hinic]
[ 2117.764902] net_rx_action+0x16c/0x320
[ 2117.768636] __do_softirq+0x114/0x22c
[ 2117.772286] irq_exit+0x120/0x128
[ 2117.775592] __handle_domain_irq+0x60/0xb0
[ 2117.779669] gic_handle_irq+0x94/0x1b8
[ 2117.783401] el0_irq_naked+0x50/0x58
[ 2117.786962] Code: 51000442 b9001262 a9400823 a9007c3f (f9000462)
[ 2117.793082] ---[ end trace 67549940934fb05f ]---
[ 2117.797679] Kernel panic - not syncing: Fatal exception in interrupt
[ 2117.804051] SMP: stopping secondary CPUs
[ 2117.807975] Kernel Offset: disabled
[ 2117.811449] CPU features: 0x12,a2200a18
[ 2117.815267] Memory Limit: none
[ 2117.818355] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

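1 Server configuration

Kunpeng 920 + a database product + openEuler 20.03 LTS

2 Problem description:

While team X was testing and debugging with database Y, machines crashed intermittently with no identifiable trigger, whereas another cluster of 10 servers ran normally for several tens of days without any problem. Further investigation showed that once the CPUs are driven to 100% utilization during the database warm-up phase, the crash occurs within 30 minutes and is reproducible almost every time.

3 Problem analysis:

vmcore analysis (path: 172.20.1.16:/root/vmcore_log/vmcore_first)

The basic vmcore information shows a NULL pointer dereference, see Figure 1.

The stack trace is shown in Figure 2.

Disassembly locates the faulting instruction, see Figure 3.

The disassembly shows that the faulting instruction stores x2 to the address x3 + 8, and the access faults because x3 is NULL (x3 is 0 in the register dump above). x3 is a pointer to struct sk_buff; for the field at that offset in sk_buff, see Figure 4.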
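
The faulting instruction can also be cross-checked from the oops itself: the "Code:" line marks the faulting word in parentheses, f9000462. The sketch below is a minimal userspace decoder for one AArch64 encoding only, the 64-bit STR (immediate, unsigned offset) form. The struct and function names are made up for illustration and are not part of any kernel or toolchain API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: fields of a decoded "str xT, [xN, #off]". */
struct str_uimm {
    unsigned rt;      /* register being stored */
    unsigned rn;      /* base address register */
    unsigned offset;  /* byte offset added to the base */
};

/*
 * Decode a 64-bit STR (immediate, unsigned offset) instruction word.
 * Bits 31:22 must be 1111100100; imm12 is scaled by 8 for 64-bit stores.
 * Returns 0 on success, -1 if the word is not this encoding.
 */
static int decode_str_x_uimm(uint32_t insn, struct str_uimm *out)
{
    if ((insn & 0xffc00000u) != 0xf9000000u)
        return -1;
    out->rt = insn & 0x1f;
    out->rn = (insn >> 5) & 0x1f;
    out->offset = ((insn >> 10) & 0xfff) * 8;
    return 0;
}
```

Decoding 0xf9000462 this way gives rt = 2, rn = 3, offset = 8, that is `str x2, [x3, #8]`. With x3 equal to 0 in the register dump, the store targets address 8.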

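In Figure 1, x3 is the register that holds the sk_buff pointer; as Figure 5 shows, it is NULL here.

Corresponding kernel source location (Linux 4.19): see Figure 6.

Continuing with the disassembly of inet_sock_destruct, see Figure 7.

In the disassembly above, x19 holds the address of sk->sk_receive_queue. Parsing that data structure shows that qlen is 0, which means the contents of the receive queue have already been freed.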
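
To make the failure mode concrete without the figures, here is a simplified userspace model of the doubly linked queue that sk_buff_head implements. It follows the same order of operations as the unlink step that __skb_dequeue performs in Linux 4.19 (decrement qlen, load next/prev, clear the skb's links, then write next->prev). All type and function names are hypothetical, and the NULL check is added only so the model reports the problem instead of faulting.

```c
#include <assert.h>
#include <stddef.h>

/* Link fields shared by the queue head and every queued buffer. */
struct model_node {
    struct model_node *next;  /* offset 0 */
    struct model_node *prev;  /* offset sizeof(void *), i.e. 8 on arm64 */
};

struct model_head {
    struct model_node node;
    unsigned int qlen;        /* offset 16 on arm64 */
};

static void model_queue_init(struct model_head *q)
{
    q->node.next = q->node.prev = &q->node;
    q->qlen = 0;
}

static void model_queue_tail(struct model_head *q, struct model_node *n)
{
    n->next = &q->node;
    n->prev = q->node.prev;
    q->node.prev->next = n;
    q->node.prev = n;
    q->qlen++;
}

/* Returns 0 on success, -1 where the kernel code would fault. */
static int model_unlink(struct model_node *n, struct model_head *q)
{
    struct model_node *next, *prev;

    q->qlen--;
    next = n->next;
    prev = n->prev;
    n->next = n->prev = NULL;
    if (next == NULL)
        return -1;            /* kernel: next->prev = prev writes to address 8 */
    next->prev = prev;
    prev->next = next;
    return 0;
}
```

Unlinking the same buffer a second time finds next == NULL, which matches the oops: the buffer had already been taken off the queue once.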

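Summary: from this vmcore analysis we can conclude that the queue may have been freed twice, but the location of the first free cannot yet be identified.

Adding debug information

Because the analysis above suggests that the packet had already been freed once, we applied the debug patch below, which saves the stack of the previous skb-release call into the skb's cb, and then regenerated the vmcore.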

--- net/core/skbuff.c.orig      2020-11-21 11:01:44.569912940 +0800
+++ net/core/skbuff.c   2020-11-21 11:14:20.009912940 +0800
@@ -607,6 +607,17 @@ fastpath:
        kmem_cache_free(skbuff_fclone_cache, fclones);
}

+static void __save_stack_trace(unsigned long *trace)
+{
+       struct stack_trace stack_trace;
+
+       stack_trace.max_entries = 5;
+       stack_trace.nr_entries = 0;
+       stack_trace.entries = trace;
+       stack_trace.skip = 1;
+       save_stack_trace(&stack_trace);
+}
+
void skb_release_head_state(struct sk_buff *skb)
{
        skb_dst_drop(skb);
@@ -615,6 +626,7 @@ void skb_release_head_state(struct sk_bu
                WARN_ON(in_irq());
                skb->destructor(skb);
        }
+       __save_stack_trace((unsigned long *)skb->cb);
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
        nf_conntrack_put(skb_nfct(skb));
#endif
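
A note on why five entries is a safe limit: in Linux 4.19, sk_buff's cb is a 48-byte scratch area, so five 8-byte return addresses (40 bytes) fit. Protocol layers normally use cb for their own state, so overwriting it is only reasonable here because the skb is on its way to being freed. The sketch below is a userspace model of that bookkeeping with hypothetical names; it is not the kernel's stack-trace API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_CB_SIZE     48  /* size of sk_buff's cb in Linux 4.19 */
#define MODEL_MAX_ENTRIES 5   /* same limit as the debug patch */

struct model_skb {
    _Alignas(8) char cb[MODEL_CB_SIZE];
};

/* Compile-time check that the saved trace cannot overflow cb. */
_Static_assert(MODEL_MAX_ENTRIES * sizeof(unsigned long) <= MODEL_CB_SIZE,
               "trace does not fit in cb");

/* Copy at most MODEL_MAX_ENTRIES addresses into cb; returns how many were kept. */
static size_t model_save_trace(struct model_skb *skb,
                               const unsigned long *entries, size_t n)
{
    size_t kept = n < (size_t)MODEL_MAX_ENTRIES ? n : (size_t)MODEL_MAX_ENTRIES;

    memset(skb->cb, 0, sizeof(skb->cb));
    memcpy(skb->cb, entries, kept * sizeof(entries[0]));
    return kept;
}

/* Read back entry idx, as one would when inspecting cb in the vmcore. */
static unsigned long model_read_trace(const struct model_skb *skb, size_t idx)
{
    unsigned long v;

    memcpy(&v, skb->cb + idx * sizeof(v), sizeof(v));
    return v;
}
```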

Analysis of the vmcore produced with the debug patch (path: )
In the new vmcore, locate the cb (control block) field of the sk_buff.

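Read the contents of cb.

The call path saved in cb is:
tcp_done->inet_csk_destroy_sock->sk_stream_kill_queues->kfree_skb->skb_release_all
During the same run another vmcore was generated, in which the stack shows a NULL pointer dereference in sk_stream_kill_queues.

Disassemble the sk_stream_kill_queues function.

Inspect the skb held in the x2 register.

The vmcore analysis above shows that two paths operate on the receive queue of the same sock at the same time.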

cpu0
--------------------------------------------------
net_rx_action
  -->mlx5e_napi_poll
    -->mlx5e_poll_rx_cq
      -->mlx5_handle_rx_cqe
        -->napi_gro_receive
          -->netif_receive_skb_internal
            -->__netif_receive_skb
              -->__netif_receive_skb_one_core
                -->ip_rcv
                  -->ip_rcv_finish
                    -->ip_local_deliver
                      -->ip_local_deliver_finish
                        -->tcp_v4_rcv
                          -->tcp_v4_do_rcv
                            -->tcp_rcv_state_process
                              -->tcp_data_queue
                                -->tcp_fin
                                  -->tcp_time_wait
                                    -->tcp_done
                                      -->inet_csk_destroy_sock
                                        -->sk_stream_kill_queues
                                          // free the queued packets
                                          -->__skb_queue_purge(&sk->sk_receive_queue)



cpu1
--------------------------------------------------
net_rx_action
  -->mlx5e_napi_poll
    -->mlx5e_poll_tx_cq
      -->napi_consume_skb
        -->skb_release_all
          -->skb_release_head_state
            -->skb->destructor
              -->__sock_wfree
                -->if(refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc))
                  -->__sk_free
                    -->sk_destruct
                      -->inet_sock_destruct
                        -->__skb_queue_purge(&sk->sk_receive_queue)
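                        // cpu0 has already emptied sk_receive_queue, but cpu1 still sees entries in the receive queue of the same sock and frees the packets again, which is a synchronization problem between the two paths
                        // Cause: after cpu0 frees the memory, the update has not yet been written back to memory, so cpu1 reads the earlier data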

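We confirmed with an FAE engineer from a NIC vendor that when the NIC hashes a single flow, its receive and transmit processing may run on different CPU cores, and under concurrency the scenario above can occur.

4 Root cause summary

When the NIC hashes a single TCP flow, receive and transmit processing can land on different CPUs. On cpu0, handling a received FIN calls __skb_queue_purge to empty the sock's receive queue. At the same time cpu1, while reclaiming a transmitted ACK packet, also calls __skb_queue_purge on the same receive queue. __sock_wfree on cpu1 has no memory barrier, so there is a memory-ordering problem: cpu1 observes the earlier contents of sk_receive_queue, which leads to a double free.

5 Fix

__sock_wfree needs a memory barrier before it calls __sk_free: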

void __sock_wfree(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;

        if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc)) {
                smp_rmb(); // add a memory barrier to verify that the problem is resolved
                __sk_free(sk);
        }
}

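Checking upstream patches
We found that upstream added a memory barrier to refcount_sub_and_test in 5.1; it was introduced while fixing a different problem. We tested the latest 5.10 mainline for more than 12 hours and the problem did not reproduce.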

commit 47b8f3ab9c49daa824af848f9e02889662d8638f
Author: Elena Reshetova <elena.reshetova@intel.com>
Date:   Wed Jan 30 13:18:51 2019 +0200

    refcount_t: Add ACQUIRE ordering on success for dec(sub)_and_test() variants

    This adds an smp_acquire__after_ctrl_dep() barrier on successful
    decrease of refcounter value from 1 to 0 for refcount_dec(sub)_and_test
    variants and therefore gives stronger memory ordering guarantees than
    prior versions of these functions.

    Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: dvyukov@google.com
    Cc: keescook@chromium.org
    Cc: stern@rowland.harvard.edu
    Link: https://lkml.kernel.org/r/1548847131-27854-2-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/lib/refcount.c b/lib/refcount.c
index ebcf8cd49e05..6e904af0fb3e 100644
--- a/lib/refcount.c
+++ b/lib/refcount.c
@@ -33,6 +33,9 @@
  * Note that the allocator is responsible for ordering things between free()
  * and alloc().
  *
+ * The decrements dec_and_test() and sub_and_test() also provide acquire
+ * ordering on success.
+ *
  */

 #include <linux/mutex.h>
@@ -164,8 +167,8 @@ EXPORT_SYMBOL(refcount_inc_checked);
  * at UINT_MAX.
  *
  * Provides release memory ordering, such that prior loads and stores are done
- * before, and provides a control dependency such that free() must come after.
- * See the comment on top.
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
  *
  * Use of this function is not recommended for the normal reference counting
  * use case in which references are taken and released one at a time.  In these
@@ -190,7 +193,12 @@ bool refcount_sub_and_test_checked(unsigned int i, refcount_t *r)

        } while (!atomic_try_cmpxchg_release(&r->refs, &val, new));

-       return !new;
+       if (!new) {
+               smp_acquire__after_ctrl_dep();
+               return true;
+       }
+       return false;
+
}
EXPORT_SYMBOL(refcount_sub_and_test_checked);
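
The ordering that the upstream patch adds can be modelled in userspace with C11 atomics. The sketch below is a simplified stand-in, not the kernel implementation: it uses a plain fetch-and-subtract instead of the kernel's cmpxchg loop with saturation, and all names are hypothetical. The decrement has release ordering, so earlier writes by this holder are published before the count drops, and an acquire fence runs only when the count reaches zero, so the caller that goes on to free the object sees the writes of every other holder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for refcount_t. */
struct model_refcount {
    atomic_uint refs;
};

static void model_refcount_init(struct model_refcount *r, unsigned int n)
{
    atomic_init(&r->refs, n);
}

static unsigned int model_refcount_read(struct model_refcount *r)
{
    return atomic_load_explicit(&r->refs, memory_order_relaxed);
}

/* Subtract i; return true only for the caller that drops the count to zero. */
static bool model_refcount_sub_and_test(unsigned int i, struct model_refcount *r)
{
    /* Release: this holder's earlier loads and stores are ordered before the drop. */
    unsigned int old = atomic_fetch_sub_explicit(&r->refs, i,
                                                 memory_order_release);

    if (old == i) {
        /* Acquire on success: the teardown that follows cannot be reordered
         * before the decrement, and it observes other holders' writes. */
        atomic_thread_fence(memory_order_acquire);
        return true;
    }
    return false;
}
```

In terms of this issue, the caller for which the function returns true corresponds to the path that goes on to __sk_free, and the acquire fence is what orders its reads of sk_receive_queue after the final decrement.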

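Strictly speaking, this problem is not a race condition. Rather, cached data had not been written back to memory, so the thread that ran later read incorrect data.

Already merged into 20.03 LTS: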
https://gitee.com/openeuler/kernel/commit/421d5d58c99d3c13ba3eed3869b13e1ec79a2aa3

+1
