[openEuler-20.03-SP2 long-term stability] LTP membarrier01 test triggers a NULL pointer dereference

Completed
Bug
Created at 2021-06-07 10:40

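[Environment]
Environment: ARM physical machine
OS version: 20.03-SP2-round4
Kernel: 4.19.90-2105.6.0.0090.oe1.aarch64
[Steps to reproduce]
1. Install from the ISO, selecting the minimal installation mode.
2. Run the long-term stability test cases.
[Expected result]
Execution completes without errors.
[Actual result]
After the long-term stability test cases had run for some time, the machine crashed and produced a core file.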

[31077.812327] vcan: Virtual CAN interface driver
[31077.812329] vcan: enabled echo on driver level.
[31077.839576] LTP: starting dma_thread_diotest6 (dma_thread_diotest -a 3072)
[31077.852754] LTP: starting msgrcv07
[31077.872154] LTP: starting msgrcv08
[31077.905499] LTP: starting timer_settime02
[31077.913540] LTP: starting msgsnd01
[31077.922597] LTP: starting msgsnd02
[31077.930673] LTP: starting msgsnd05
[31077.955532] LTP: starting msgsnd06
[31077.976741] EINJ: Error INJection is initialized.
[31078.002799] LTP: starting semctl01
[31078.009272] LTP: starting membarrier01
[31078.136464] Unable to handle kernel NULL pointer dereference at virtual address 000000000000002c
[31078.151187] Mem abort info:
[31078.156945] ESR = 0x96000007
[31078.162892] Exception class = DABT (current EL), IL = 32 bits
[31078.171859] SET = 0, FnV = 0
[31078.177885] EA = 0, S1PTW = 0
[31078.183930] Data abort info:
[31078.189730] ISV = 0, ISS = 0x00000007
[31078.196529] CM = 0, WnR = 0
[31078.202450] user pgtable: 64k pages, 48-bit VAs, pgdp = 0000000007cf6fb5
[31078.212388] [000000000000002c] pgd=0000005fceab0003, pud=0000005fceab0003, pmd=0000005f98410003, pte=0000000000000000
[31078.229818] Internal error: Oops: 96000007 [#1] SMP
[31078.238192] Modules linked in: vcan can_raw can authenc ccm usb_storage usbatm atm sit tunnel4 ip_tunnel nbd vhost_net tap uhid uinput vhost_vsock vmw_vsock_virtio_transport_common vhost vsock tun scsi_debug cuse nls_koi8_u nls_cp932 vfio_iommu_type1 vfio unix_diag hdma_mgmt ccp veth vrf sm4_generic cmac ansi_cprng vmac sha3_generic seed cts aes_ce_ccm fcrypt pcbc anubis khazad tea michael_mic cast5_generic blowfish_generic blowfish_common des_generic sctp ts_kmp nf_log_arp nf_log_ipv6 nf_log_ipv4 nf_log_common brd fuse overlay salsa20_generic camellia_generic cast6_generic cast_common serpent_generic twofish_generic twofish_common xts lrw tgr192 wp512 rmd320 rmd256 rmd160 rmd128 md4 sha512_generic aes_neon_blk loop jprob(OE) binfmt_misc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4
[31078.364569] xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vfat fat ipmi_ssif dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c raid1 ses enclosure aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce hns_roce_hw_v2 ofpart sha2_ce cmdlinepart sha256_arm64 hns_roce sha1_ce sg ib_core sbsa_gwdt ipmi_si hi_sfc ipmi_devintf mtd spi_dw_mmio ipmi_msghandler sch_fq_codel ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod realtek hisi_sas_v3_hw hisi_sas_main libsas ahci hclge scsi_transport_sas libahci hns3 hinic libata hnae3
[31078.517251] megaraid_sas i2c_designware_platform i2c_designware_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: einj]
[31078.545923] Process membarrier01 (pid: 3995063, stack limit = 0x00000000c5abde19)
[31078.570778] CPU: 35 PID: 3995063 Comm: membarrier01 Kdump: loaded Tainted: G WC OE 4.19.90-2105.6.0.0090.oe1.aarch64 #1
[31078.600187] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDA, BIOS 1.06 10/29/2019
[31078.605325] LTP: starting semctl02
[31078.626327] pstate: 60400009 (nZCv daif +PAN -UAO)
[31078.626334] pc : membarrier_global_expedited+0xe8/0x1a8
[31078.626337] lr : membarrier_global_expedited+0xe8/0x1a8
[31078.678926] sp : ffff00016ca6fd50
[31078.690382] x29: ffff00016ca6fd50 x28: ffff802d6005f000
[31078.703582] x27: 0000000000000000 x26: 0000000000000000
[31078.716619] x25: ffff00016ca6fda8 x24: ffff000008fcd500
[31078.729474] x23: ffff0000092d5000 x22: ffff000008fb0018
[31078.742120] x21: ffff0000092d3000 x20: ffff0000092d57f0
[31078.754541] x19: 0000000000000020 x18: 0000000000000000
[31078.757624] LTP: starting semctl03
[31078.766910] x17: 0000000000000000 x16: 0000000000000000
[31078.766912] x15: 0000000000000000 x14: 0000000000000000
[31078.766913] x13: 0000000000000000 x12: 0000000000000000
[31078.766914] x11: 0000000000000000 x10: 0000000000000000
[31078.766915] x9 : 0000000000000000 x8 : 0000000000000000
[31078.766915] x7 : 0000000000000000 x6 : ffff00016ca6fd48
[31078.766918] x5 : ffff00016ca6fd48 x4 : ffffffffffffffff
[31078.839878] LTP: starting semctl04
[31078.845022] x3 : 0000000000000000 x2 : 62512f9d263e2b00
[31078.845024] x1 : 0000000000000000 x0 : 0000000000000000
[31078.845026] Call trace:
[31078.845028] membarrier_global_expedited+0xe8/0x1a8
[31078.845031] __arm64_sys_membarrier+0xac/0x1f0
[31078.882494] LTP: starting semctl05
[31078.884100] el0_svc_common+0x78/0x178
[31078.901063] LTP: starting semctl06
[31078.909058] el0_svc_handler+0x38/0x78
[31078.909060] el0_svc+0x8/0x1f8
[31078.909063] Code: b949d401 361ffdc1 91264000 97fe707d (b9402c00)
[31078.955012] SMP: stopping secondary CPUs
[31078.964813] Starting crashdump kernel...
[31078.972460] Bye!
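
The "Mem abort info" lines in the log can be cross-checked by decoding the ESR value by hand. The following is a minimal user-space sketch, not kernel code. The helper names (esr_ec, esr_dfsc, and so on) are made up for illustration, and the bit positions are the ARMv8 ESR_ELx layout as I understand it (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, with WnR in bit 6 and DFSC in bits 5:0 for data aborts).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers: extract fields from a 32-bit ESR_ELx value. */
static uint32_t esr_ec(uint32_t esr)   { return (esr >> 26) & 0x3f; }  /* exception class */
static uint32_t esr_il(uint32_t esr)   { return (esr >> 25) & 0x1; }   /* 1 = 32-bit instruction */
static uint32_t esr_iss(uint32_t esr)  { return esr & 0x1ffffff; }     /* syndrome bits */
static uint32_t esr_wnr(uint32_t esr)  { return (esr >> 6) & 0x1; }    /* 1 = write, 0 = read */
static uint32_t esr_dfsc(uint32_t esr) { return esr & 0x3f; }          /* data fault status code */

/* Print the decoded fields in roughly the order the oops log reports them. */
void print_esr(uint32_t esr)
{
    printf("EC=%#x IL=%u ISS=%#x WnR=%u DFSC=%#x\n",
           (unsigned)esr_ec(esr), (unsigned)esr_il(esr),
           (unsigned)esr_iss(esr), (unsigned)esr_wnr(esr),
           (unsigned)esr_dfsc(esr));
}
```

For ESR = 0x96000007 this gives EC = 0x25 (data abort taken at the current exception level, matching "DABT (current EL)"), WnR = 0 (a read), and DFSC = 0x07, which denotes a level 3 translation fault. That agrees with the page table walk in the log, where the pte for address 0x2c is zero.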

Comments (6)

6++ created the bug
6++ set the linked repository to openEuler/kernel

openeuler-ci-bot added the sig/Kernel label
6++ set the assignee to 成坚 (CHENG Jian)
成坚 (CHENG Jian) edited the description
crash /usr/lib/debug/usr/lib/modules/4.19.90-2105.6.0.0090.oe1.aarch64/vmlinux ./vmcore
dis -x membarrier_global_expedited


0xffff000008160dc8 <membarrier_global_expedited+0xc0>:  b.eq    0xffff000008160d9c <membarrier_global_expedited+0x94>
0xffff000008160dcc <membarrier_global_expedited+0xc4>:  adrp    x0, 0xffff0000092d3000 <page_wait_table+0x14c0>
0xffff000008160dd0 <membarrier_global_expedited+0xc8>:  add     x0, x0, #0x7c8
0xffff000008160dd4 <membarrier_global_expedited+0xcc>:  mov     x1, x24
0xffff000008160dd8 <membarrier_global_expedited+0xd0>:  ldr     x0, [x0,w19,sxtw #3]
0xffff000008160ddc <membarrier_global_expedited+0xd4>:  add     x0, x0, x1
0xffff000008160de0 <membarrier_global_expedited+0xd8>:  ldr     w1, [x0,#2516]
0xffff000008160de4 <membarrier_global_expedited+0xdc>:  tbz     w1, #3, 0xffff000008160d9c <membarrier_global_expedited+0x94>
0xffff000008160de8 <membarrier_global_expedited+0xe0>:  add     x0, x0, #0x990
0xffff000008160dec <membarrier_global_expedited+0xe4>:  bl      0xffff0000080fcfe0 <task_rcu_dereference>
0xffff000008160df0 <membarrier_global_expedited+0xe8>:  ldr     w0, [x0,#44]
0xffff000008160df4 <membarrier_global_expedited+0xec>:  tbnz    w0, #21, 0xffff000008160d9c <membarrier_global_expedited+0x94>
0xffff000008160df8 <membarrier_global_expedited+0xf0>:  cmp     w19, #0x0
0xffff000008160dfc <membarrier_global_expedited+0xf4>:  add     w0, w19, #0x3f
0xffff000008160e00 <membarrier_global_expedited+0xf8>:  csel    w0, w0, w19, lt
0xffff000008160e04 <membarrier_global_expedited+0xfc>:  negs    w2, w19
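
The faulting instruction at +0xe8 loads a 32-bit word from [x0, #44] right after task_rcu_dereference returns, and the register dump shows x0 = 0, so the load targets address 0 + 44 = 0x2c, which is exactly the address reported in the oops. The sketch below illustrates that arithmetic in user space. The struct is a hypothetical stand-in, not the real task_struct: the 44-byte padding is inferred only from the disassembly above, and FAKE_PF_KTHREAD mirrors the bit 21 tested by the following tbnz (assumed here to correspond to PF_KTHREAD).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: only the offset of 'flags' matters here. */
struct fake_task {
    unsigned char pad[0x2c];   /* fields preceding flags (layout assumed) */
    unsigned int flags;        /* loaded by: ldr w0, [x0,#44] */
};

/* Bit tested by: tbnz w0, #21 (assumed to be PF_KTHREAD). */
#define FAKE_PF_KTHREAD (1u << 21)

/* Address the CPU accesses for a field at 'offset' from 'base'. */
static uintptr_t field_address(uintptr_t base, size_t offset)
{
    return base + offset;
}

/* The check the compiled code performs once the load succeeds. */
static int fake_is_kthread(const struct fake_task *p)
{
    return (p->flags & FAKE_PF_KTHREAD) != 0;
}
```

With base = 0 (a NULL task pointer), field_address(0, offsetof(struct fake_task, flags)) evaluates to 0x2c, so the oops is consistent with reading the flags field through a NULL task pointer.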
6++ set the planned deadline to 2021-06-08
6++ set the planned start date to 2021-06-07
6++ set the priority to Major

In membarrier_global_expedited, the task pointer p that is obtained is NULL, so the direct access to p->flags dereferences a NULL pointer.


2 Problem analysis


2.1 The earliest bug (rq->curr was not protected)


The problem first showed up in:
https://lore.kernel.org/patchwork/patch/508120/
https://lore.kernel.org/patchwork/patch/508946/
https://lore.kernel.org/patchwork/patch/508980/
https://lore.kernel.org/patchwork/patch/509526/

2.2 Introduction of task_rcu_dereference


bac7857319bc sched/fair: Use task_rcu_dereference()
150593bf8693 sched/api: Introduce task_rcu_dereference() and try_get_task_struct()

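To address this problem, Red Hat posted a patch in 2014 that introduces task_rcu_dereference():
https://lkml.org/lkml/2014/10/22/833
https://lore.kernel.org/patchwork/cover/510962/
The function returns a non-NULL pointer only if the task cannot be freed before rcu_read_unlock().
Otherwise it returns NULL, which means the task has already been freed or is in the process of being freed, and would be freed before the unlock.

The patch was not merged until 2016. After that, the protection of rq->curr added by the patches above was changed from raw_spin_lock_irq to task_rcu_dereference.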


2.3 How the problem was introduced: an earlier patch removed the NULL check on p


Later, commit 227a4aadc75b sched/membarrier: Fix p->mm->membarrier_state racy load
removed the check on p. That is the problem here: p should be checked for NULL.

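2.4 Ensuring a task is freed only a grace period after it is dequeued, so that rcu_dereference can be used safely


Then, on 2019/08/20, a bug report pointed out that task_rcu_dereference() itself is problematic.
https://lkml.org/lkml/2019/8/30/574
The following patches were merged. They guarantee that a task is freed only after a grace period has elapsed since it left the runqueue or exited, so rcu_dereference() can be used safely in place of task_rcu_dereference().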
https://lore.kernel.org/patchwork/patch/1127742/

5311a98fef7d tasks, sched/core: RCUify the assignment of rq->curr
154abafc68bf tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
0ff7b2cfbae3 tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
3fbd7ee285b2 tasks: Add a count of task RCU users

From this point on, rcu_dereference can be used safely, and the NULL check is no longer needed.
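
The idea behind the "task RCU users" patches can be pictured with a small user-space model. This is only a sketch under assumptions, not the kernel implementation: the names model_task, model_put_rcu_user, and model_grace_period_end are invented for illustration, the initial count of 2 (one reference for the exit path, one for the scheduler) is my reading of the upstream series, and the grace period is reduced to an explicit function call.

```c
#include <stdbool.h>

/* Hypothetical model of a task guarded by an "RCU users" count. */
struct model_task {
    int  rcu_users;    /* outstanding RCU-user references */
    bool free_queued;  /* free handed over to the grace-period machinery */
    bool freed;        /* memory actually released */
};

static void model_task_init(struct model_task *t)
{
    t->rcu_users = 2;  /* assumption: one for exit/reaping, one for the scheduler */
    t->free_queued = false;
    t->freed = false;
}

/* Drop one reference; the last drop only queues the free, it does not free. */
static void model_put_rcu_user(struct model_task *t)
{
    if (--t->rcu_users == 0)
        t->free_queued = true;
}

/* Stand-in for "a grace period has elapsed": queued frees now happen. */
static void model_grace_period_end(struct model_task *t)
{
    if (t->free_queued)
        t->freed = true;
}
```

In this model a reader that picked up the task pointer before the last reference was dropped still sees freed == false until model_grace_period_end runs, which is the property that makes a plain rcu_dereference of rq->curr safe without a NULL check.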

3 How the problem was introduced in openEuler


3.1 The original introduction of membarrier_global_expedited


When membarrier_global_expedited was first introduced, p was checked for NULL.


3.2 Merging the patch that introduced the problem


08946eccabb9 sched/membarrier: Fix p->mm->membarrier_state racy load was merged.
It removed the NULL check on p that the original code had.


08946eccabb9 sched/membarrier: Fix p->mm->membarrier_state racy load
cfd49aa06b94 sched: Clean up active_mm reference counting
987805770a3f sched/membarrier: Remove redundant check

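This is what introduced the problem.

3.3 Analysis of how the problem was introduced


Mainline does not have this problem for the following reason.
Mainline merged the patches from 2.4 first. At that point, rq->curr obtained through rcu_dereference is never NULL, so merging the patch from 3.2 afterwards causes no problem.
openEuler, in contrast, merged the patch from 3.2 first, which removed the NULL check on p. Because the code was still using task_rcu_dereference at that point, p can be NULL.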

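Possible fixes:

  1. Add a NULL check on p. This works around the problem described in this issue, but another problem remains, in which sighand is undefined; see https://lkml.org/lkml/2019/8/30/574

  2. Merge the task rcu users patches, which guarantee that an exiting task is freed only after a grace period has elapsed since it was dequeued. rcu_dereference() can then replace task_rcu_dereference(), and the NULL check is no longer needed.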
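
Option 1 can be sketched in user space as follows. This is a simplified model, not the openEuler patch: guard_task, guard_lookup, and guard_count_ipi_targets are hypothetical names, the lookup simply returns whatever pointer is stored (possibly NULL, as the issue says task_rcu_dereference can), and the real loop's other conditions, such as skipping the current CPU and testing membarrier_state, are left out.

```c
#include <stddef.h>

#define GUARD_PF_KTHREAD (1u << 21)   /* stand-in for the kthread flag bit */

struct guard_task {
    unsigned int flags;
};

/* Mock lookup: may return NULL, like the call the issue describes. */
static struct guard_task *guard_lookup(struct guard_task **slot)
{
    return *slot;
}

/* Count the CPUs whose current task would be sent an IPI. */
static int guard_count_ipi_targets(struct guard_task **curr, int ncpus)
{
    int targets = 0;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        struct guard_task *p = guard_lookup(&curr[cpu]);

        if (!p)                          /* option 1: the restored NULL check */
            continue;
        if (p->flags & GUARD_PF_KTHREAD) /* skip CPUs running kernel threads */
            continue;
        targets++;
    }
    return targets;
}
```

Without the `if (!p)` line, the second check would read p->flags through a NULL pointer for any CPU whose lookup returned NULL, which is the crash reported in this issue.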

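CCB conclusion, 2021/06/16 18:00:

  1. Use option 1 first to work around the problem.
  2. Option 2 fixes not only the NULL pointer problem but also the problem that task_rcu_dereference() references the undefined sighand field of a task. It will be merged later, once it has been verified.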
