openEuler / kernel

[OLK-6.6] kernel panic on boot

Completed
Bug · Member
Created 2024-01-31 21:26

Looks like the following commit:

16257e430641 crypto: kabi: KABI reservation for crypto

causes the panic below. After removing the KABI_RESERVE() entries in "include/crypto/cryptd.h" and "include/crypto/hash.h", the panic disappears.

===============================================================

[   15.642274][ T1469] BUG: kernel NULL pointer dereference, address: 000000000000002c
[   15.659247][ T1469] #PF: supervisor read access in kernel mode
[   15.659248][ T1469] #PF: error_code(0x0000) - not-present page
[   15.659249][ T1469] PGD 12e952067 P4D 0
[   15.659251][ T1469] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   15.659252][ T1469] CPU: 80 PID: 1469 Comm: cryptomgr_test Not tainted 6.6.0-iommufd66+ #1
[   15.659254][ T1469] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022
[   15.659255][ T1469] RIP: 0010:crypto_ahash_setkey+0x11/0x60
[   15.659260][ T1469] Code: 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 47 70 53 48 89 fb <85> 70 2c 75 17 48 8b 47 38 ff d0 0f 1f 00 85 c0 75 13 83 63 5c fe
[   15.659261][ T1469] RSP: 0018:ffa00000209dfbf8 EFLAGS: 00010286
[   15.659262][ T1469] RAX: 0000000000000000 RBX: ff11000109abe3d0 RCX: 00000000fff000ff
[   15.659263][ T1469] RDX: 0000000000000010 RSI: ffffffffb62aea35 RDI: ff11000109abe3d0
[   15.659264][ T1469] RBP: 0000000000000000 R08: 00000000fff000ff R09: ffa00000209dfca0
[   15.659265][ T1469] R10: 0000000000000000 R11: ffa00000209dfc68 R12: ff11000109abf858
[   15.659265][ T1469] R13: ffa00000209dfca0 R14: 0000000000000000 R15: ffffffffb5e8c940
[   15.659266][ T1469] FS:  0000000000000000(0000) GS:ff11001c8e100000(0000) knlGS:0000000000000000
[   15.659267][ T1469] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.659268][ T1469] CR2: 000000000000002c CR3: 000000012e9cc002 CR4: 0000000000771ee0
[   15.659269][ T1469] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   15.659270][ T1469] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   15.659270][ T1469] PKRU: 55555554
[   15.659271][ T1469] Call Trace:
[   15.659272][ T1469]  <TASK>
[   15.659273][ T1469]  ? __die+0x23/0x70
[   15.659277][ T1469]  ? page_fault_oops+0x81/0x150
[   15.659280][ T1469]  ? exc_page_fault+0x5ea/0x7d0
[   15.659283][ T1469]  ? asm_exc_page_fault+0x26/0x30
[   15.659287][ T1469]  ? crypto_ahash_setkey+0x11/0x60
[   15.659289][ T1469]  crypto_ahash_setkey+0x1c/0x60
[   15.659291][ T1469]  test_ahash_vec_cfg+0x165/0x840
[   15.659294][ T1469]  ? vsnprintf+0x44d/0x630
[   15.659296][ T1469]  ? sprintf+0x5a/0x80
[   15.659297][ T1469]  __alg_test_hash.isra.0+0x1aa/0x3a0
[   15.659299][ T1469]  alg_test+0x199/0x610
[   15.659300][ T1469]  ? __schedule+0x611/0xc30
[   15.659302][ T1469]  ? __pfx_cryptomgr_test+0x10/0x10
[   15.659305][ T1469]  cryptomgr_test+0x24/0x40
[   15.659307][ T1469]  kthread+0xe5/0x120
[   15.659310][ T1469]  ? __pfx_kthread+0x10/0x10
[   15.659312][ T1469]  ret_from_fork+0x31/0x50
[   15.659315][ T1469]  ? __pfx_kthread+0x10/0x10
[   15.659317][ T1469]  ret_from_fork_asm+0x1b/0x30
[   15.659321][ T1469]  </TASK>

Comments (21)

Jason Zeng created the bug


openeuler-ci-bot added the sig/Kernel label
Jason Zeng edited the description

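I could not reproduce the problem with openeuler_defconfig. Could you share the config used for the build? Also, were all the kernel modules (.ko) rebuilt?

My steps to reproduce: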
make distclean
cp arch/x86/configs/openeuler_defconfig .config
make olddefconfig
make -j256
make modules_install -j256
make install

I reproduced it on both SPR and ICX.
Could it be platform-specific?

Are these two platforms arm64?

Or are they x86?

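Sorry, I misspoke: I reproduced it on EMR and ICX (my EMR machine was upgraded from SPR).
Both are x86 platforms, two of Intel's more recent ones.
Intel's newest platform is currently EMR (Emerald Rapids), followed by SPR (Sapphire Rapids), then ICX (Ice Lake).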

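This looks like it is reported by the crypto algorithm self-tests at boot. Are any extra kernel modules (.ko) loaded in this environment?

No extra modules.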

grub configuration

title Fedora Linux (6.6.0-iommufd66+) 38 (Server Edition)
version 6.6.0-iommufd66+
linux /vmlinuz-6.6.0-iommufd66+
initrd /initramfs-6.6.0-iommufd66+.img
options root=UUID=255d6584-02dc-49e2-848a-36d9f9af7eb6 ro console=tty0 console=ttyS0,115200n8 ignore_loglevel panic=5 kernel.softlockup_panic=1 crashkernel=2G intel_iommu=on
grub_users $grub_users
grub_arg --unrestricted
grub_class fedora

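I cannot reproduce it on my side. Could you add a print to see which algorithm is failing?

Print the alg argument inside the alg_test() function.

OK, I will try.

Also, I captured a kdump: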

crash> dis -lr crypto_ahash_setkey
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 167
0xffffffffb41c9de0 <crypto_ahash_setkey>:       endbr64
crash> dis -l crypto_ahash_setkey
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 167
0xffffffffb41c9de0 <crypto_ahash_setkey>:       endbr64
0xffffffffb41c9de4 <crypto_ahash_setkey+4>:     nopl   0x0(%rax,%rax,1)
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 171
0xffffffffb41c9de9 <crypto_ahash_setkey+9>:     mov    0x70(%rdi),%rax
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 167
0xffffffffb41c9ded <crypto_ahash_setkey+13>:    push   %rbx
0xffffffffb41c9dee <crypto_ahash_setkey+14>:    mov    %rdi,%rbx
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 171
0xffffffffb41c9df1 <crypto_ahash_setkey+17>:    test   %esi,0x2c(%rax)   <=======================
0xffffffffb41c9df4 <crypto_ahash_setkey+20>:    jne    0xffffffffb41c9e0d <crypto_ahash_setkey+45>
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 174
0xffffffffb41c9df6 <crypto_ahash_setkey+22>:    mov    0x38(%rdi),%rax
0xffffffffb41c9dfa <crypto_ahash_setkey+26>:    call   *%rax
0xffffffffb41c9dfc <crypto_ahash_setkey+28>:    nopl   (%rax)
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 176
0xffffffffb41c9dff <crypto_ahash_setkey+31>:    test   %eax,%eax
0xffffffffb41c9e01 <crypto_ahash_setkey+33>:    jne    0xffffffffb41c9e16 <crypto_ahash_setkey+54>
/home/zengz/linux-OLK-6.6/./include/linux/crypto.h: 492
0xffffffffb41c9e03 <crypto_ahash_setkey+35>:    andl   $0xfffffffe,0x5c(%rbx)
/home/zengz/linux-OLK-6.6/crypto/ahash.c: 182
0xffffffffb41c9e07 <crypto_ahash_setkey+39>:    pop    %rbx

It looks like tfm->__crt_alg is already NULL on entry to crypto_ahash_setkey(), so the read of tfm->__crt_alg->cra_alignmask inside crypto_tfm_alg_alignmask() triggers the panic.

crash> bt
PID: 1494     TASK: ff1100011cbd5f40  CPU: 98   COMMAND: "cryptomgr_test"
 #0 [ffa000002006f950] machine_kexec at ffffffffb3c78617
 #1 [ffa000002006f9a8] __crash_kexec at ffffffffb3e0143e
 #2 [ffa000002006fa68] crash_kexec at ffffffffb3e0247c
 #3 [ffa000002006fa70] oops_end at ffffffffb3c3346d
 #4 [ffa000002006fa90] page_fault_oops at ffffffffb3c8ca78
 #5 [ffa000002006fae8] exc_page_fault at ffffffffb4918d6a
 #6 [ffa000002006fb40] asm_exc_page_fault at ffffffffb4a012c6
    [exception RIP: crypto_ahash_setkey+17]
    RIP: ffffffffb41c9df1  RSP: ffa000002006fbf8  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: ff11001d07ecc9d0  RCX: 00000000fff000ff
    RDX: 0000000000000010  RSI: ffffffffb50ae93a  RDI: ff11001d07ecc9d0
    RBP: 0000000000000000   R8: 00000000fff000ff   R9: ffa000002006fca0
    R10: 0000000000000000  R11: ffa000002006fc68  R12: ff11001d07eaf0d8
    R13: ffa000002006fca0  R14: 0000000000000000  R15: ffffffffb4c8c900
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffa000002006fc00] crypto_ahash_setkey at ffffffffb41c9dfc
 #8 [ffa000002006fc10] test_ahash_vec_cfg at ffffffffb41d5215
 #9 [ffa000002006fd90] __alg_test_hash at ffffffffb41d5aaa
#10 [ffa000002006fe08] alg_test at ffffffffb41d4c29
#11 [ffa000002006fee8] cryptomgr_test at ffffffffb41d0144
#12 [ffa000002006fef8] kthread at ffffffffb3d302c5
#13 [ffa000002006ff30] ret_from_fork at ffffffffb3c3eb41
#14 [ffa000002006ff50] ret_from_fork_asm at ffffffffb3c0265b
crash> struct crypto_ahash ff11001d07ecc9d0
struct crypto_ahash {
  init = 0xffffffffb41e3f80 <cryptd_hash_final_enqueue>,
  update = 0xffffffffb41e3020 <cryptd_hash_finup_enqueue>,
  final = 0xffffffffb41e3e90 <cryptd_hash_digest_enqueue>,
  finup = 0xffffffffb41e25d0 <cryptd_hash_export>,
  digest = 0xffffffffb41e2610 <cryptd_hash_import>,
  export = 0xffffffffb41e3440 <cryptd_hash_setkey>,
  import = 0x3c00000014,
  setkey = 0x0,
  statesize = 0,
  reqsize = 0,
  kabi_reserved1 = 4294967297,
  kabi_reserved2 = 4294967295,
  base = {
    refcnt = {
      refs = {
        counter = -1273194992
      }
    },
    crt_flags = 4293918975,
    node = 416071784,
    exit = 0x0,
    __crt_alg = 0x0,
    kabi_reserved1 = 1,
    kabi_reserved2 = 18379471679260441120,
    __crt_ctx = 0xff11001d07ecca58
  }
}

Oh, I have reproduced it! One of the acceleration kernel modules is probably at fault.

The panic happens while testing ghash.

[   15.473769][ T1468] alg_test: alg: ghash
[   15.474946][ T1271] ACPI: bus type drm_connector registered
[   15.479992][ T1468] ahash: alg_cra_name: ghash
[   15.492485][ T1468] BUG: kernel NULL pointer dereference, address: 000000000000002c
[   15.492486][ T1468] #PF: supervisor read access in kernel mode
[   15.492487][ T1468] #PF: error_code(0x0000) - not-present page
[   15.492488][ T1468] PGD 0
[   15.492490][ T1468] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   15.492493][ T1468] CPU: 43 PID: 1468 Comm: cryptomgr_test Not tainted 6.6.0-iommufd66+ #4
[   15.492550][ T1269] dca service started, version 1.12.1
[   15.538232][ T1468] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022
[   15.538233][ T1468] RIP: 0010:crypto_ahash_setkey+0x11/0x60

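Yes, it is the hardware-accelerated ghash implementation.

We still need to look at what is special about this algorithm's implementation.

Judging by the first few function pointers in the structure below, everything is shifted by exactly two function pointers, that is, two 64-bit slots.
Could some structure have been compiled without the two KABI_RESERVE() fields?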

crash> struct crypto_ahash ff11001d07ecc9d0
struct crypto_ahash {
  init = 0xffffffffb41e3f80 <cryptd_hash_final_enqueue>,
  update = 0xffffffffb41e3020 <cryptd_hash_finup_enqueue>,
  final = 0xffffffffb41e3e90 <cryptd_hash_digest_enqueue>,
  finup = 0xffffffffb41e25d0 <cryptd_hash_export>,
  digest = 0xffffffffb41e2610 <cryptd_hash_import>,
  export = 0xffffffffb41e3440 <cryptd_hash_setkey>,
  import = 0x3c00000014,
  setkey = 0x0,
  statesize = 0,
  reqsize = 0,
  kabi_reserved1 = 4294967297,
  kabi_reserved2 = 4294967295,
  base = {
    refcnt = {
      refs = {
        counter = -1273194992
      }
    },
    crt_flags = 4293918975,
    node = 416071784,
    exit = 0x0,
    __crt_alg = 0x0,
    kabi_reserved1 = 1,
    kabi_reserved2 = 18379471679260441120,
    __crt_ctx = 0xff11001d07ecca58
  }
}

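It is caused by a type cast. See the explanation below.

The root cause has been found: cryptd_alloc_ahash() casts the crypto_ahash structure to a cryptd_ahash structure.

The cryptd_ahash structure simply overlays the crypto_ahash structure through its base member.

For that reason, no member may be added before base in cryptd-style structures. Otherwise cryptd_ahash and crypto_ahash no longer line up.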

zhengzengkai set the associated branch to OLK-6.6

Boot verification succeeded:
(boot screenshot)
