
src-openEuler/kernel


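ARM 5.10 kernel cannot effectively handle the SEA exception triggered when a user-space process consumes a UCE fault on a 1GB huge page, causing a system panic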

Completed
Bug
Created
2022-11-16 11:02

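[Title] ARM 5.10 kernel cannot effectively handle the SEA exception triggered when a user-space process consumes a UCE fault on a 1GB huge page, causing a system panic
[Environment]
Hardware:
ARM physical machine, CPU: Kunpeng-920
Software:
OS version and branch: openEuler 22.03 LTS
Kernel: kernel-5.10.0-60.18.0.50
[Steps to reproduce]
Detailed steps:
Create a process that maps 1GB hugetlb huge-page memory, inject a UCE (uncorrectable error) into that huge-page memory, then have the process access the faulty page so that a user-space SEA (synchronous external abort) is raised.
Frequency: always reproducible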
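
The first reproduction step (mapping 1GB hugetlb memory) can be sketched as follows. This is a minimal illustration and not the reporter's actual test program: the function names are hypothetical, the flag values are the Linux `MAP_HUGETLB` / `MAP_HUGE_1GB` encodings used on arm64 and x86-64, and the system is assumed to have 1GB huge pages reserved (otherwise the mapping fails and the helper returns `None`). The issue does not say how the UCE was injected, so that step is deliberately left out.

```python
import mmap

# Linux mmap flag values (assumption: the asm-generic encoding used on
# arm64 and x86-64). Python's mmap module may not export these names,
# so they are spelled out here.
MAP_HUGETLB = 0x40000
MAP_HUGE_SHIFT = 26
MAP_HUGE_1GB = 30 << MAP_HUGE_SHIFT  # log2(1 GiB) = 30

ONE_GIB = 1 << 30


def hugetlb_1g_flags() -> int:
    """Flags for a private anonymous mapping backed by a 1 GiB hugetlb page."""
    return mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB


def map_one_gib_hugepage():
    """Try to map one 1 GiB hugetlb page; return None if the kernel refuses."""
    try:
        return mmap.mmap(
            -1,
            ONE_GIB,
            flags=hugetlb_1g_flags(),
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    except (OSError, ValueError):
        # Typical causes: no 1 GiB pages reserved, or hugetlb not supported.
        return None


if __name__ == "__main__":
    region = map_one_gib_hugepage()
    if region is None:
        print("no 1 GiB hugetlb page available")
    else:
        region[0] = 1  # touch the mapping so the huge page is faulted in
        print("mapped one 1 GiB hugetlb page")
        region.close()
```

After the page is faulted in, the remaining steps from the report (injecting the UCE and reading the poisoned location) would be carried out with whatever platform fault-injection mechanism is available.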
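[Expected result]
The process that consumes the UCE-faulted huge-page memory is killed.
[Actual result]
The user-space SEA is triggered repeatedly and eventually escalates to a Fatal-level error, causing a system panic.
[Attachments]
Reset log: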

[ 2188.908674] Memory failure: 0x5f00000: recovery action for non-pmd-sized huge page: Ignored
[ 2189.229725] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[ 2189.229727] {2}[Hardware Error]: event severity: recoverable
[ 2189.229728] {2}[Hardware Error]:  Error 0, type: recoverable
[ 2189.229730] {2}[Hardware Error]:   section_type: ARM processor error
[ 2189.229731] {2}[Hardware Error]:   MIDR: 0x00000000481fd010
[ 2189.229732] {2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x00000000811f0200
[ 2189.229734] {2}[Hardware Error]:   error affinity level: 0
[ 2189.229735] {2}[Hardware Error]:   running state: 0x1
[ 2189.229736] {2}[Hardware Error]:   Power State Coordination Interface state: 0
[ 2189.229737] {2}[Hardware Error]:   Error info structure 0:
[ 2189.229738] {2}[Hardware Error]:   num errors: 1
[ 2189.229739] {2}[Hardware Error]:    error_type: 0, cache error
[ 2189.229741] {2}[Hardware Error]:    error_info: 0x0000000000400014
[ 2189.229742] {2}[Hardware Error]:     cache level: 1
[ 2189.229743] {2}[Hardware Error]:     the error has not been corrected
[ 2189.229744] {2}[Hardware Error]:    virtual fault address: 0x0000000000000000
[ 2189.229745] {2}[Hardware Error]:    physical fault address: 0x0000005f00000000
[ 2189.229747] {2}[Hardware Error]:   Vendor specific error info has 16 bytes:
[ 2189.229749] {2}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000  ................
[ 2189.229764] Memory failure: 0x5f00000: already hardware poisoned
[ 2189.549968] Memory failure: 0x5f00000: already hardware poisoned
[ 2189.867734] Memory failure: 0x5f00000: already hardware poisoned
[ 2189.869878] Memory failure: 0x5f00000: already hardware poisoned
[ 2190.191621] Memory failure: 0x5f00000: already hardware poisoned
[ 2190.196252] Memory failure: 0x5f00000: already hardware poisoned
[ 2190.507312] Memory failure: 0x5f00000: already hardware poisoned
[ 2190.812978] Memory failure: 0x5f00000: already hardware poisoned
[ 2191.118606] Memory failure: 0x5f00000: already hardware poisoned
[ 2191.424222] Memory failure: 0x5f00000: already hardware poisoned
[ 2191.729837] Memory failure: 0x5f00000: already hardware poisoned
[ 2192.035495] Memory failure: 0x5f00000: already hardware poisoned
[ 2192.341133] Memory failure: 0x5f00000: already hardware poisoned
[ 2192.646752] Memory failure: 0x5f00000: already hardware poisoned
[ 2192.952392] Memory failure: 0x5f00000: already hardware poisoned
[ 2193.258030] Memory failure: 0x5f00000: already hardware poisoned
[ 2193.563676] Memory failure: 0x5f00000: already hardware poisoned
[ 2193.565805] Memory failure: 0x5f00000: already hardware poisoned
[ 2193.871427] Memory failure: 0x5f00000: already hardware poisoned
[ 2194.177079] ghes_print_estatus: 18 callbacks suppressed
[ 2194.177081] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[ 2194.177083] {3}[Hardware Error]: event severity: recoverable
[ 2194.177085] {3}[Hardware Error]:  Error 0, type: recoverable
[ 2194.177087] {3}[Hardware Error]:   section_type: ARM processor error
[ 2194.177088] {3}[Hardware Error]:   MIDR: 0x00000000481fd010
[ 2194.177089] {3}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x00000000811b0300
[ 2194.177090] {3}[Hardware Error]:   error affinity level: 0
[ 2194.177091] {3}[Hardware Error]:   running state: 0x1
[ 2194.177092] {3}[Hardware Error]:   Power State Coordination Interface state: 0
[ 2194.177094] {3}[Hardware Error]:   Error info structure 0:
[ 2194.177095] {3}[Hardware Error]:   num errors: 1
[ 2194.177096] {3}[Hardware Error]:    error_type: 0, cache error
[ 2194.177097] {3}[Hardware Error]:    error_info: 0x0000000000400014
[ 2194.177098] {3}[Hardware Error]:     cache level: 1
[ 2194.177099] {3}[Hardware Error]:     the error has not been corrected
[ 2194.177101] {3}[Hardware Error]:    virtual fault address: 0x0000000000000000
[ 2194.177101] {3}[Hardware Error]:    physical fault address: 0x0000005f00000000
[ 2194.177103] {3}[Hardware Error]:   Vendor specific error info has 16 bytes:
[ 2194.177106] {3}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000  ................
[ 2194.177120] Memory failure: 0x5f00000: already hardware poisoned
[ 2195.411327] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
[ 2195.420405] {4}[Hardware Error]: event severity: fatal
[ 2195.426254] {4}[Hardware Error]:  Error 0, type: fatal
[ 2195.432099] {4}[Hardware Error]:   section_type: ARM processor error
[ 2195.432108] {4}[Hardware Error]:   MIDR: 0x00000000481fd010
[ 2195.445462] {4}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x00000000811b0300
[ 2195.445469] {4}[Hardware Error]:   error affinity level: 0
[ 2195.461074] {4}[Hardware Error]:   running state: 0x1
[ 2195.461081] {4}[Hardware Error]:   Power State Coordination Interface state: 0
[ 2195.474773] {4}[Hardware Error]:   Error info structure 0:
[ 2195.474780] {4}[Hardware Error]:   num errors: 2
[ 2195.486292] {4}[Hardware Error]:    propagated error captured
[ 2195.486298] {4}[Hardware Error]:    error_type: 0, cache error
[ 2195.499299] {4}[Hardware Error]:    error_info: 0x0000000020400014
[ 2195.499306] {4}[Hardware Error]:     cache level: 1
[ 2195.511778] {4}[Hardware Error]:     the error has not been corrected
[ 2195.511784] {4}[Hardware Error]:    virtual fault address: 0x0000ffffe34fc7b0
[ 2195.526785] {4}[Hardware Error]:    physical fault address: 0x8000005f00000000
[ 2195.526792] {4}[Hardware Error]:   Vendor specific error info has 16 bytes:
[ 2195.542411] {4}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000  ................
[ 2195.552582] Kernel panic - not syncing: Fatal hardware error!
[ 2195.586244] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 6.27 10/20/2022
[ 2195.586257] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 2195.608553] 
[ 2195.608559] Call trace:
[ 2195.628744]  dump_backtrace+0x0/0x1e4
[ 2195.628751]  show_stack+0x20/0x2c
[ 2195.648350]  dump_stack+0xd8/0x140
[ 2195.648357]  panic+0x220/0x4d0
[ 2195.657648]  ghes_do_memory_failure+0x0/0xa0
[ 2195.657654]  ghes_proc+0x148/0x200
[ 2195.675706]  ghes_notify_hed+0xa0/0x154
[ 2195.675714]  blocking_notifier_call_chain+0x74/0xac
[ 2195.698853]  acpi_hed_notify+0x28/0x3c
[ 2195.698861]  acpi_device_notify+0x24/0x30
[ 2195.716889]  acpi_ev_notify_dispatch+0x68/0x78
[ 2195.716894]  acpi_os_execute_deferred+0x24/0x3c
[ 2195.727956]  process_one_work+0x1d0/0x490
[ 2195.727962]  worker_thread+0x158/0x3d0
[ 2195.751459]  kthread+0x108/0x13c
[ 2195.751465]  ret_from_fork+0x10/0x18
[ 2195.770474] kernel fault(0x5) notification starting on CPU 0
[ 2195.770479] kernel fault(0x5) notification finished on CPU 0
[ 2195.789504] SMP: stopping secondary CPUs
[ 2196.388923] Kernel Offset: 0x52b40e540000 from 0xffff800100000000
[ 2196.396184] PHYS_OFFSET: 0xffffbd2dc0000000
[ 2196.401534] CPU features: 0x0000,88000006,6aa08838
[ 2196.407482] Memory Limit: none
[ 2196.421504] Starting crashdump kernel...
[ 2196.426580] Bye!
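
The log above can be cross-checked with a short script. The sketch below uses a few lines copied from the log, counts the `Memory failure` messages per page frame number (PFN), and confirms that PFN `0x5f00000` corresponds to the reported physical fault address `0x0000005f00000000`. The 4 KiB base page size (`PAGE_SHIFT = 12`) is an assumption that happens to be consistent with the log; the helper names are hypothetical.

```python
import re
from collections import Counter

PAGE_SHIFT = 12   # assumption: 4 KiB base pages
ONE_GIB = 1 << 30

# A few lines copied from the reset log above.
LOG_EXCERPT = """\
[ 2188.908674] Memory failure: 0x5f00000: recovery action for non-pmd-sized huge page: Ignored
[ 2189.229745] {2}[Hardware Error]:    physical fault address: 0x0000005f00000000
[ 2189.229764] Memory failure: 0x5f00000: already hardware poisoned
[ 2189.549968] Memory failure: 0x5f00000: already hardware poisoned
"""

MF_RE = re.compile(r"Memory failure: (0x[0-9a-f]+): (.+)$")
PHYS_RE = re.compile(r"physical fault address: (0x[0-9a-f]+)")


def summarize(log: str):
    """Return (Counter keyed by (pfn, message), set of physical fault addresses)."""
    events = Counter()
    phys_addrs = set()
    for line in log.splitlines():
        mf = MF_RE.search(line)
        if mf:
            events[(int(mf.group(1), 16), mf.group(2))] += 1
        phys = PHYS_RE.search(line)
        if phys:
            phys_addrs.add(int(phys.group(1), 16))
    return events, phys_addrs


if __name__ == "__main__":
    events, phys_addrs = summarize(LOG_EXCERPT)
    for (pfn, message), count in events.items():
        addr = pfn << PAGE_SHIFT
        print(f"pfn {pfn:#x} -> phys {addr:#x} "
              f"(1 GiB aligned: {addr % ONE_GIB == 0}): {message} x{count}")
    print("reported physical fault addresses:", [hex(a) for a in phys_addrs])
```

Read this way, the log shows one recovery attempt that was ignored because the page is a non-PMD-sized huge page, followed by repeated "already hardware poisoned" reports for the same PFN until the fatal record appears.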

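Comments (3)

Lv Ying created the bug 2 years ago
openeuler-ci-bot added the sig/Kernel label 2 years ago

This is an issue in the native (upstream) Linux kernel; a solution is under discussion.

openeuler-ci-bot changed the task status from To Do to Completed 2 years ago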
