openEuler/kernel

x86/quirks: Add parameter to clear MSIs early on boot

Status: Done
Created: 2024-01-18 10:33

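Background:
On x86 with a Hi1822 NIC, after a virtual machine panics, the crash kernel boots to generate the vmcore,
but PID 1 hangs because serial port interrupts are never delivered.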

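Root cause analysis (liangyun):

  1. While the second (crash) kernel initializes, init_IRQ reserves vectors 48-63 for IRQs 0-15.
    Before the panic, the hinic 1822 NIC interrupts happened to use vectors in the same range, and the NIC keeps raising interrupts after the panic.
    Because such a vector matches one reserved by init_IRQ, the interrupt is wrongly dispatched to handle_level_irq.
    If the vector did not collide, the else branch would run and write the EOI register.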
    desc = __this_cpu_read(vector_irq[vector]);
    if (likely(!IS_ERR_OR_NULL(desc))) {
        handle_irq(desc, regs); /* wrongly ends up in handle_level_irq */
    } else {
        ack_APIC_irq(); /* native_apic_msr_eoi_write: writes the EOI register */
        __this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
    }

    If the NIC interrupt is never answered with an EOI, subsequent device interrupts are no longer delivered.
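
The effect of a missing EOI can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions, not kernel code: `toy_cpu`, `toy_blocked`, and `toy_deliver` are hypothetical names, the in-service state is reduced to one flag per vector, and an in-service vector is assumed to block every vector of the same or lower priority class (`vector >> 4`).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_VECTORS 256

/* Toy model of one CPU's vector table and local APIC state (hypothetical). */
struct toy_cpu {
    bool has_desc[TOY_NR_VECTORS];       /* stands in for vector_irq[vector] being valid */
    bool desc_sends_eoi[TOY_NR_VECTORS]; /* does the installed handler write the EOI? */
    bool in_service[TOY_NR_VECTORS];     /* stands in for the APIC in-service bits */
};

/* Simplified priority rule: an in-service vector blocks its own class and lower ones. */
static bool toy_blocked(const struct toy_cpu *cpu, unsigned int vector)
{
    for (unsigned int v = 0; v < TOY_NR_VECTORS; v++)
        if (cpu->in_service[v] && (v >> 4) >= (vector >> 4))
            return true;
    return false;
}

/* Returns true if the interrupt reached the CPU, false if it was held back. */
static bool toy_deliver(struct toy_cpu *cpu, unsigned int vector)
{
    assert(vector < TOY_NR_VECTORS);
    if (toy_blocked(cpu, vector))
        return false;

    cpu->in_service[vector] = true;
    if (cpu->has_desc[vector]) {
        /* Handler path: the in-service bit is cleared only if the handler sends an EOI. */
        if (cpu->desc_sends_eoi[vector])
            cpu->in_service[vector] = false;
    } else {
        /* Unused vector: the else branch acknowledges the interrupt itself. */
        cpu->in_service[vector] = false;
    }
    return true;
}
```

In this model, a stale interrupt on a reserved vector such as 52 (with `has_desc` set and `desc_sends_eoi` clear) leaves its in-service flag set, so a later interrupt on the same vector is held back. The same stale interrupt on an unused vector is acknowledged and does no harm.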
  2. During crash kernel boot, apic_pending_intr_clear already accounts for leftover interrupts
    that need an EOI: it checks whether any interrupt is pending and acknowledges it.
    See https://lore.kernel.org/all/20190722105219.158847694@linutronix.de/T/#u

    for (i = 0; i < 512; i++) {
        if (!apic_check_and_ack(&irr, &isr)) /* returns once nothing is pending; otherwise runs up to 512 times */
            return;
    }

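However, NIC interrupts can still arrive after apic_pending_intr_clear has finished.
From then on nothing acknowledges them with an EOI, so later device interrupts are not delivered and the kdump flow cannot complete.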
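
The timing window can be sketched with a toy model. It is an illustration under stated assumptions, not the kernel implementation: `toy_lapic`, `toy_check_and_ack`, `toy_pending_intr_clear`, and `toy_device_raises_irq` are hypothetical names, and the APIC state is reduced to a count of unacknowledged interrupts.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy local APIC: only counts interrupts that still need an EOI (hypothetical). */
struct toy_lapic {
    unsigned int unacked;
};

/* Acknowledge one leftover interrupt; returns false when nothing is pending. */
static bool toy_check_and_ack(struct toy_lapic *apic)
{
    if (apic->unacked == 0)
        return false;
    apic->unacked--;
    return true;
}

/* Bounded drain loop, mirroring the 512-iteration limit quoted above.
 * Returns the number of interrupts it acknowledged. */
static unsigned int toy_pending_intr_clear(struct toy_lapic *apic)
{
    unsigned int i;

    for (i = 0; i < 512; i++) {
        if (!toy_check_and_ack(apic))
            break;
    }
    return i;
}

/* A device whose MSI is still enabled can raise an interrupt at any time. */
static void toy_device_raises_irq(struct toy_lapic *apic)
{
    apic->unacked++;
}
```

If `toy_device_raises_irq` is called after `toy_pending_intr_clear` has returned, `unacked` stays nonzero because the drain runs only once during boot. That is the gap the description above points at.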

Solution:
The Ubuntu community hit a similar problem. The symptoms differ slightly, but the same patch resolves this issue. References:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
https://patchwork.kernel.org/project/linux-pci/patch/20181018183721.27467-3-gpiccoli@canonical.com/
The patches have been backported to the Ubuntu Xenial, Bionic, Cosmic, and Disco releases.
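
The idea behind the fix, clearing the MSI/MSI-X enable bits of each device early in boot so that a device left running by the crashed kernel stops raising interrupts, can be sketched against a mock configuration space. This is a minimal user-space illustration, not the patch itself: `toy_clear_msi`, `cfg_read16`, and `cfg_write16` are hypothetical helpers that operate on a 256-byte array, whereas the real patch goes through the kernel's early PCI config accessors. The register offsets and bit masks follow the standard PCI capability layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SIZE              256
#define PCI_STATUS            0x06   /* status register */
#define PCI_STATUS_CAP_LIST   0x10   /* device has a capability list */
#define PCI_CAPABILITY_LIST   0x34   /* offset of the first capability pointer */
#define PCI_CAP_ID_MSI        0x05
#define PCI_CAP_ID_MSIX       0x11
#define PCI_MSI_FLAGS_ENABLE  0x0001 /* bit 0 of MSI Message Control */
#define PCI_MSIX_FLAGS_ENABLE 0x8000 /* bit 15 of MSI-X Message Control */

/* Little-endian 16-bit accessors on the mock config space (hypothetical helpers). */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, unsigned int off, uint16_t val)
{
    cfg[off] = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Walk the capability list and clear the MSI and MSI-X enable bits.
 * Returns how many capabilities were actually disabled. */
static int toy_clear_msi(uint8_t *cfg)
{
    int cleared = 0;
    int ttl = 48; /* guard against a looping capability list */
    uint8_t pos;

    assert(cfg != NULL);
    if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    while (pos >= 0x40 && ttl-- > 0) {
        uint8_t id = cfg[pos];
        uint16_t flags = cfg_read16(cfg, pos + 2u); /* Message Control */

        if (id == PCI_CAP_ID_MSI && (flags & PCI_MSI_FLAGS_ENABLE)) {
            cfg_write16(cfg, pos + 2u, (uint16_t)(flags & ~PCI_MSI_FLAGS_ENABLE));
            cleared++;
        } else if (id == PCI_CAP_ID_MSIX && (flags & PCI_MSIX_FLAGS_ENABLE)) {
            cfg_write16(cfg, pos + 2u, (uint16_t)(flags & ~PCI_MSIX_FLAGS_ENABLE));
            cleared++;
        }
        pos = cfg[pos + 1] & 0xFC; /* next capability pointer */
    }
    return cleared;
}
```

Only the enable bits are touched. The other Message Control fields are left as they were, so a driver that later reinitializes the device can reprogram MSI in the usual way.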

Comments (2)

zhengzengkai created the task 1 year ago

openeuler-ci-bot added the sig/Kernel label 1 year ago
zhengzengkai edited the description 1 year ago

Verification record:
Reproduction steps:
systemctl stop sysmonitor
systemctl stop irqbalance
for i in {31..40};do echo 1 > /proc/irq/$i/smp_affinity;done   # bind the Hi1822 interrupts to CPU 0
taskset -ca 0 echo c > /proc/sysrq-trigger   # trigger the panic on CPU 0

Before patch:
The problem reproduces every time: boot hangs and kdump cannot generate a vmcore.

[    1.300379][    T1] systemd[1]: Detected architecture x86-64.
[    1.300828][    T1] systemd[1]: Running in initial RAM disk.
[    6.902809][   T25] usb 1-1: New USB device found, idVendor=0627, idProduct=0001, bcdDevice= 0.00
[    6.903515][   T25] usb 1-1: New USB device strings: Mfr=1, Product=3, SerialNumber=10
[    6.904139][   T25] usb 1-1: Product: QEMU USB Tablet
[    6.904548][   T25] usb 1-1: Manufacturer: QEMU
[    6.904901][   T25] usb 1-1: SerialNumber: 28754-0000:00:01.2-1
[    8.646987][   T25] input: QEMU QEMU USB Tablet as /devices/pci0000:00/0000:00:01.2/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input/input3
[    8.647966][   T25] hid-generic 0003:0627:0001.0001: input,hidraw0: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:01.2-1/input0
[   31.391202][    T1] systemd[1]: Hostname set to <unc-hwx1263137-x86-20240105-1005-sdu--1>.
[   31.408126][  T192] systemd-gpt-auto-generator[192]: EFI loader partition unknown, exiting.
[   31.414761][  T192] systemd-gpt-auto-generator[192]: (The boot loader did not set EFI variable LoaderDevicePartUUID.)
[   31.447690][    T1] systemd[1]: /usr/lib/systemd/system/kdump-capture.service:23: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[   31.449311][    T1] systemd[1]: /usr/lib/systemd/system/kdump-capture.service:24: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[   31.461690][    T1] systemd[1]: Queued start job for default target Initrd Default Target.
[   31.462521][    T1] systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[   61.598877][    T1] systemd[1]: Reached target Initrd Root Device.

After patch:
The crash kernel boots successfully and the vmcore is dumped normally:

[   80.954241][T32803] [kbox] end panic event
[   80.954876][T32803] Kernel Offset: 0x3a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   80.958067][T32803] kexec: Bye!
[   80.958661][T32803] kexec: [Kexec]: timestamp: 1773495602854, tsc freq: 2300004
[    0.000000][    T0] Linux version 5.10.0-60.18.0.50.h1107.eulerosv2r11.x86_64 (abuild@pekphisprc30674) (gcc_old (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #1 SMP Thu Jan 18 11:49:08 UTC 2024
[    0.000000][    T0] Command line: BOOT_IMAGE=/vmlinuz-5.10.0-60.18.0.50.xxxx.x86_64 ro net.ifnames=0 biosdevname=0 oops=panic softlockup_panic=1 crash_kexec_post_notifiers panic=3 console=tty0 nmi_watchdog=1 selinux=0 no-steal-time rd.shell=0 audit=1 files_panic_enable=1 console=ttyS0,115200 scsi_mod.scan=sync cma=0 nospec noibrs noibpb nopti spectre_v2=off nospec_store_bypass_disable force_split_huge_page hashdist=1 fsck.mode=auto fsck.repair=preen ext4.speed=0 exec_hugepages default_hugepagesz=2M isolcpus=managed_irq,domain,2 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr hest_disable novmcoredd rd.timeout=120 rd.emergency=reboot intel_iommu=off cma=0 pci=clearmsi loglevel=7 disable_cpu_apicid=0 elfcorehdr=3075444K
......
[    0.009472][    T0] Clearing MSI/MSI-X enable bits early in boot (quirk)
......
[    7.251988][  T516] EXT4-fs (dm-0): recovery complete
[    7.253876][  T516] EXT4-fs (dm-0): mounted filesystem 234d0371-5f28-4c54-b9cd-d576d06b8d16 r/w with ordered data mode. Opts: errors=remount-ro,data_err=abort,prjquota
kdump: saving kbox to /kdumproot//opt//crash/127.0.0.1-2024-01-18-20:36:11/
kdump: delete /kdumproot/opt/crash/127.0.0.1-2024-01-18-20:33:22
Thu Jan 18 20:36:11 GMT 2024: kdump: start timer...
kdump: saving vmcore-dmesg.txt
watchdog interval 300 s.
kernel start 7 s.
now start timer, time interval 292 s.
kdump: saving vmcore-dmesg.txt complete
Thu Jan 18 20:36:11 GMT 2024: kdump: saving kbox.img.gz

lspci -tv output from the affected environment:
[image]
[image]
