openEuler / kernel

SATA disks with NCQ: after one of several in-flight I/Os fails, the other disks on the host also become slow (group slow-disk problem)

Done
Bug
Created on 2022-06-15 16:50

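【Title】SATA disks with NCQ: after one of several in-flight I/Os fails, the other disks on the host also become slow (group slow-disk problem)
【Environment】
Hardware:
1) Kunpeng 920
Software: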
[root@localhost 0000:b4:02.0]# cat /etc/euleros-latest
eulerversion=EulerOS_Server_V200R008C00SPC300B630
compiletime=2019-12-27-10-58-38
kernelversion=4.19.36-vhulk1907.1.0.h619
[root@localhost 0000:b4:02.0]# uname -a
Linux localhost.localdomain 4.19.36-vhulk1907.1.0.h619.eulerosv2r8.aarch64 #1 SMP Mon Jul 22 00:00:00 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
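【Steps to Reproduce】
1. Inject a UNC (uncorrectable error) fault into a SATA disk.
2. Run an I/O stress test against the disk.
3. Attach multiple disks to the disk enclosure.
【Expected Result】
When the faulty disk fails, the other disks are not noticeably affected.
【Actual Result】
I/O on the other disks stalls for about 30 seconds.
【Root Cause】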
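In NCQ mode an I/O error is reported at the disk level: the disk returns only a single SDB (Set Device Bits) frame whose tag field is all zeros, and none of the I/Os outstanding on the disk are completed. Under the upstream I/O error-handling mechanism, once the driver has handled the single failed I/O, the kernel locks the host and waits until all remaining I/Os have timed out and failed before it enters the error handler (EH). The host is unlocked only after EH has finished. As a result, the host stays locked for nearly the full timeout that was set when the I/Os were issued, which causes the slow-disk symptom.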
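
The timing described above can be illustrated with a small model. This is only a sketch under stated assumptions, not kernel code: the names `OutstandingCmd`, `eh_start_ms` and `host_stall_ms` are hypothetical, the 30 000 ms timeout is assumed from the roughly 30-second stall in this report, the issue times are made up, and the duration of EH itself is ignored. The tag list is the set of tags that appears in the attached log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OutstandingCmd:
    tag: int
    issued_at_ms: int
    timeout_ms: int = 30_000  # assumption: ~30 s per-command timeout


def eh_start_ms(cmds: List[OutstandingCmd], failed_tag: int,
                error_ms: int, fast_fail: bool) -> int:
    """Time at which EH can start, i.e. when every outstanding command
    has been marked failed (simplified model of the behaviour described)."""
    fail_times = []
    for cmd in cmds:
        if cmd.tag == failed_tag or fast_fail:
            # Failed right away: either the command that actually errored,
            # or a sibling that the driver completes with an error.
            fail_times.append(error_ms)
        else:
            # Never answered by the disk, so it fails only on timeout.
            fail_times.append(cmd.issued_at_ms + cmd.timeout_ms)
    return max(fail_times)


def host_stall_ms(cmds: List[OutstandingCmd], failed_tag: int,
                  error_ms: int, fast_fail: bool) -> int:
    """How long the host stays locked before EH even begins."""
    return eh_start_ms(cmds, failed_tag, error_ms, fast_fail) - error_ms


# Tags taken from the attached log; issue times are illustrative only.
TAGS = [15, 16, 17, 18, 19, 20, 23, 24, 25, 26, 27, 28, 29]
cmds = [OutstandingCmd(tag=t, issued_at_ms=i) for i, t in enumerate(TAGS)]

baseline = host_stall_ms(cmds, failed_tag=23, error_ms=50, fast_fail=False)
fixed = host_stall_ms(cmds, failed_tag=23, error_ms=50, fast_fail=True)
print(f"stall before EH, waiting for timeouts: {baseline} ms")
print(f"stall before EH, fast-fail:            {fixed} ms")
```

In this model the stall is set by the slowest sibling command to time out, not by the command that actually failed, which is why every disk behind the locked host sees the delay.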

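【Solution】
When an NCQ error occurs, the driver quickly returns the I/Os that the disk has not completed to the kernel as failed. This speeds up entry into error handling and eliminates the group slow-disk problem. The kernel EH mechanism then issues a READ LOG EXT command to the disk to obtain the error details, which are returned to the user immediately. (This fixes both the group slow-disk problem and the inability to obtain the disk's fault information.)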
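
To make the READ LOG EXT step more concrete, here is a hedged sketch of decoding the 512-byte NCQ Command Error log page (log address 10h). The byte offsets follow the layout that libata's EH code reads, as far as I understand it, and should be checked against the ATA specification. `NcqError`, `parse_ncq_error_log` and `build_sample_page` are hypothetical names. The sample page is synthetic, built from the values of the UNC failure in the attached log, and is not a capture from a real disk.

```python
from dataclasses import dataclass
from typing import Optional

SECTOR_SIZE = 512


@dataclass(frozen=True)
class NcqError:
    tag: int
    status: int
    error: int
    lba: int
    count: int


def parse_ncq_error_log(page: bytes) -> Optional[NcqError]:
    """Decode an NCQ Command Error log page; None if the NQ bit is set."""
    if len(page) != SECTOR_SIZE:
        raise ValueError("log page must be exactly 512 bytes")
    # All 512 bytes, including the checksum byte, should sum to 0 mod 256.
    # Rejecting a bad checksum is a choice made for this sketch.
    if sum(page) & 0xFF:
        raise ValueError("bad log page checksum")
    if page[0] & 0x80:  # NQ: the failed command was not a queued command
        return None
    lba = (page[4] | page[5] << 8 | page[6] << 16
           | page[8] << 24 | page[9] << 32 | page[10] << 40)
    count = page[12] | page[13] << 8
    return NcqError(tag=page[0] & 0x1F, status=page[2], error=page[3],
                    lba=lba, count=count)


def build_sample_page(tag: int, status: int, error: int,
                      lba: int, count: int) -> bytes:
    """Build a synthetic log page for demonstration purposes only."""
    page = bytearray(SECTOR_SIZE)
    page[0] = tag & 0x1F
    page[2] = status
    page[3] = error
    page[4:7] = (lba & 0xFFFFFF).to_bytes(3, "little")
    page[7] = 0x40  # device register value, illustrative
    page[8:11] = ((lba >> 24) & 0xFFFFFF).to_bytes(3, "little")
    page[12:14] = count.to_bytes(2, "little")
    page[511] = (-sum(page[:511])) & 0xFF  # make the page sum to zero
    return bytes(page)


sample = build_sample_page(tag=23, status=0x41, error=0x40, lba=0x2711, count=8)
err = parse_ncq_error_log(sample)
print(err)
```

With the failing tag and LBA identified this way, only the command that really hit the media error needs to be reported as such; the sibling commands were merely aborted.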

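【Attachments】
The verification results match expectations: there is no group slow-disk behavior, and in the UNC scenario the kernel immediately passes the disk's fault information to the user through the block layer. The log is as follows: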
[ 79.348807] hisi_sas_v3_hw 0000:b4:02.0: erroneous completion ncq err dev id=1 sas_addr=0x5000000000000605 CQ hdr: 0x400903 0x10684 0x0 0x80470000
[ 79.348812] sas: sas_ata_task_done: SAS error 8d
[ 79.348819] sas: sas_ata_task_done: SAS error 8d
[ 79.348820] sas: sas_ata_task_done: SAS error 8d
[ 79.348821] sas: sas_ata_task_done: SAS error 8d
[ 79.348823] sas: sas_ata_task_done: SAS error 8d
[ 79.348824] sas: sas_ata_task_done: SAS error 8d
[ 79.348825] sas: sas_ata_task_done: SAS error 8d
[ 79.348826] sas: sas_ata_task_done: SAS error 8d
[ 79.348828] sas: sas_ata_task_done: SAS error 8d
[ 79.348829] sas: sas_ata_task_done: SAS error 8d
[ 79.348830] sas: sas_ata_task_done: SAS error 8d
[ 79.348831] sas: sas_ata_task_done: SAS error 8d
[ 79.348833] sas: sas_ata_task_done: SAS error 8d
[ 79.395693] sas: Enter sas_scsi_recover_host busy: 13 failed: 13
[ 79.395696] sas: ata6: end_device-6:1: cmd error handler
[ 79.395705] sas: ata5: end_device-6:0: dev error handler
[ 79.395707] sas: ata6: end_device-6:1: dev error handler
[ 79.395744] ata6.00: exception Emask 0x0 SAct 0x3f9f8000 SErr 0x0 action 0x6
[ 79.395746] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395749] ata6.00: cmd 60/08:00:49:21:00/00:00:00:00:00/40 tag 15 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395750] ata6.00: status: { ERR }
[ 79.395751] ata6.00: error: { ABRT }
[ 79.395753] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395755] ata6.00: cmd 60/08:00:51:23:00/00:00:00:00:00/40 tag 16 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395756] ata6.00: status: { ERR }
[ 79.395757] ata6.00: error: { ABRT }
[ 79.395758] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395761] ata6.00: cmd 60/08:00:19:25:00/00:00:00:00:00/40 tag 17 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395761] ata6.00: status: { ERR }
[ 79.395762] ata6.00: error: { ABRT }
[ 79.395763] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395766] ata6.00: cmd 60/08:00:11:25:00/00:00:00:00:00/40 tag 18 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395766] ata6.00: status: { ERR }
[ 79.395767] ata6.00: error: { ABRT }
[ 79.395768] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395771] ata6.00: cmd 60/08:00:89:23:00/00:00:00:00:00/40 tag 19 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395771] ata6.00: status: { ERR }
[ 79.395772] ata6.00: error: { ABRT }
[ 79.395773] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395775] ata6.00: cmd 60/08:00:81:23:00/00:00:00:00:00/40 tag 20 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395776] ata6.00: status: { ERR }
[ 79.395777] ata6.00: error: { ABRT }
[ 79.395778] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395780] ata6.00: cmd 60/08:00:11:27:00/00:00:00:00:00/40 tag 23 ncq dma 4096 in
res 41/40:08:11:27:00/00:00:00:00:00/00 Emask 0x409 (media error)
[ 79.395781] ata6.00: status: { DRDY ERR }
[ 79.395782] ata6.00: error: { UNC }
[ 79.395783] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395785] ata6.00: cmd 60/08:00:59:21:00/00:00:00:00:00/40 tag 24 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395786] ata6.00: status: { ERR }
[ 79.395787] ata6.00: error: { ABRT }
[ 79.395788] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395790] ata6.00: cmd 60/08:00:21:25:00/00:00:00:00:00/40 tag 25 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395791] ata6.00: status: { ERR }
[ 79.395792] ata6.00: error: { ABRT }
[ 79.395793] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395795] ata6.00: cmd 60/08:00:59:26:00/00:00:00:00:00/40 tag 26 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395796] ata6.00: status: { ERR }
[ 79.395797] ata6.00: error: { ABRT }
[ 79.395798] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395800] ata6.00: cmd 60/08:00:51:26:00/00:00:00:00:00/40 tag 27 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395801] ata6.00: status: { ERR }
[ 79.395802] ata6.00: error: { ABRT }
[ 79.395803] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395805] ata6.00: cmd 60/08:00:61:21:00/00:00:00:00:00/40 tag 28 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395806] ata6.00: status: { ERR }
[ 79.395807] ata6.00: error: { ABRT }
[ 79.395808] ata6.00: failed command: READ FPDMA QUEUED
[ 79.395810] ata6.00: cmd 60/08:00:69:26:00/00:00:00:00:00/40 tag 29 ncq dma 4096 in
res 01/04:00:00:00:20/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 79.395811] ata6.00: status: { ERR }
[ 79.395811] ata6.00: error: { ABRT }
[ 79.395814] ata6: hard resetting link
[ 79.399065] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy5 phy_state=0x4
[ 79.399066] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy5 down
[ 79.556245] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy5 link_rate=10(sata)
[ 79.556251] sas: sas_form_port: phy5 belongs to port1 already(1)!
[ 79.719871] ata6.00: configured for UDMA/133
[ 79.719895] sd 6:0:1:0: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 79.719898] sd 6:0:1:0: [sdb] tag#24 Sense Key : Medium Error [current]
[ 79.719900] sd 6:0:1:0: [sdb] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
[ 79.719902] sd 6:0:1:0: [sdb] tag#24 CDB: Read(10) 28 00 00 00 27 11 00 00 08 00
[ 79.719904] print_req_error: I/O error, dev sdb, sector 10001
[ 79.719914] ata6: EH complete
[ 79.719920] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 13 tries: 1
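
The log above can be cross-checked with a short script. The sketch below decodes the `SAct` bitmask into NCQ tags and splits a `res` taskfile line into status, error and LBA. The field order (status/error:nsect:lbal:lbam:lbah/hob fields/device) is my reading of libata's EH report format and should be treated as an assumption. `sact_to_tags`, `flag_names` and `decode_res` are hypothetical helper names.

```python
from typing import Dict, List

STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def sact_to_tags(sact: int) -> List[int]:
    """Return the NCQ tags whose bits are set in the 32-bit SActive value."""
    return [bit for bit in range(32) if (sact >> bit) & 1]


def flag_names(value: int, names: Dict[int, str]) -> List[str]:
    """Names of the flags set in a status or error register value."""
    return [name for mask, name in names.items() if value & mask]


def decode_res(res: str) -> dict:
    """Decode the hex fields that follow 'res' in a libata EH report line."""
    groups = [[int(b, 16) for b in part.split(":")] for part in res.split("/")]
    (status,), (error, nsect, lbal, lbam, lbah), hob, (device,) = groups
    _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = hob
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    return {
        "status": flag_names(status, STATUS_BITS),
        "error": flag_names(error, ERROR_BITS),
        "nsect": nsect,
        "lba": lba,
        "device": device,
    }


tags = sact_to_tags(0x3F9F8000)
unc = decode_res("41/40:08:11:27:00/00:00:00:00:00/00")
print(len(tags), tags)
print(unc)
```

The 13 decoded tags match the `busy: 13 failed: 13` line, and the UNC result decodes to LBA 10001, the same sector that `print_req_error` reports for sdb.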

Comments (1)

jamyyxg created the bug

Hi jamyyxg, welcome to the openEuler Community.
I'm the Bot here serving you. You can find the instructions on how to interact with me at Here.
If you have any questions, please contact the SIG: Kernel, and any of the maintainers: @YangYingliang , @pi3orama , @成坚 (CHENG Jian) , @jiaoff , @zhengzengkai , @Qiuuuuu , @刘勇强 , @Xie XiuQi

openeuler-ci-bot added the sig/Kernel label
jamyyxg edited the description
jamyyxg edited the description
Qiuuuuu changed the task status from To Do to Done via src-openeuler/kernel Pull Request !677
