1. Original pull request:

!863: Backport CVEs and bugfixes

2. Original pull request related issue(s):

#I6MQLP:【OLK-5.10】block: fix scan partition for exclusively open device again
#I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
#I76XUJ:【OLK sync】【syzkaller】KASAN: slab-out-of-bounds in crc16
#I798WQ:【OLK-5.10】arm64:oops in cgroup_apply_control_disable

3. Original pull request related commit(s):

Sha Datetime Message
1c498218 2023-05-31 14:58:40 +0800 CST drm/virtio: Fix error code in virtio_gpu_object_shmem_init()

stable inclusion
from stable-v5.10.173
commit c5fe3fba1b7bfecb6f17f93a433782b8500fe377
category: bugfix
bugzilla: #I6IKWF:CVE-2023-22998
CVE: CVE-2023-22998

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c5fe3fba1b7bfecb6f17f93a433782b8500fe377

--------------------------------

In virtio_gpu_object_shmem_init() we are passing NULL to PTR_ERR, which
returns 0/success.

Fix this by storing the error value in the 'ret' variable before setting
shmem->pages to NULL.

Found using static analysis with Smatch.

Fixes: 64b88afbd92f ("drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling")
Signed-off-by: Harshit Mogalapalli harshit.m.mogalapalli@oracle.com
Reviewed-by: Dmitry Osipenko dmitry.osipenko@collabora.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Guo Mengqi guomengqi3@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Weilong Chen chenweilong@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
98019109 2023-05-31 14:58:39 +0800 CST drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling

stable inclusion
from stable-v5.10.171
commit 87c647def389354c95263d6635c62ca0de7d12ca
category: bugfix
bugzilla: #I6IKWF:CVE-2023-22998
CVE: CVE-2023-22998

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=87c647def389354c95263d6635c62ca0de7d12ca

--------------------------------

commit 64b88afbd92fbf434759d1896a7cf705e1c00e79 upstream.

The previous commit fixed checking of the ERR_PTR value returned by
drm_gem_shmem_get_sg_table(), but it missed zeroing out shmem->pages,
which will crash virtio_gpu_cleanup_object(). Add the missing zeroing of
shmem->pages.

Fixes: c24968734abf ("drm/virtio: Fix NULL vs IS_ERR checking in virtio_gpu_object_shmem_init")
Reviewed-by: Emil Velikov emil.l.velikov@gmail.com
Signed-off-by: Dmitry Osipenko dmitry.osipenko@collabora.com
Link: http://patchwork.freedesktop.org/patch/msgid/20220630200726.1884320-2-dmitry.osipenko@collabora.com
Signed-off-by: Gerd Hoffmann kraxel@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Ovidiu Panait ovidiu.panait@windriver.com
Signed-off-by: Guo Mengqi guomengqi3@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Weilong Chen chenweilong@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
bb502cea 2023-05-31 14:58:38 +0800 CST drm/virtio: Fix NULL vs IS_ERR checking in virtio_gpu_object_shmem_init

stable inclusion
from stable-v5.10.171
commit 0a4181b23acf53e9c95b351df6a7891116b98f9b
category: bugfix
bugzilla: #I6IKWF:CVE-2023-22998
CVE: CVE-2023-22998

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0a4181b23acf53e9c95b351df6a7891116b98f9b

--------------------------------

commit c24968734abfed81c8f93dc5f44a7b7a9aecadfa upstream.

Since the drm_prime_pages_to_sg() function returns error pointers, the
drm_gem_shmem_get_sg_table() function returns error pointers too. Use
IS_ERR() to check the return value to fix this.

Fixes: 2f2aa13724d5 ("drm/virtio: move virtio_gpu_mem_entry initialization to new function")
Signed-off-by: Miaoqian Lin linmq006@gmail.com
Link: http://patchwork.freedesktop.org/patch/msgid/20220602104223.54527-1-linmq006@gmail.com
Signed-off-by: Gerd Hoffmann kraxel@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Ovidiu Panait ovidiu.panait@windriver.com
Signed-off-by: Guo Mengqi guomengqi3@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Reviewed-by: Weilong Chen chenweilong@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
e52586f4 2023-05-31 14:58:37 +0800 CST cgroup: Stop task iteration when rebinding subsystem

hulk inclusion
category: bugfix
bugzilla: #I798WQ:【OLK-5.10】arm64:oops in cgroup_apply_control_disable
CVE: NA

----------------------------------------------------------------------

We found a refcount UAF bug as follows:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
Workqueue: events cpuset_hotplug_workfn
Call trace:
refcount_warn_saturate+0xa0/0x148
__refcount_add.constprop.0+0x5c/0x80
css_task_iter_advance_css_set+0xd8/0x210
css_task_iter_advance+0xa8/0x120
css_task_iter_next+0x94/0x158
update_tasks_root_domain+0x58/0x98
rebuild_root_domains+0xa0/0x1b0
rebuild_sched_domains_locked+0x144/0x188
cpuset_hotplug_workfn+0x138/0x5a0
process_one_work+0x1e8/0x448
worker_thread+0x228/0x3e0
kthread+0xe0/0xf0
ret_from_fork+0x10/0x20

then a kernel panic will be triggered as below:

Unable to handle kernel paging request at virtual address 00000000c0000010
Call trace:
cgroup_apply_control_disable+0xa4/0x16c
rebind_subsystems+0x224/0x590
cgroup_destroy_root+0x64/0x2e0
css_free_rwork_fn+0x198/0x2a0
process_one_work+0x1d4/0x4bc
worker_thread+0x158/0x410
kthread+0x108/0x13c
ret_from_fork+0x10/0x18

The race that causes this bug can be shown as below:

(hotplug cpu)
d73bbd3f 2023-05-31 14:58:36 +0800 CST sched/topology: Fix exceptional memory access in sd_llc_free_all()

hulk inclusion
category: bugfix
bugzilla: #I6YJJQ:【olk5.10】BUG: unable to handle kernel paging request in build_sched_domains
CVE: NA

----------------------------------------

The function sd_llc_free_all() is called to release allocated
resources when space allocation for the scheduling domain
structure fails. However, this function did not check whether sd
is a null pointer when releasing sdd resources, resulting in
the error: "Unable to handle kernel paging request at virtual
address".

Fix this issue by adding a null pointer check.

Fixes: 79bec4c643fb ("sched/topology: Provide hooks to allocate data shared per LLC")
Signed-off-by: Xia Fukun xiafukun@huawei.com
Reviewed-by: songping yu yusongping@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
1ae011cf 2023-05-31 14:58:35 +0800 CST block: Fix the partition start may overflow in add_partition()

hulk inclusion
category: bugfix
bugzilla: 187268, #I76JDY:【OLK-5.10】WARNING in iomap_apply
CVE: NA

----------------------------------------

In block_ioctl, we can pass in the unsigned number 0x8000000000000000
as an input parameter, like below:

block_ioctl
blkdev_ioctl
blkpg_ioctl
blkpg_do_ioctl
copy_from_user
bdev_add_partition
add_partition
p->start_sect = start; // start = 0x8000000000000000

Then, a warning was triggered when submitting a bio:

WARNING: CPU: 0 PID: 382 at fs/iomap/apply.c:54
Call trace:
iomap_apply+0x644/0x6e0
__iomap_dio_rw+0x5cc/0xa24
iomap_dio_rw+0x4c/0xcc
ext4_dio_read_iter
ext4_file_read_iter
ext4_file_read_iter+0x318/0x39c
call_read_iter
lo_rw_aio.isra.0+0x748/0x75c
do_req_filebacked+0x2d4/0x370
loop_handle_cmd
loop_queue_work+0x94/0x23c
kthread_worker_fn+0x160/0x6bc
loop_kthread_worker_fn+0x3c/0x50
kthread+0x20c/0x25c
ret_from_fork+0x10/0x18

Stack:

submit_bio_noacct
submit_bio_checks
blk_partition_remap
bio->bi_iter.bi_sector += p->start_sect
// bio->bi_iter.bi_sector = 0xffc0000000000000 + 65408
..
loop_queue_work
loop_handle_cmd
do_req_filebacked
pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset // pos < 0
lo_rw_aio
call_read_iter
ext4_dio_read_iter
__iomap_dio_rw
iomap_apply
ext4_iomap_begin
map.m_lblk = offset >> blkbits
ext4_set_iomap
iomap->offset = (u64) map->m_lblk << blkbits
// iomap->offset = 64512
WARN_ON(iomap.offset > pos) // iomap.offset = 64512 and pos < 0

It is unreasonable to allow start + length > disk->part0.nr_sects; there
is already a similar check in blk_add_partition().
Fix it by adding such a check in bdev_add_partition().

Signed-off-by: Zhong Jinghua zhongjinghua@huawei.com
Reviewed-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
0d4053b9 2023-05-31 14:58:34 +0800 CST ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum

stable inclusion
from stable-v5.10.180
commit 0dde3141c527b09b96bef1e7eeb18b8127810ce9
category: bugfix
bugzilla: 188791,#I76XUJ:【OLK sync】【syzkaller】KASAN: slab-out-of-bounds in crc16

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0dde3141c527b09b96bef1e7eeb18b8127810ce9

--------------------------------

commit 4f04351888a83e595571de672e0a4a8b74f4fb31 upstream.

When modifying the block device while it is mounted by the filesystem,
syzbot reported the following:

BUG: KASAN: slab-out-of-bounds in crc16+0x206/0x280 lib/crc16.c:58
Read of size 1 at addr ffff888075f5c0a8 by task syz-executor.2/15586

CPU: 1 PID: 15586 Comm: syz-executor.2 Not tainted 6.2.0-rc5-syzkaller-00205-gc96618275234 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:

__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
print_address_description+0x74/0x340 mm/kasan/report.c:306
print_report+0x107/0x1f0 mm/kasan/report.c:417
kasan_report+0xcd/0x100 mm/kasan/report.c:517
crc16+0x206/0x280 lib/crc16.c:58
ext4_group_desc_csum+0x81b/0xb20 fs/ext4/super.c:3187
ext4_group_desc_csum_set+0x195/0x230 fs/ext4/super.c:3210
ext4_mb_clear_bb fs/ext4/mballoc.c:6027 [inline]
ext4_free_blocks+0x191a/0x2810 fs/ext4/mballoc.c:6173
ext4_remove_blocks fs/ext4/extents.c:2527 [inline]
ext4_ext_rm_leaf fs/ext4/extents.c:2710 [inline]
ext4_ext_remove_space+0x24ef/0x46a0 fs/ext4/extents.c:2958
ext4_ext_truncate+0x177/0x220 fs/ext4/extents.c:4416
ext4_truncate+0xa6a/0xea0 fs/ext4/inode.c:4342
ext4_setattr+0x10c8/0x1930 fs/ext4/inode.c:5622
notify_change+0xe50/0x1100 fs/attr.c:482
do_truncate+0x200/0x2f0 fs/open.c:65
handle_truncate fs/namei.c:3216 [inline]
do_open fs/namei.c:3561 [inline]
path_openat+0x272b/0x2dd0 fs/namei.c:3714
do_filp_open+0x264/0x4f0 fs/namei.c:3741
do_sys_openat2+0x124/0x4e0 fs/open.c:1310
do_sys_open fs/open.c:1326 [inline]
__do_sys_creat fs/open.c:1402 [inline]
__se_sys_creat fs/open.c:1396 [inline]
__x64_sys_creat+0x11f/0x160 fs/open.c:1396
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f72f8a8c0c9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f72f97e3168 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 00007f72f8bac050 RCX: 00007f72f8a8c0c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000280
RBP: 00007f72f8ae7ae9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd165348bf R14: 00007f72f97e3300 R15: 0000000000022000

Replace
le16_to_cpu(sbi->s_es->s_desc_size)
with
sbi->s_desc_size

It reduces ext4's compiled text size, and makes the code more efficient
(we remove an extra indirect reference and a potential byte
swap on big endian systems), and there is no downside. It also avoids the
potential KASAN / syzkaller failure, as a bonus.

Reported-by: syzbot+fc51227e7100c9294894@syzkaller.appspotmail.com
Reported-by: syzbot+8785e41224a3afd04321@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=70d28d11ab14bd7938f3e088365252aa923cff42
Link: https://syzkaller.appspot.com/bug?id=b85721b38583ecc6b5e72ff524c67302abbc30f3
Link: https://lore.kernel.org/all/000000000000ece18705f3b20934@google.com/
Fixes: 717d50e4971b ("Ext4: Uninitialized Block Groups")
Cc: stable@vger.kernel.org
Signed-off-by: Tudor Ambarus tudor.ambarus@linaro.org
Link: https://lore.kernel.org/r/20230504121525.3275886-1-tudor.ambarus@linaro.org
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Yang Erkun yangerkun@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
6d98d507 2023-05-31 14:58:33 +0800 CST iomap: don't invalidate folios after writeback errors

mainline inclusion
from mainline-v5.19-rc1
commit e9c3a8e820ed0eeb2be05072f29f80d1b79f053b
category: bugfix
bugzilla: 188775, #I73IFH:【OLK sync】【syzkaller】WARNING in iomap_page_release

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9c3a8e820ed0eeb2be05072f29f80d1b79f053b

--------------------------------

XFS has the unique behavior (as compared to the other Linux filesystems)
that on writeback errors it will completely invalidate the affected
folio and force the page cache to reread the contents from disk. All
other filesystems leave the page mapped and up to date.

This is a rude awakening for user programs, since (in the case where
write fails but reread doesn't) file contents will appear to revert to
old disk contents with no notification other than an EIO on fsync. This
might have been annoying back in the days when iomap dealt with one page
at a time, but with multipage folios, we can now throw away megabytes
worth of data for a single write error.

On most Linux filesystems, a program can respond to an EIO on write by
redirtying the entire file and scheduling it for writeback. This isn't
foolproof, since the page that failed writeback is no longer dirty and
could be evicted, but programs that want to recover properly also
have to detect XFS and regenerate every write they've made to the file.

When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio
invalidates multipage folios that could be undergoing writeback. If,
say, we have a 256K folio caching a mix of written and unwritten
extents, it's possible that we could start writeback of the first (say)
64K of the folio and then hit a writeback error on the next 64K. We
then free the iop attached to the folio, which is really bad because
writeback completion on the first 64k will trip over the "blocks per
folio > 1 && !iop" assertion.

This can't be fixed by only invalidating the folio if writeback fails at
the start of the folio, since the folio is marked !uptodate, which trips
other assertions elsewhere. Get rid of the whole behavior entirely.

Signed-off-by: Darrick J. Wong djwong@kernel.org
Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org
Reviewed-by: Jeff Layton jlayton@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de

Conflicts:
fs/xfs/xfs_aops.c
fs/iomap/buffered-io.c

Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Yang Erkun yangerkun@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
9df33786 2023-05-31 14:58:32 +0800 CST iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor

mainline inclusion
from mainline-v5.14-rc2
commit 229adf3c64dbeae4e2f45fb561907ada9fcc0d0c
category: bugfix
bugzilla: 188764, #I736LW:【OLK sync】WARNING in iomap_writepage_map

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=229adf3c64dbeae4e2f45fb561907ada9fcc0d0c

--------------------------------

Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.

Signed-off-by: Andreas Gruenbacher agruenba@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org
Reviewed-by: Darrick J. Wong djwong@kernel.org
Signed-off-by: Darrick J. Wong djwong@kernel.org
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Yang Erkun yangerkun@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
e841540c 2023-05-31 14:58:31 +0800 CST iomap: Don't create iomap_page objects for inline files

mainline inclusion
from mainline-v5.14-rc2
commit 637d3375953e052a62c0db409557e3b3354be88a
category: bugfix
bugzilla: 188764, #I736LW:【OLK sync】WARNING in iomap_writepage_map

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=637d3375953e052a62c0db409557e3b3354be88a

--------------------------------

In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.

To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.

Signed-off-by: Andreas Gruenbacher agruenba@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Darrick J. Wong djwong@kernel.org
Signed-off-by: Darrick J. Wong djwong@kernel.org
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Yang Erkun yangerkun@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
e6eaa18c 2023-05-31 14:58:30 +0800 CST iomap: Permit pages without an iop to enter writeback

mainline inclusion
from mainline-v5.14-rc2
commit 8e1bcef8e18d0fec4afe527c074bb1fd6c2b140c
category: bugfix
bugzilla: 188764, #I736LW:【OLK sync】WARNING in iomap_writepage_map

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e1bcef8e18d0fec4afe527c074bb1fd6c2b140c

--------------------------------

Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).

Co-developed-by: Matthew Wilcox (Oracle) willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Signed-off-by: Andreas Gruenbacher agruenba@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Darrick J. Wong djwong@kernel.org
Signed-off-by: Darrick J. Wong djwong@kernel.org
Signed-off-by: Baokun Li libaokun1@huawei.com
Reviewed-by: Yang Erkun yangerkun@huawei.com
Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
7b99df55 2023-05-31 14:58:29 +0800 CST eulerfs: fix null-ptr-dereference when allocate page failed

hulk inclusion
category: bugfix
bugzilla: #I78RYS:【OLK-5.10】eulerfs: fix null-ptr-dereference when allocate page failed
CVE: NA

--------------------------------

Currently, the callers of eufs_alloc_page() and eufs_zalloc_page() expect
that allocation won't fail; otherwise a null pointer dereference will be
triggered.

Fix this problem by adding the __GFP_NOFAIL flag.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
e15e6869 2023-05-31 14:58:28 +0800 CST eulerfs: add error handling for nv_init()

hulk inclusion
category: bugfix
bugzilla: #I78RUK:【OLK-5.10】eulerfs: add error handling for nv_init()
CVE: NA

--------------------------------

Currently nv_init() doesn't handle errors, so a null pointer dereference
will be triggered if an error occurs.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
2eb22263 2023-05-31 14:58:27 +0800 CST md: fix kabi broken in struct mddev

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

Struct mddev is only used inside md/raid code; this guards against the
case where md_mod is compiled from the new kernel while raid1/raid10 or
other out-of-tree raid modules are compiled from the old kernel.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
73f974e1 2023-05-31 14:58:26 +0800 CST md: use interruptible apis in idle/frozen_sync_thread

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

Before idle and frozen were refactored out of action_store, interruptible
APIs were used so that a hungtask warning wouldn't be triggered if it
took too long to finish the idle/frozen sync_thread. This patch does the
same.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
54570486 2023-05-31 14:58:25 +0800 CST md: wake up 'resync_wait' at last in md_reap_sync_thread()

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

We just replaced md_reap_sync_thread() with wait_event(resync_wait, ...)
in action_store(); this patch makes sure action_store() will still wait
for everything to be done, as md_reap_sync_thread() did.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
1ade24b6 2023-05-31 14:58:24 +0800 CST md: refactor idle/frozen_sync_thread()

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

Our test found the following deadlock in raid10:

1) Issue a normal write, and such write failed:

raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry

// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

2) Trigger a recovery:

raid10_sync_request
raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

action_store
mddev_lock
md_unregister_thread
kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists; in order to fix the problem, we must
break one of the links. Dependencies 1 and 2 can't be broken because
they are part of the foundational design. Dependency 4 might be
breakable if it could be guaranteed that no io is inflight; however,
this requires a new mechanism which seems complex. Dependency 3 is a
good choice, because idle/frozen only require the sync thread to finish,
which can be done asynchronously (already implemented), and
'reconfig_mutex' is then no longer needed.

This patch switches 'idle' and 'frozen' to wait for the sync thread
asynchronously, and also adds a sequence counter to record how many
times the sync thread has finished, so that 'idle' won't keep waiting on
a newly started sync thread.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
1c617ac5 2023-05-31 14:58:23 +0800 CST md: add a mutex to synchronize idle and frozen in action_store()

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop the sync thread; however, this
causes a deadlock (explained in the next patch). In order to fix the
problem, a following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() do.

Consider that action_store() sets/clears 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems: for example,
frozen has just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started,
which might starve the in-progress frozen.

This patch adds a mutex to synchronize idle and frozen from
action_store().

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
e98a235f 2023-05-31 14:58:22 +0800 CST md: refactor action_store() for 'idle' and 'frozen'

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

Prepare to handle 'idle' and 'frozen' differently to fix a deadlock;
there are no functional changes except that MD_RECOVERY_RUNNING is
checked again after 'reconfig_mutex' is held.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
4a53e631 2023-05-31 14:58:21 +0800 CST Revert "md: unlock mddev before reap sync_thread in action_store"

hulk inclusion
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

--------------------------------

This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.

Because it introduces a defect where sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected
problems, for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18

This is because the work is requeued while it's still inside the workqueue:

t1:
action_store
 mddev_lock
 if (mddev->sync_thread)
  mddev_unlock
  md_unregister_thread
  // first sync_thread is done

t2:
md_check_recovery
 mddev_try_lock
 /*
  * once MD_RECOVERY_DONE is set, new sync_thread
  * can start.
  */
 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
 INIT_WORK(&mddev->del_work, md_start_sync)
 queue_work(md_misc_wq, &mddev->del_work)
  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
  // set pending bit
  insert_work
   list_add_tail
 mddev_unlock

t1 (continued):
 mddev_lock_nointr
 md_reap_sync_thread
 // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
// MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
INIT_WORK(&mddev->del_work, md_start_sync)
work->data = 0
// work pending bit is cleared
queue_work(md_misc_wq, &mddev->del_work)
insert_work
list_add_tail
// list is corrupted

This patch reverts the commit to fix the problem; the deadlock the
reverted commit tried to fix will be addressed in the following patches.

Signed-off-by: Yu Kuai yukuai3@huawei.com
Signed-off-by: Song Liu song@kernel.org
Link: https://lore.kernel.org/r/20230322064122.2384589-2-yukuai1@huaweicloud.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
dd3bd170 2023-05-31 14:58:20 +0800 CST md: unlock mddev before reap sync_thread in action_store

mainline inclusion
from mainline-v6.0-rc1
commit 9dfbdafda3b34e262e43e786077bab8e476a89d1
category: bugfix
bugzilla: #I6OMCC:【OLK-5.10】md_check_recovery-related D-state deadlock in a long-term stability test environment
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.3-rc3&id=9dfbdafda3b34e262e43e786077bab8e476a89d1

--------------------------------

Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister
sync_thread with reconfig_mutex held") is related to the action_store
path, other callers which reap sync_thread don't need to be changed.

Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
previous bug as follows:

1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure position
not changed accidentally by others.

Signed-off-by: Guoqing Jiang guoqing.jiang@linux.dev
Signed-off-by: Song Liu song@kernel.org
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
7058c39d 2023-05-31 14:58:19 +0800 CST block: fix wrong mode for blkdev_put() from disk_scan_partitions()

mainline inclusion
from mainline-v6.3-rc2
commit 428913bce1e67ccb4dae317fd0332545bf8c9233
category: bugfix
bugzilla: #I6MQLP:【OLK-5.10】block: fix scan partition for exclusively open device again
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5cfefa97bccf956ea0bb6464c1f6c84fd7a8d9f

--------------------------------

If disk_scan_partitions() is called with 'FMODE_EXCL',
blkdev_get_by_dev() will be called without 'FMODE_EXCL'; however, the
following blkdev_put() is still called with 'FMODE_EXCL', which will
cause the 'bd_holders' counter to leak.

Fix the problem by using the right mode for blkdev_put().

Reported-by: syzbot+2bcc0d79e548c4f62a59@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/f9649d501bc8c3444769418f6c26263555d9d3be.camel@linux.ibm.com/T/
Tested-by: Julian Ruess julianr@linux.ibm.com
Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Jan Kara jack@suse.cz
Signed-off-by: Jens Axboe axboe@kernel.dk
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
8f9c8fc5 2023-05-31 14:58:18 +0800 CST block: fix scan partition for exclusively open device again

mainline inclusion
from mainline-v6.3-rc1
commit e5cfefa97bccf956ea0bb6464c1f6c84fd7a8d9f
category: bugfix
bugzilla: #I6MQLP:【OLK-5.10】block: fix scan partition for exclusively open device again
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5cfefa97bccf956ea0bb6464c1f6c84fd7a8d9f

--------------------------------

As explained in commit 36369f46e917 ("block: Do not reread partition
table on exclusively open device"), rereading the partition table on a
device that is exclusively opened by someone else is problematic.

This patch makes sure the partition scan will only proceed if the
current thread opens the device exclusively, or the device is not opened
exclusively; in the latter case, other scanners and exclusive openers
will be blocked temporarily until the partition scan is done.

Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions")
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara jack@suse.cz
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Christoph Hellwig hch@lst.de
Link: https://lore.kernel.org/r/20230217022200.3092987-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe axboe@kernel.dk

Conflicts:
block/genhd.c
block/ioctl.c
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com
1c0b1b48 2023-05-31 14:58:17 +0800 CST block: merge disk_scan_partitions and blkdev_reread_part

mainline inclusion
from mainline-v5.17-rc1
commit e16e506ccd673a3a888a34f8f694698305840044
category: bugfix
bugzilla: #I6MQLP:【OLK-5.10】block: fix scan partition for exclusively open device again
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e16e506ccd673a3a888a34f8f694698305840044

--------------------------------

Unify the functionality that implements a partition rescan for a
gendisk.

Signed-off-by: Christoph Hellwig hch@lst.de
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe axboe@kernel.dk

Conflicts:
block/blk.h
block/genhd.c
block/ioctl.c
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Jialin Zhang zhangjialin11@huawei.com