From a66b953d2dfe9a56dc42246285b0b14841194b05 Mon Sep 17 00:00:00 2001
From: Stefan Roesch
Date: Sat, 25 May 2024 15:50:28 +0800
Subject: [PATCH 1/4] mm/ksm: support fork/exec for prctl

mainline inclusion
from mainline-v6.7-rc1
commit 3c6f33b7273a7e2f2b2497b62c8400bd957b2fbe
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6f33b7273a7e2f2b2497b62c8400bd957b2fbe

--------------------------------

Patch series "mm/ksm: add fork-exec support for prctl", v4.

A process can enable KSM with the prctl system call. When the process is
forked the KSM flag is inherited by the child process. However if the
process is executing an exec system call directly after the fork, the KSM
setting is cleared. This patch series addresses this problem.

1) Change the mask in coredump.h for execing a new process
2) Add a new test case in ksm_functional_tests

This patch (of 2):

Today we have two ways to enable KSM:

1) madvise system call
   This allows to enable KSM for a memory region for a long time.

2) prctl system call
   This is a recent addition to enable KSM for the complete process.
   In addition when a process is forked, the KSM setting is inherited.

This change only affects the second case.

One of the use cases for (2) was to support the ability to enable
KSM for cgroups. This allows systemd to enable KSM for the seed
process. By enabling it in the seed process all child processes inherit
the setting.

This works correctly when the process is forked. However it doesn't
support fork/exec workflow.

From the previous cover letter:

....
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs
(if they are part of the same security domain).
Only a higher level entity like a job scheduler or container can know
for certain if it's running one or more instances of a job. That job
scheduler however doesn't have the necessary internal workload knowledge
to make targeted madvise calls.
....

In addition it can also be a bit surprising that fork keeps the KSM
setting and fork/exec does not.

Link: https://lkml.kernel.org/r/20230922211141.320789-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230922211141.320789-2-shr@devkernel.io
Signed-off-by: Stefan Roesch
Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process")
Reviewed-by: David Hildenbrand
Reported-by: Carl Klemm
Tested-by: Carl Klemm
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton

Conflicts:
	include/linux/sched/coredump.h
[Context conflicts.]
Signed-off-by: Jinjiang Tu
---
 include/linux/sched/coredump.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 6a4d85c7a5f3..103ca84e379c 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -70,13 +70,15 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
+#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
 #define MMF_OOM_VICTIM 25 /* mm is the oom victim */
 #define MMF_OOM_REAP_QUEUED 26 /* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS 27 /* mm is shared between processes */
-#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
 
+#define MMF_VM_MERGE_ANY 29
+#define MMF_VM_MERGE_ANY_MASK (1 << MMF_VM_MERGE_ANY)
 #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_VM_MERGE_ANY_MASK)
 
 #define MMF_VM_MERGE_ANY 29
 #endif /* _LINUX_SCHED_COREDUMP_H */
-- 
Gitee

From c5723c0b650f86e6c8ba069ac466b30855692b3f Mon Sep 17 00:00:00 2001
From: Jinjiang Tu
Date: Sat, 25 May 2024 15:50:29 +0800
Subject: [PATCH 2/4] mm/ksm: fix ksm exec support for prctl

mainline inclusion
from mainline
commit 3a9e567ca45fb5280065283d10d9a11f0db61d2b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3a9e567ca45fb5280065283d10d9a11f0db61d2b

--------------------------------

Patch series "mm/ksm: fix ksm exec support for prctl", v4.

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't
create the mm_slot, so ksmd will not try to scan this task. The first
patch fixes the issue.

The second patch refactors to prepare for the third patch. The third
patch extends the selftests of ksm to verify the deduplication really
happens after fork/exec inherits the KSM setting.

This patch (of 3):

commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't
create the mm_slot, so ksmd will not try to scan this task.

To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init()
when the mm has MMF_VM_MERGE_ANY flag.

Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com
Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl")
Signed-off-by: Jinjiang Tu
Reviewed-by: David Hildenbrand
Cc: Johannes Weiner
Cc: Kefeng Wang
Cc: Nanyong Sun
Cc: Rik van Riel
Cc: Stefan Roesch
Signed-off-by: Andrew Morton

Conflicts:
	fs/exec.c
[Context conflicts, and use __GENKSYMS__ to avoid kabi breakage warning.]
Signed-off-by: Jinjiang Tu
---
 fs/exec.c           | 13 +++++++++++++
 include/linux/ksm.h | 13 +++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 792d62632e92..43378e25abcb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,9 @@
 #include
 #include
 #include
+#ifndef __GENKSYMS__
+#include
+#endif
 
 #include
 #include
@@ -252,6 +255,14 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		goto err_free;
 	}
 
+	/*
+	 * Need to be called with mmap write lock
+	 * held, to avoid race with ksmd.
+	 */
+	err = ksm_execve(mm);
+	if (err)
+		goto err_ksm;
+
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -273,6 +284,8 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	bprm->p = vma->vm_end - sizeof(void *);
 	return 0;
 err:
+	ksm_exit(mm);
+err_ksm:
 	mmap_write_unlock(mm);
 err_free:
 	bprm->vma = NULL;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 4e02e8a770a9..debef5446114 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -45,6 +45,14 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 	return 0;
 }
 
+static inline int ksm_execve(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
+		return __ksm_enter(mm);
+
+	return 0;
+}
+
 static inline void ksm_exit(struct mm_struct *mm)
 {
 	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
@@ -83,6 +91,11 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 	return 0;
 }
 
+static inline int ksm_execve(struct mm_struct *mm)
+{
+	return 0;
+}
+
 static inline void ksm_exit(struct mm_struct *mm)
 {
 }
-- 
Gitee

From 87a2560d70bdd0268862637184a0c25e2afc7e51 Mon Sep 17 00:00:00 2001
From: Jinjiang Tu
Date: Sat, 25 May 2024 15:50:30 +0800
Subject: [PATCH 3/4] mm/memcontrol: add ksm state for memcg

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87

----------------------------------------

Add KSM state for memcg; the valid values are 0 and 1. When changing
auto_ksm_enabled from 0 to 1, enable KSM for tasks in the memcg. When
changing auto_ksm_enabled from 1 to 0, disable KSM for tasks in the
memcg. If enable/disable fails, return the error code and don't change
auto_ksm_enabled.

If the auto_ksm_enabled of a child memcg differs from the new value, also
enable/disable KSM for the tasks in that memcg. If enable/disable for a
child memcg fails, stop traversing child memcgs and return the error code.

When writing the same value to auto_ksm_enabled of the memcg, i.e. from
0 to 0 or from 1 to 1, do nothing.

Signed-off-by: Jinjiang Tu
---
 include/linux/memcontrol.h |  4 ++++
 mm/memcontrol.c            | 30 +++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 287c54141a90..ef3a6a8e640f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -414,7 +414,11 @@ struct mem_cgroup {
 #else
 	KABI_RESERVE(7)
 #endif
+#ifdef CONFIG_KSM
+	KABI_USE(8, bool auto_ksm_enabled)
+#else
 	KABI_RESERVE(8)
+#endif
 
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db44ade93455..52248cfa9140 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5772,7 +5772,7 @@ static ssize_t memcg_high_async_ratio_write(struct kernfs_open_file *of,
 }
 
 #ifdef CONFIG_KSM
-static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
+static int __memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
 {
 	struct task_struct *task;
 	struct mm_struct *mm;
@@ -5806,6 +5806,27 @@ static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
 	return ret;
 }
 
+static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
+{
+	struct mem_cgroup *iter;
+	int ret = 0;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (READ_ONCE(iter->auto_ksm_enabled) == enable)
+			continue;
+
+		ret = __memcg_set_ksm_for_tasks(iter, enable);
+		if (ret) {
+			mem_cgroup_iter_break(memcg, iter);
+			break;
+		}
+
+		WRITE_ONCE(iter->auto_ksm_enabled, enable);
+	}
+
+	return ret;
+}
+
 static int memory_ksm_show(struct seq_file *m, void *v)
 {
 	unsigned long ksm_merging_pages = 0;
@@ -5833,6 +5854,7 @@ static int memory_ksm_show(struct seq_file *m, void *v)
 	}
 	css_task_iter_end(&it);
 
+	seq_printf(m, "auto ksm enabled: %d\n", READ_ONCE(memcg->auto_ksm_enabled));
 	seq_printf(m, "merge any tasks: %u\n", tasks);
 	seq_printf(m, "ksm_rmap_items %lu\n", ksm_rmap_items);
 	seq_printf(m, "ksm_merging_pages %lu\n", ksm_merging_pages);
@@ -5855,6 +5877,9 @@ static ssize_t memory_ksm_write(struct kernfs_open_file *of, char *buf,
 	if (err)
 		return err;
 
+	if (READ_ONCE(memcg->auto_ksm_enabled) == enable)
+		return nbytes;
+
 	err = memcg_set_ksm_for_tasks(memcg, enable);
 	if (err)
 		return err;
@@ -6430,6 +6455,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	}
 
 	hugetlb_pool_inherit(memcg, parent);
+#ifdef CONFIG_KSM
+	memcg->auto_ksm_enabled = READ_ONCE(parent->auto_ksm_enabled);
+#endif
 
 	error = memcg_online_kmem(memcg);
 	if (error)
-- 
Gitee

From decc704ff781c2adaf3bb7625fcab017b26e0785 Mon Sep 17 00:00:00 2001
From: Jinjiang Tu
Date: Sat, 25 May 2024 15:50:31 +0800
Subject: [PATCH 4/4] mm/memcontrol: enable KSM for tasks moving to new memcg

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9GT87

----------------------------------------

When a task moves to a new memcg, enable KSM for the task if the
auto_ksm_enabled of the memcg is 1.

Signed-off-by: Jinjiang Tu
---
 mm/memcontrol.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 52248cfa9140..9007c3554771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5886,6 +5886,39 @@ static ssize_t memory_ksm_write(struct kernfs_open_file *of, char *buf,
 
 	return nbytes;
 }
+
+static void memcg_attach_ksm(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+	struct task_struct *task;
+
+	cgroup_taskset_first(tset, &css);
+	memcg = mem_cgroup_from_css(css);
+	if (!READ_ONCE(memcg->auto_ksm_enabled))
+		return;
+
+	cgroup_taskset_for_each(task, css, tset) {
+		struct mm_struct *mm = get_task_mm(task);
+
+		if (!mm)
+			continue;
+
+		if (mmap_write_lock_killable(mm)) {
+			mmput(mm);
+			continue;
+		}
+
+		ksm_enable_merge_any(mm);
+
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+}
+#else
+static inline void memcg_attach_ksm(struct cgroup_taskset *tset)
+{
+}
 #endif /* CONFIG_KSM */
 
 #ifdef CONFIG_CGROUP_V1_WRITEBACK
@@ -7373,6 +7406,12 @@ static void mem_cgroup_move_charge(void)
 	atomic_dec(&mc.from->moving_account);
 }
 
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		memcg_attach_ksm(tset);
+}
+
 static void mem_cgroup_move_task(void)
 {
 	if (mc.to) {
@@ -7388,6 +7427,9 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
 {
 }
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
 static void mem_cgroup_move_task(void)
 {
 }
@@ -7651,6 +7693,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
+	.attach = mem_cgroup_attach,
 	.post_attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
 	.dfl_cftypes = memory_files,
-- 
Gitee