diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bd1204178a40dbc0d580f24cef1beab8d98830e5..7b151e24a562d765f23d0fe3b69861cbe8b4a5b5 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2538,6 +2538,31 @@ kgdbwait [KGDB] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 1f883abf3f00fa7311b0d16081039c74dfcf2919..1ade7df54c2b09ea5fc23f278e66f6c3412183c7 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -42,3 +42,4 @@ the Linux memory management. transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 0000000000000000000000000000000000000000..6dc18ed4b8861c7c76e81b795eb0a65af8319c3c --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts `. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase ` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 7a3a08d81f111c272b41d3fad86187769f35bd93..919d483a6aa4f068b7bc06863ebd6586e442cefe 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -107,6 +107,7 @@ more memory-management documentation in Documentation/mm/index.rst. pin_user_pages boot-time-mm gfp_mask-from-fs-io + kho/index Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/kho/bindings/kho.yaml b/Documentation/core-api/kho/bindings/kho.yaml new file mode 100644 index 0000000000000000000000000000000000000000..11e8ab7b219d99997d997c6f5050f9d696f3ed8b --- /dev/null +++ b/Documentation/core-api/kho/bindings/kho.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Kexec HandOver (KHO) root tree + +maintainers: + - Mike Rapoport + - Changyuan Lyu + +description: | + System memory preserved by KHO across kexec. + +properties: + compatible: + enum: + - kho-v1 + + preserved-memory-map: + description: | + physical address (u64) of an in-memory structure describing all preserved + folios and memory ranges. + +patternProperties: + "$[0-9a-f_]+^": + $ref: sub-fdt.yaml# + description: physical address of a KHO user's own FDT. + +required: + - compatible + - preserved-memory-map + +additionalProperties: false + +examples: + - | + kho { + compatible = "kho-v1"; + preserved-memory-map = <0xf0be16 0x1000000>; + + memblock { + fdt = <0x80cc16 0x1000000>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/memblock.yaml b/Documentation/core-api/kho/bindings/memblock/memblock.yaml new file mode 100644 index 0000000000000000000000000000000000000000..d388c28eb91d1ad35f1ac52871af8708fd899484 --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/memblock.yaml @@ -0,0 +1,39 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory + +maintainers: + - Mike Rapoport + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + The post-KHO kernel can then consume these reservations and they are + guaranteed to have the same physical address. + +properties: + compatible: + enum: + - reserve-mem-v1 + +patternProperties: + "$[0-9a-f_]+^": + $ref: reserve-mem.yaml# + description: reserved memory regions + +required: + - compatible + +additionalProperties: false + +examples: + - | + memblock { + compatible = "memblock-v1"; + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml new file mode 100644 index 0000000000000000000000000000000000000000..10282d3d1bcdcea1aa699413a5d7c781a62e34a4 --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml @@ -0,0 +1,40 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory regions + +maintainers: + - Mike Rapoport + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + This object describes each such region. + +properties: + compatible: + enum: + - reserve-mem-v1 + + start: + description: | + physical address (u64) of the reserved memory region. + + size: + description: | + size (u64) of the reserved memory region. + +required: + - compatible + - start + - size + +additionalProperties: false + +examples: + - | + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; diff --git a/Documentation/core-api/kho/bindings/sub-fdt.yaml b/Documentation/core-api/kho/bindings/sub-fdt.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b9a3d2d2485019d64823ca08a2e1f91c663e7f07 --- /dev/null +++ b/Documentation/core-api/kho/bindings/sub-fdt.yaml @@ -0,0 +1,27 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: KHO users' FDT address + +maintainers: + - Mike Rapoport + - Changyuan Lyu + +description: | + Physical address of an FDT blob registered by a KHO user. + +properties: + fdt: + description: | + physical address (u64) of an FDT blob. + +required: + - fdt + +additionalProperties: false + +examples: + - | + memblock { + fdt = <0x80cc16 0x1000000>; + }; diff --git a/Documentation/core-api/kho/concepts.rst b/Documentation/core-api/kho/concepts.rst new file mode 100644 index 0000000000000000000000000000000000000000..36d5c05cfb307ed318deb7401ee5ff8dbeebc6f4 --- /dev/null +++ b/Documentation/core-api/kho/concepts.rst @@ -0,0 +1,74 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later +.. _kho-concepts: + +======================= +Kexec Handover Concepts +======================= + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +It introduces multiple concepts: + +KHO FDT +======= + +Every KHO kexec carries a KHO specific flattened device tree (FDT) blob +that describes preserved memory regions. These regions contain either +serialized subsystem states, or in-memory data that shall not be touched +across kexec. After KHO, subsystems can retrieve and restore preserved +memory regions from KHO FDT. + +KHO only uses the FDT container format and libfdt library, but does not +adhere to the same property semantics that normal device trees do: Properties +are passed in native endianness and standardized properties like ``regs`` and +``ranges`` do not exist, hence there are no ``#...-cells`` properties. + +KHO is still under development. The FDT schema is unstable and would change +in the future. + +Scratch Regions +=============== + +To boot into kexec, we need to have a physically contiguous memory range that +contains no handed over memory. Kexec then places the target kernel and initrd +into that region. The new kernel exclusively uses this region for memory +allocations before during boot up to the initialization of the page allocator. + +We guarantee that we always have such regions through the scratch regions: On +first boot KHO allocates several physically contiguous memory regions. Since +after kexec these regions will be used by early memory allocations, there is a +scratch region per NUMA node plus a scratch region to satisfy allocations +requests that do not require particular NUMA node assignment. +By default, size of the scratch region is calculated based on amount of memory +allocated during boot. The ``kho_scratch`` kernel command line option may be +used to explicitly define size of the scratch regions. +The scratch regions are declared as CMA when page allocator is initialized so +that their memory can be used during system lifetime. CMA gives us the +guarantee that no handover pages land in that region, because handover pages +must be at a static physical memory location and CMA enforces that only +movable pages can be located inside. + +After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and +instead reuse the exact same region that was originally allocated. This allows +us to recursively execute any amount of KHO kexecs. Because we used this region +for boot memory allocations and as target memory for kexec blobs, some parts +of that memory region may be reserved. These reservations are irrelevant for +the next KHO, because kexec can overwrite even the original kernel. + +.. _kho-finalization-phase: + +KHO finalization phase +====================== + +To enable user space based kexec file loader, the kernel needs to be able to +provide the FDT that describes the current kernel's state before +performing the actual kexec. The process of generating that FDT is +called serialization. When the FDT is generated, some properties +of the system may become immutable because they are already written down +in the FDT. That state is called the KHO finalization phase. + +Public API +========== +.. kernel-doc:: kernel/kexec_handover.c + :export: diff --git a/Documentation/core-api/kho/fdt.rst b/Documentation/core-api/kho/fdt.rst new file mode 100644 index 0000000000000000000000000000000000000000..62505285d60d6af8d779e44412b5b770272ec52a --- /dev/null +++ b/Documentation/core-api/kho/fdt.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======= +KHO FDT +======= + +KHO uses the flattened device tree (FDT) container format and libfdt +library to create and parse the data that is passed between the +kernels. The properties in KHO FDT are stored in native format. +It includes the physical address of an in-memory structure describing +all preserved memory regions, as well as physical addresses of KHO users' +own FDTs. Interpreting those sub FDTs is the responsibility of KHO users. + +KHO nodes and properties +======================== + +Property ``preserved-memory-map`` +--------------------------------- + +KHO saves a special property named ``preserved-memory-map`` under the root node. +This node contains the physical address of an in-memory structure for KHO to +preserve memory regions across kexec. + +Property ``compatible`` +----------------------- + +The ``compatible`` property determines compatibility between the kernel +that created the KHO FDT and the kernel that attempts to load it. +If the kernel that loads the KHO FDT is not compatible with it, the entire +KHO process will be bypassed. + +Property ``fdt`` +---------------- + +Generally, a KHO user serialize its state into its own FDT and instructs +KHO to preserve the underlying memory, such that after kexec, the new kernel +can recover its state from the preserved FDT. + +A KHO user thus can create a node in KHO root tree and save the physical address +of its own FDT in that node's property ``fdt`` . + +Examples +======== + +The following example demonstrates KHO FDT that preserves two memory +regions created with ``reserve_mem`` kernel command line parameter:: + + /dts-v1/; + + / { + compatible = "kho-v1"; + + preserved-memory-map = <0x40be16 0x1000000>; + + memblock { + fdt = <0x1517 0x1000000>; + }; + }; + +where the ``memblock`` node contains an FDT that is requested by the +subsystem memblock for preservation. The FDT contains the following +serialized data:: + + /dts-v1/; + + / { + compatible = "memblock-v1"; + + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + + n2 { + compatible = "reserve-mem-v1"; + start = <0xc067 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..0c63b0c5c14363e3522fa9382c4471b35d808f36 --- /dev/null +++ b/Documentation/core-api/kho/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======================== +Kexec Handover Subsystem +======================== + +.. toctree:: + :maxdepth: 1 + + concepts + fdt + +.. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index bd9daeb31fe694798fea88cf554e36c44a38736d..88da436166ed2d130fdd1f4cdcc45f974ca9d4b1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11667,6 +11667,17 @@ F: include/linux/kexec.h F: include/uapi/linux/kexec.h F: kernel/kexec* +KEXEC HANDOVER (KHO) +M: Alexander Graf +M: Mike Rapoport +M: Changyuan Lyu +L: kexec@lists.infradead.org +S: Maintained +F: Documentation/admin-guide/mm/kho.rst +F: Documentation/core-api/kho/* +F: include/linux/kexec_handover.h +F: kernel/kexec_handover.c + KEYS-ENCRYPTED M: Mimi Zohar L: linux-integrity@vger.kernel.org @@ -13761,6 +13772,7 @@ M: Mike Rapoport L: linux-mm@kvack.org S: Maintained F: Documentation/core-api/boot-time-mm.rst +F: Documentation/core-api/kho/bindings/memblock/* F: include/linux/memblock.h F: mm/memblock.c F: mm/mm_init.c diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 2e2ff6f4b44bac80fd11893bc3b7610b3c9815bc..cf18ef63c7247c5700afd76f2827d23da01659ac 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1600,6 +1600,9 @@ config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG def_bool y +config ARCH_SUPPORTS_KEXEC_HANDOVER + def_bool y + config ARCH_SUPPORTS_CRASH_DUMP def_bool y diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index a3c144c29c813b38b95778e563311a3069e49199..fefd43775ad78f75e1515255e37b2711ae37f877 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2126,6 +2126,9 @@ config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG config ARCH_SUPPORTS_KEXEC_JUMP def_bool y +config ARCH_SUPPORTS_KEXEC_HANDOVER + def_bool X86_64 + config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index dec961c6d16a7c476e509a39614176682884b6b1..948da3b01cac46c31418e789a4445fcfebb0cb2f 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -789,6 +789,49 @@ static void process_e820_entries(unsigned long minimum, } } +/* + * If KHO is active, only process its scratch areas to ensure we are not + * stepping onto preserved memory. + */ +static bool process_kho_entries(unsigned long minimum, unsigned long image_size) +{ + struct kho_scratch *kho_scratch; + struct setup_data *ptr; + struct kho_data *kho; + int i, nr_areas = 0; + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) + return false; + + ptr = (struct setup_data *)(unsigned long)boot_params_ptr->hdr.setup_data; + while (ptr) { + if (ptr->type == SETUP_KEXEC_KHO) { + kho = (struct kho_data *)(unsigned long)ptr->data; + kho_scratch = (void *)(unsigned long)kho->scratch_addr; + nr_areas = kho->scratch_size / sizeof(*kho_scratch); + break; + } + + ptr = (struct setup_data *)(unsigned long)ptr->next; + } + + if (!nr_areas) + return false; + + for (i = 0; i < nr_areas; i++) { + struct kho_scratch *area = &kho_scratch[i]; + struct mem_vector region = { + .start = area->addr, + .size = area->size, + }; + + if (process_mem_region(®ion, minimum, image_size)) + break; + } + + return true; +} + static unsigned long find_random_phys_addr(unsigned long minimum, unsigned long image_size) { @@ -804,7 +847,12 @@ static unsigned long find_random_phys_addr(unsigned long minimum, return 0; } - if (!process_efi_entries(minimum, image_size)) + /* + * During kexec handover only process KHO scratch areas that are known + * not to contain any data that must be preserved. + */ + if (!process_kho_entries(minimum, image_size) && + !process_efi_entries(minimum, image_size)) process_e820_entries(minimum, image_size); phys_addr = slots_fetch_random(); diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h index 5c83729c8e71ff5a287095211829cf197365a728..02735671deb769ee63421440ad870f93d39c899b 100644 --- a/arch/x86/include/asm/setup.h +++ b/arch/x86/include/asm/setup.h @@ -64,6 +64,8 @@ extern void x86_ce4100_early_setup(void); static inline void x86_ce4100_early_setup(void) { } #endif +#include + #ifndef _SETUP #include diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h index eeea058cf6028ec15cd7a0524b1eca7824775e4f..bacdd33026557a5eb277289c3cd9c57022b5df59 100644 --- a/arch/x86/include/uapi/asm/bootparam.h +++ b/arch/x86/include/uapi/asm/bootparam.h @@ -13,7 +13,8 @@ #define SETUP_CC_BLOB 7 #define SETUP_IMA 8 #define SETUP_RNG_SEED 9 -#define SETUP_ENUM_MAX SETUP_RNG_SEED +#define SETUP_KEXEC_KHO 10 +#define SETUP_ENUM_MAX SETUP_KEXEC_KHO #define SETUP_INDIRECT (1<<31) #define SETUP_TYPE_MAX (SETUP_ENUM_MAX | SETUP_INDIRECT) @@ -182,6 +183,16 @@ struct ima_setup_data { __u64 size; } __attribute__((packed)); +/* + * Locations of kexec handover metadata + */ +struct kho_data { + __u64 fdt_addr; + __u64 fdt_size; + __u64 scratch_addr; + __u64 scratch_size; +} __attribute__((packed)); + /* The so-called "zeropage" */ struct boot_params { struct screen_info screen_info; /* 0x000 */ diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 1f6fb8e85e0f60d2fbebddf0f19f7b955808e677..77188777797c292755ac1d94b72b6284da505d97 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1344,6 +1344,24 @@ void __init e820__memblock_setup(void) memblock_add(entry->addr, entry->size); } + /* + * At this point memblock is only allowed to allocate from memory + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set + * up in init_mem_mapping(). + * + * KHO kernels are special and use only scratch memory for memblock + * allocations, but memory below 1M is ignored by kernel after early + * boot and cannot be naturally marked as scratch. + * + * To allow allocation of the real-mode trampoline and a few (if any) + * other very early allocations from below 1M forcibly mark the memory + * below 1M as scratch. + * + * After real mode trampoline is allocated, we clear that scratch + * marking. + */ + memblock_mark_kho_scratch(0, SZ_1M); + /* Throw away partial pages: */ memblock_trim_memory(PAGE_SIZE); diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index a61c12c012709799ba0b2e1f11295b0861a96122..4609f29487accb66e4161007da7cce97ac102548 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -233,6 +233,32 @@ setup_ima_state(const struct kimage *image, struct boot_params *params, #endif /* CONFIG_IMA_KEXEC */ } +static void setup_kho(const struct kimage *image, struct boot_params *params, + unsigned long params_load_addr, + unsigned int setup_data_offset) +{ + struct setup_data *sd = (void *)params + setup_data_offset; + struct kho_data *kho = (void *)sd + sizeof(*sd); + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) + return; + + sd->type = SETUP_KEXEC_KHO; + sd->len = sizeof(struct kho_data); + + /* Only add if we have all KHO images in place */ + if (!image->kho.fdt || !image->kho.scratch) + return; + + /* Add setup data */ + kho->fdt_addr = image->kho.fdt; + kho->fdt_size = PAGE_SIZE; + kho->scratch_addr = image->kho.scratch->mem; + kho->scratch_size = image->kho.scratch->bufsz; + sd->next = params->hdr.setup_data; + params->hdr.setup_data = params_load_addr + setup_data_offset; +} + static int setup_boot_parameters(struct kimage *image, struct boot_params *params, unsigned long params_load_addr, @@ -305,6 +331,13 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params, sizeof(struct ima_setup_data); } + if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) { + /* Setup space to store preservation metadata */ + setup_kho(image, params, params_load_addr, setup_data_offset); + setup_data_offset += sizeof(struct setup_data) + + sizeof(struct kho_data); + } + /* Setup RNG seed */ setup_rng_seed(params, params_load_addr, setup_data_offset); @@ -470,6 +503,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel, kbuf.bufsz += sizeof(struct setup_data) + sizeof(struct ima_setup_data); + if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) + kbuf.bufsz += sizeof(struct setup_data) + + sizeof(struct kho_data); + params = kzalloc(kbuf.bufsz, GFP_KERNEL); if (!params) return ERR_PTR(-ENOMEM); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 93dc119c8e2e89aa17ac94926a0c010ffd4f67f3..9a1bc98686c4d2f287a282057215625f87d45668 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -219,8 +219,8 @@ static void __init cleanup_highmap(void) static void __init reserve_brk(void) { if (_brk_end > _brk_start) - memblock_reserve(__pa_symbol(_brk_start), - _brk_end - _brk_start); + memblock_reserve_kern(__pa_symbol(_brk_start), + _brk_end - _brk_start); /* Mark brk area as locked down and no longer taking any new allocations */ @@ -318,7 +318,7 @@ static void __init early_reserve_initrd(void) !ramdisk_image || !ramdisk_size) return; /* No initrd provided by bootloader */ - memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image); + memblock_reserve_kern(ramdisk_image, ramdisk_end - ramdisk_image); } static void __init reserve_initrd(void) @@ -371,7 +371,7 @@ static void __init add_early_ima_buffer(u64 phys_addr) } if (data->size) { - memblock_reserve(data->addr, data->size); + memblock_reserve_kern(data->addr, data->size); ima_kexec_buffer_phys = data->addr; ima_kexec_buffer_size = data->size; } @@ -409,6 +409,29 @@ int __init ima_get_kexec_buffer(void **addr, size_t *size) } #endif +static void __init add_kho(u64 phys_addr, u32 data_len) +{ + struct kho_data *kho; + u64 addr = phys_addr + sizeof(struct setup_data); + u64 size = data_len - sizeof(struct setup_data); + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) { + pr_warn("Passed KHO data, but CONFIG_KEXEC_HANDOVER not set. Ignoring.\n"); + return; + } + + kho = early_memremap(addr, size); + if (!kho) { + pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n", + addr, size); + return; + } + + kho_populate(kho->fdt_addr, kho->fdt_size, kho->scratch_addr, kho->scratch_size); + + early_memunmap(kho, size); +} + static void __init parse_setup_data(void) { struct setup_data *data; @@ -437,6 +460,9 @@ static void __init parse_setup_data(void) case SETUP_IMA: add_early_ima_buffer(pa_data); break; + case SETUP_KEXEC_KHO: + add_kho(pa_data, data_len); + break; case SETUP_RNG_SEED: data = early_memremap(pa_data, data_len); add_bootloader_randomness(data->data, data->len); @@ -471,7 +497,7 @@ static void __init memblock_x86_reserve_range_setup_data(void) len = sizeof(*data); pa_next = data->next; - memblock_reserve(pa_data, sizeof(*data) + data->len); + memblock_reserve_kern(pa_data, sizeof(*data) + data->len); if (data->type == SETUP_INDIRECT) { len += data->len; @@ -485,7 +511,7 @@ static void __init memblock_x86_reserve_range_setup_data(void) indirect = (struct setup_indirect *)data->data; if (indirect->type != SETUP_INDIRECT) - memblock_reserve(indirect->addr, indirect->len); + memblock_reserve_kern(indirect->addr, indirect->len); } pa_data = pa_next; @@ -798,8 +824,8 @@ static void __init early_reserve_memory(void) * __end_of_kernel_reserve symbol must be explicitly reserved with a * separate memblock_reserve() or they will be discarded. */ - memblock_reserve(__pa_symbol(_text), - (unsigned long)__end_of_kernel_reserve - (unsigned long)_text); + memblock_reserve_kern(__pa_symbol(_text), + (unsigned long)__end_of_kernel_reserve - (unsigned long)_text); /* * The first 4Kb of memory is a BIOS owned area, but generally it is diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 788e5559549f39ab72573fb582d5964ce3490ed0..8194556eeecbb5cd07f7b965d30f8d78382a2007 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) * setup_arch(). */ memblock_reserve(0, SZ_1M); + + memblock_clear_kho_scratch(0, SZ_1M); } static void __init sme_sev_setup_real_mode(struct trampoline_header *th) diff --git a/drivers/firmware/efi/efi-init.c b/drivers/firmware/efi/efi-init.c index 59b0d7197b68529bc2408708d7700da225fc94fa..6583114898d5724c91160936b8bc7020e4d50c4c 100644 --- a/drivers/firmware/efi/efi-init.c +++ b/drivers/firmware/efi/efi-init.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -155,12 +156,32 @@ static __init void reserve_regions(void) pr_info("Processing EFI memory map:\n"); /* - * Discard memblocks discovered so far: if there are any at this - * point, they originate from memory nodes in the DT, and UEFI - * uses its own memory map instead. + * Discard memblocks discovered so far except for KHO scratch + * regions. Most memblocks at this point originate from memory nodes + * in the DT and UEFI uses its own memory map instead. However, if + * KHO is enabled, scratch regions, which are good known memory + * must be preserved. */ memblock_dump_all(); - memblock_remove(0, PHYS_ADDR_MAX); + + if (is_kho_boot()) { + struct memblock_region *r; + + /* Remove all non-KHO regions */ + for_each_mem_region(r) { + if (!memblock_is_kho_scratch(r)) { + memblock_remove(r->base, r->size); + r--; + } + } + } else { + /* + * KHO is disabled. Discard memblocks discovered so far: + * if there are any at this point, they originate from memory + * nodes in the DT, and UEFI uses its own memory map instead. + */ + memblock_remove(0, PHYS_ADDR_MAX); + } for_each_efi_memory_desc(md) { paddr = md->phys_addr; diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index 366fbdc56dec10473296112055c56f6beed91820..7bf62ea0867ac1957f6d7d87a6fc64abe01ba0ef 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -26,6 +26,7 @@ #include #include #include +#include #include /* for COMMAND_LINE_SIZE */ #include @@ -1007,6 +1008,36 @@ void __init early_init_dt_check_for_usable_mem_range(void) memblock_add(rgn[i].base, rgn[i].size); } +/** + * early_init_dt_check_kho - Decode info required for kexec handover from DT + */ +static void __init early_init_dt_check_kho(void) +{ + unsigned long node = chosen_node_offset; + u64 fdt_start, fdt_size, scratch_start, scratch_size; + const __be32 *p; + int l; + + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0) + return; + + p = of_get_flat_dt_prop(node, "linux,kho-fdt", &l); + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32)) + return; + + fdt_start = dt_mem_next_cell(dt_root_addr_cells, &p); + fdt_size = dt_mem_next_cell(dt_root_addr_cells, &p); + + p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l); + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32)) + return; + + scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p); + scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p); + + kho_populate(fdt_start, fdt_size, scratch_start, scratch_size); +} + #ifdef CONFIG_SERIAL_EARLYCON int __init early_init_dt_scan_chosen_stdout(void) @@ -1306,6 +1337,9 @@ void __init early_init_dt_scan_nodes(void) /* Handle linux,usable-memory-range property */ early_init_dt_check_for_usable_mem_range(); + + /* Handle kexec handover */ + early_init_dt_check_kho(); } bool __init early_init_dt_scan(void *dt_virt, phys_addr_t dt_phys) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 3b98a57f1f0743e0dee21e68e84d4644a1f9d0bf..46650f5948acf795d8d14ff5213fe46f5be08ec0 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -264,6 +264,43 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt, } #endif /* CONFIG_IMA_KEXEC */ +static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node) +{ + int ret = 0; +#ifdef CONFIG_KEXEC_HANDOVER + phys_addr_t fdt_mem = 0; + phys_addr_t fdt_len = 0; + phys_addr_t scratch_mem = 0; + phys_addr_t scratch_len = 0; + + ret = fdt_delprop(fdt, chosen_node, "linux,kho-fdt"); + if (ret && ret != -FDT_ERR_NOTFOUND) + return ret; + ret = fdt_delprop(fdt, chosen_node, "linux,kho-scratch"); + if (ret && ret != -FDT_ERR_NOTFOUND) + return ret; + + if (!image->kho.fdt || !image->kho.scratch) + return 0; + + fdt_mem = image->kho.fdt; + fdt_len = PAGE_SIZE; + scratch_mem = image->kho.scratch->mem; + scratch_len = image->kho.scratch->bufsz; + + pr_debug("Adding kho metadata to DT"); + + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-fdt", + fdt_mem, fdt_len); + if (ret) + return ret; + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch", + scratch_mem, scratch_len); + +#endif /* CONFIG_KEXEC_HANDOVER */ + return ret; +} + /* * of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree * @@ -412,6 +449,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image, } } + /* Add kho metadata if this is a KHO image */ + ret = kho_add_chosen(image, fdt, chosen_node); + if (ret) + goto out; + /* add bootargs */ if (cmdline) { ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline); diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3e4c0c536a3d45791e7df591cd44df45ce8e6e40..b7b290a0bc51e4a5a946fde34412147da9088976 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -6,6 +6,7 @@ #include #include +#include struct vm_area_struct; struct mempolicy; @@ -373,4 +374,6 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, #endif void free_contig_range(unsigned long pfn, unsigned long nr_pages); +DEFINE_FREE(free_page, void *, free_page((unsigned long)_T)) + #endif /* __LINUX_GFP_H */ diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 32c78078552ccc19edd82abae84e73c104633c3c..9ca95930ceb6e9fd59051c7c89f3169035c963cc 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -364,6 +364,11 @@ struct kimage { size_t ima_buffer_size; #endif + struct { + struct kexec_segment *scratch; + phys_addr_t fdt; + } kho; + /* Core ELF header buffer */ void *elf_headers; unsigned long elf_headers_sz; diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h new file mode 100644 index 0000000000000000000000000000000000000000..25042c1d8d54afbe93451c4471e0fc80268b1331 --- /dev/null +++ b/include/linux/kexec_handover.h @@ -0,0 +1,144 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef LINUX_KEXEC_HANDOVER_H +#define LINUX_KEXEC_HANDOVER_H + +#include +#include + +struct kho_scratch { + phys_addr_t addr; + phys_addr_t size; +}; + +/* KHO Notifier index */ +enum kho_event { + KEXEC_KHO_FINALIZE = 0, + KEXEC_KHO_ABORT = 1, +}; + +struct folio; +struct notifier_block; +struct page; + +#define DECLARE_KHOSER_PTR(name, type) \ + union { \ + phys_addr_t phys; \ + type ptr; \ + } name +#define KHOSER_STORE_PTR(dest, val) \ + ({ \ + typeof(val) v = val; \ + typecheck(typeof((dest).ptr), v); \ + (dest).phys = virt_to_phys(v); \ + }) +#define KHOSER_LOAD_PTR(src) \ + ({ \ + typeof(src) s = src; \ + (typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \ + }) + +struct kho_serialization; + +struct kho_vmalloc_chunk; +struct kho_vmalloc { + DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *); + unsigned int total_pages; + unsigned short flags; + unsigned short order; +}; + +#ifdef CONFIG_KEXEC_HANDOVER +bool kho_is_enabled(void); +bool is_kho_boot(void); + +int kho_preserve_folio(struct folio *folio); +int kho_preserve_pages(struct page *page, unsigned int nr_pages); +int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); +struct folio *kho_restore_folio(phys_addr_t phys); +struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); +void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); +int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt); +int kho_retrieve_subtree(const char *name, phys_addr_t *phys); + +int register_kho_notifier(struct notifier_block *nb); +int unregister_kho_notifier(struct notifier_block *nb); + +void kho_memory_init(void); + +void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, + u64 scratch_len); +#else +static inline bool kho_is_enabled(void) +{ + return false; +} + +static inline bool is_kho_boot(void) +{ + return false; +} + +static inline int kho_preserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + +static inline int kho_preserve_pages(struct page *page, unsigned int nr_pages) +{ + return -EOPNOTSUPP; +} + +static inline int kho_preserve_vmalloc(void *ptr, + struct kho_vmalloc *preservation) +{ + return -EOPNOTSUPP; +} + +static inline struct folio *kho_restore_folio(phys_addr_t phys) +{ + return NULL; +} + +static inline struct page *kho_restore_pages(phys_addr_t phys, + unsigned int nr_pages) +{ + return NULL; +} + +static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) +{ + return NULL; +} + +static inline int kho_add_subtree(struct kho_serialization *ser, + const char *name, void *fdt) +{ + return -EOPNOTSUPP; +} + +static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +{ + return -EOPNOTSUPP; +} + +static inline int register_kho_notifier(struct notifier_block *nb) +{ + return -EOPNOTSUPP; +} + +static inline int unregister_kho_notifier(struct notifier_block *nb) +{ + return -EOPNOTSUPP; +} + +static inline void kho_memory_init(void) +{ +} + +static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, + phys_addr_t scratch_phys, u64 scratch_len) +{ +} +#endif /* CONFIG_KEXEC_HANDOVER */ + +#endif /* LINUX_KEXEC_HANDOVER_H */ diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 004243557ddfb4812c373dc9752517d1561e7505..8caf6586adfee9d68e2f67733388f20cb79cfc7f 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -40,6 +40,16 @@ extern unsigned long long max_possible_pfn; * via a driver, and never indicated in the firmware-provided memory map as * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the * kernel resource tree. + * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are + * not initialized (only for reserved regions). + * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, + * either explictitly with memblock_reserve_kern() or via memblock + * allocation APIs. All memblock allocations set this flag. + * @MEMBLOCK_KHO_SCRATCH: memory region that kexec can pass to the next + * kernel in handover mode. During early boot, we do not know about all + * memory reservations yet, so we get scratch memory from the previous + * kernel that we know is good to use. It is the only memory that + * allocations may happen from in this phase. */ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ @@ -47,6 +57,9 @@ enum memblock_flags { MEMBLOCK_MIRROR = 0x2, /* mirrored region */ MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */ MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */ + MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */ + MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */ + MEMBLOCK_KHO_SCRATCH = 0x40, /* scratch memory for kexec handover */ }; /** @@ -113,7 +126,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid, int memblock_add(phys_addr_t base, phys_addr_t size); int memblock_remove(phys_addr_t base, phys_addr_t size); int memblock_phys_free(phys_addr_t base, phys_addr_t size); -int memblock_reserve(phys_addr_t base, phys_addr_t size); +int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid, + enum memblock_flags flags); + +static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size) +{ + return __memblock_reserve(base, size, NUMA_NO_NODE, 0); +} + +static __always_inline int memblock_reserve_kern(phys_addr_t base, phys_addr_t size) +{ + return __memblock_reserve(base, size, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN); +} + #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP int memblock_physmem_add(phys_addr_t base, phys_addr_t size); #endif @@ -128,6 +153,9 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size); int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); int memblock_mark_nomap(phys_addr_t base, phys_addr_t size); int memblock_clear_nomap(phys_addr_t base, phys_addr_t size); +int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size); +int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size); +int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size); void memblock_free_all(void); void memblock_free(void *ptr, size_t size); @@ -262,11 +290,21 @@ static inline bool memblock_is_nomap(struct memblock_region *m) return m->flags & MEMBLOCK_NOMAP; } +static inline bool memblock_is_reserved_noinit(struct memblock_region *m) +{ + return m->flags & MEMBLOCK_RSRV_NOINIT; +} + static inline bool memblock_is_driver_managed(struct memblock_region *m) { return m->flags & MEMBLOCK_DRIVER_MANAGED; } +static inline bool memblock_is_kho_scratch(struct memblock_region *m) +{ + return m->flags & MEMBLOCK_KHO_SCRATCH; +} + int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn, unsigned long *end_pfn); void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, @@ -488,6 +526,7 @@ static inline __init_memblock bool memblock_bottom_up(void) phys_addr_t memblock_phys_mem_size(void); phys_addr_t memblock_reserved_size(void); +phys_addr_t memblock_reserved_kern_size(phys_addr_t limit, int nid); phys_addr_t memblock_start_of_DRAM(void); phys_addr_t memblock_end_of_DRAM(void); void memblock_enforce_memory_limit(phys_addr_t memory_limit); @@ -638,5 +677,14 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { } static inline void memtest_report_meminfo(struct seq_file *m) { } #endif +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH +void memblock_set_kho_scratch_only(void); +void memblock_clear_kho_scratch_only(void); +void memmap_init_kho_scratch_pages(void); +#else +static inline void memblock_set_kho_scratch_only(void) { } +static inline void memblock_clear_kho_scratch_only(void) { } +static inline void memmap_init_kho_scratch_pages(void) {} +#endif #endif /* _LINUX_MEMBLOCK_H */ diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 51285972a96488d0a9343ceefae9f5593265dd18..d515e93f54ffa1645c6e395307cfb6e15219afa2 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -182,6 +182,9 @@ extern int remap_vmalloc_range_partial(struct vm_area_struct *vma, extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff); +int vmap_pages_range(unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, unsigned int page_shift); + /* * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec index 37e488d5b4fc0a4117a78ad726a4ff68e7268eff..0225a680688ffa6363acbe28421e1888dc5832e3 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -93,6 +93,30 @@ config KEXEC_JUMP Jump between original kernel and kexeced kernel and invoke code in physical address mode via KEXEC +config KEXEC_HANDOVER + bool "kexec handover" + depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE + depends on !DEFERRED_STRUCT_PAGE_INIT + select MEMBLOCK_KHO_SCRATCH + select KEXEC_FILE + select DEBUG_FS + select LIBFDT + select CMA + help + Allow kexec to hand over state across kernels by generating and + passing additional metadata to the target kernel. This is useful + to keep data or state alive across the kexec. For this to work, + both source and target kernels need to have this option enabled. + +config KEXEC_HANDOVER_DEBUG + bool "Enable Kexec Handover debug checks" + depends on KEXEC_HANDOVER_DEBUGFS + help + This option enables extra sanity checks for the Kexec Handover + subsystem. Since, KHO performance is crucial in live update + scenarios and the extra code might be adding overhead it is + only optionally enabled. + config CRASH_DUMP bool "kernel crash dumps" depends on ARCH_SUPPORTS_CRASH_DUMP diff --git a/kernel/Makefile b/kernel/Makefile index ce105a5558fcfadafdc04aeefeefa6231f18008f..67b451004c07fa4be45d2421b52649365b2d4262 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -73,6 +73,8 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_KEXEC_FILE) += kexec_file.o obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o +obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o +obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o obj-$(CONFIG_COMPAT) += compat.o obj-$(CONFIG_CGROUPS) += cgroup/ diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 830344627e9f2093a28d74d13aef7984e71d8c86..4571f442848304c15d632f596f8bc6c8fe4c9bdd 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -249,6 +249,11 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd, /* IMA needs to pass the measurement list to the next kernel. */ ima_add_kexec_buffer(image); + /* If KHO is active, add its images to the list */ + ret = kho_fill_kimage(image); + if (ret) + goto out; + /* Call image load handler */ ldata = kexec_image_load_default(image); @@ -612,6 +617,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) return 0; + /* + * If KHO is active, only use KHO scratch memory. All other memory + * could potentially be handed over. + */ + ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); + if (ret <= 0) + return ret; + if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); else diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c new file mode 100644 index 0000000000000000000000000000000000000000..9a3e521fa34cb30da17a3d859881cf47a41e3955 --- /dev/null +++ b/kernel/kexec_handover.c @@ -0,0 +1,1653 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * kexec_handover.c - kexec handover metadata processing + * Copyright (C) 2023 Alexander Graf + * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport + * Copyright (C) 2025 Google LLC, Changyuan Lyu + */ + +#define pr_fmt(fmt) "KHO: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "kexec_handover_internal.h" +/* + * KHO is tightly coupled with mm init and needs access to some of mm + * internal APIs. + */ +#include "../mm/internal.h" +#include "kexec_internal.h" + +#define KHO_FDT_COMPATIBLE "kho-v1" +#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map" +#define PROP_SUB_FDT "fdt" + +#define KHO_PAGE_MAGIC 0x4b484f50U /* ASCII for 'KHOP' */ + +/* + * KHO uses page->private, which is an unsigned long, to store page metadata. + * Use it to store both the magic and the order. + */ +union kho_page_info { + unsigned long page_private; + struct { + unsigned int order; + unsigned int magic; + }; +}; + +static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); + +static bool kho_enable __ro_after_init; + +bool kho_is_enabled(void) +{ + return kho_enable; +} +EXPORT_SYMBOL_GPL(kho_is_enabled); + +static int __init kho_parse_enable(char *p) +{ + return kstrtobool(p, &kho_enable); +} +early_param("kho", kho_parse_enable); + +/* + * Keep track of memory that is to be preserved across KHO. + * + * The serializing side uses two levels of xarrays to manage chunks of per-order + * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order + * of a 8TB system would fit inside a single 4096 byte bitmap. For order 0 + * allocations each bitmap will cover 128M of address space. Thus, for 16G of + * memory at most 512K of bitmap memory will be needed for order 0. + * + * This approach is fully incremental, as the serialization progresses folios + * can continue be aggregated to the tracker. The final step, immediately prior + * to kexec would serialize the xarray information into a linked list for the + * successor kernel to parse. + */ + +#define PRESERVE_BITS (PAGE_SIZE * 8) + +struct kho_mem_phys_bits { + DECLARE_BITMAP(preserve, PRESERVE_BITS); +}; + +static_assert(sizeof(struct kho_mem_phys_bits) == PAGE_SIZE); + +struct kho_mem_phys { + /* + * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized + * to order. + */ + struct xarray phys_bits; +}; + +struct kho_mem_track { + /* Points to kho_mem_phys, each order gets its own bitmap tree */ + struct xarray orders; +}; + +struct khoser_mem_chunk; + +struct kho_serialization { + struct page *fdt; + struct list_head fdt_list; + struct dentry *sub_fdt_dir; + struct kho_mem_track track; + /* First chunk of serialized preserved memory map */ + struct khoser_mem_chunk *preserved_mem_map; +}; + +struct kho_out { + struct blocking_notifier_head chain_head; + + struct dentry *dir; + + struct mutex lock; /* protects KHO FDT finalization */ + + struct kho_serialization ser; + bool finalized; +}; + +static struct kho_out kho_out = { + .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head), + .lock = __MUTEX_INITIALIZER(kho_out.lock), + .ser = { + .fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list), + .track = { + .orders = XARRAY_INIT(kho_out.ser.track.orders, 0), + }, + }, + .finalized = false, +}; + +static void *xa_load_or_alloc(struct xarray *xa, unsigned long index) +{ + void *res = xa_load(xa, index); + + if (res) + return res; + + void *elm __free(free_page) = (void *)get_zeroed_page(GFP_KERNEL); + + if (!elm) + return ERR_PTR(-ENOMEM); + + if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), PAGE_SIZE))) + return ERR_PTR(-EINVAL); + + res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); + if (xa_is_err(res)) + return ERR_PTR(xa_err(res)); + else if (res) + return res; + + return no_free_ptr(elm); +} + +static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn, + unsigned long end_pfn) +{ + struct kho_mem_phys_bits *bits; + struct kho_mem_phys *physxa; + + while (pfn < end_pfn) { + const unsigned int order = + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); + const unsigned long pfn_high = pfn >> order; + + physxa = xa_load(&track->orders, order); + if (!physxa) + continue; + + bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS); + if (!bits) + continue; + + clear_bit(pfn_high % PRESERVE_BITS, bits->preserve); + + pfn += 1 << order; + } +} + +static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, + unsigned int order) +{ + struct kho_mem_phys_bits *bits; + struct kho_mem_phys *physxa, *new_physxa; + const unsigned long pfn_high = pfn >> order; + + might_sleep(); + + if (kho_out.finalized) + return -EBUSY; + + physxa = xa_load(&track->orders, order); + if (!physxa) { + int err; + + new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL); + if (!new_physxa) + return -ENOMEM; + + xa_init(&new_physxa->phys_bits); + physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa, + GFP_KERNEL); + + err = xa_err(physxa); + if (err || physxa) { + xa_destroy(&new_physxa->phys_bits); + kfree(new_physxa); + + if (err) + return err; + } else { + physxa = new_physxa; + } + } + + bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS); + if (IS_ERR(bits)) + return PTR_ERR(bits); + + set_bit(pfn_high % PRESERVE_BITS, bits->preserve); + + return 0; +} + +static struct page *kho_restore_page(phys_addr_t phys) +{ + struct page *page = pfn_to_online_page(PHYS_PFN(phys)); + union kho_page_info info; + unsigned int nr_pages; + + if (!page) + return NULL; + + info.page_private = page->private; + /* + * deserialize_bitmap() only sets the magic on the head page. This magic + * check also implicitly makes sure phys is order-aligned since for + * non-order-aligned phys addresses, magic will never be set. + */ + if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_ORDER)) + return NULL; + nr_pages = (1 << info.order); + + /* Clear private to make sure later restores on this page error out. */ + page->private = 0; + /* Head page gets refcount of 1. */ + set_page_count(page, 1); + + /* For higher order folios, tail pages get a page count of zero. */ + for (unsigned int i = 1; i < nr_pages; i++) + set_page_count(page + i, 0); + + if (info.order > 0) + prep_compound_page(page, info.order); + + adjust_managed_page_count(page, nr_pages); + return page; +} + +/** + * kho_restore_folio - recreates the folio from the preserved memory. + * @phys: physical address of the folio. + * + * Return: pointer to the struct folio on success, NULL on failure. + */ +struct folio *kho_restore_folio(phys_addr_t phys) +{ + struct page *page = kho_restore_page(phys); + + return page ? page_folio(page) : NULL; +} +EXPORT_SYMBOL_GPL(kho_restore_folio); + +/** + * kho_restore_pages - restore list of contiguous order 0 pages. + * @phys: physical address of the first page. + * @nr_pages: number of pages. + * + * Restore a contiguous list of order 0 pages that was preserved with + * kho_preserve_pages(). + * + * Return: 0 on success, error code on failure + */ +struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages) +{ + const unsigned long start_pfn = PHYS_PFN(phys); + const unsigned long end_pfn = start_pfn + nr_pages; + unsigned long pfn = start_pfn; + + while (pfn < end_pfn) { + const unsigned int order = + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); + struct page *page = kho_restore_page(PFN_PHYS(pfn)); + + if (!page) + return NULL; + split_page(page, order); + pfn += 1 << order; + } + + return pfn_to_page(start_pfn); +} +EXPORT_SYMBOL_GPL(kho_restore_pages); + +/* Serialize and deserialize struct kho_mem_phys across kexec + * + * Record all the bitmaps in a linked list of pages for the next kernel to + * process. Each chunk holds bitmaps of the same order and each block of bitmaps + * starts at a given physical address. This allows the bitmaps to be sparse. The + * xarray is used to store them in a tree while building up the data structure, + * but the KHO successor kernel only needs to process them once in order. + * + * All of this memory is normal kmalloc() memory and is not marked for + * preservation. The successor kernel will remain isolated to the scratch space + * until it completes processing this list. Once processed all the memory + * storing these ranges will be marked as free. + */ + +struct khoser_mem_bitmap_ptr { + phys_addr_t phys_start; + DECLARE_KHOSER_PTR(bitmap, struct kho_mem_phys_bits *); +}; + +struct khoser_mem_chunk_hdr { + DECLARE_KHOSER_PTR(next, struct khoser_mem_chunk *); + unsigned int order; + unsigned int num_elms; +}; + +#define KHOSER_BITMAP_SIZE \ + ((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) / \ + sizeof(struct khoser_mem_bitmap_ptr)) + +struct khoser_mem_chunk { + struct khoser_mem_chunk_hdr hdr; + struct khoser_mem_bitmap_ptr bitmaps[KHOSER_BITMAP_SIZE]; +}; + +static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE); + +static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, + unsigned long order) +{ + struct khoser_mem_chunk *chunk __free(free_page) = NULL; + + chunk = (void *)get_zeroed_page(GFP_KERNEL); + if (!chunk) + return ERR_PTR(-ENOMEM); + + if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) + return ERR_PTR(-EINVAL); + + chunk->hdr.order = order; + if (cur_chunk) + KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk); + return no_free_ptr(chunk); +} + +static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) +{ + struct khoser_mem_chunk *chunk = first_chunk; + + while (chunk) { + struct khoser_mem_chunk *tmp = chunk; + + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + kfree(tmp); + } +} + +static int kho_mem_serialize(struct kho_serialization *ser) +{ + struct khoser_mem_chunk *first_chunk = NULL; + struct khoser_mem_chunk *chunk = NULL; + struct kho_mem_phys *physxa; + unsigned long order; + int err = -ENOMEM; + + xa_for_each(&ser->track.orders, order, physxa) { + struct kho_mem_phys_bits *bits; + unsigned long phys; + + chunk = new_chunk(chunk, order); + if (IS_ERR(chunk)) { + err = PTR_ERR(chunk); + goto err_free; + } + + if (!first_chunk) + first_chunk = chunk; + + xa_for_each(&physxa->phys_bits, phys, bits) { + struct khoser_mem_bitmap_ptr *elm; + + if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { + chunk = new_chunk(chunk, order); + if (IS_ERR(chunk)) { + err = PTR_ERR(chunk); + goto err_free; + } + } + + elm = &chunk->bitmaps[chunk->hdr.num_elms]; + chunk->hdr.num_elms++; + elm->phys_start = (phys * PRESERVE_BITS) + << (order + PAGE_SHIFT); + KHOSER_STORE_PTR(elm->bitmap, bits); + } + } + + ser->preserved_mem_map = first_chunk; + + return 0; + +err_free: + kho_mem_ser_free(first_chunk); + return err; +} + +static void __init deserialize_bitmap(unsigned int order, + struct khoser_mem_bitmap_ptr *elm) +{ + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap); + unsigned long bit; + + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) { + int sz = 1 << (order + PAGE_SHIFT); + phys_addr_t phys = + elm->phys_start + (bit << (order + PAGE_SHIFT)); + struct page *page = pfn_to_page(PHYS_PFN(phys)); + union kho_page_info info; + + memblock_reserve(phys, sz); + memblock_reserved_mark_noinit(phys, sz); + info.magic = KHO_PAGE_MAGIC; + info.order = order; + page->private = info.page_private; + } +} + +static void __init kho_mem_deserialize(const void *fdt) +{ + struct khoser_mem_chunk *chunk; + const phys_addr_t *mem; + int len; + + mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); + + if (!mem || len != sizeof(*mem)) { + pr_err("failed to get preserved memory bitmaps\n"); + return; + } + + chunk = *mem ? phys_to_virt(*mem) : NULL; + while (chunk) { + unsigned int i; + + for (i = 0; i != chunk->hdr.num_elms; i++) + deserialize_bitmap(chunk->hdr.order, + &chunk->bitmaps[i]); + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + } +} + +/* + * With KHO enabled, memory can become fragmented because KHO regions may + * be anywhere in physical address space. The scratch regions give us a + * safe zones that we will never see KHO allocations from. This is where we + * can later safely load our new kexec images into and then use the scratch + * area for early allocations that happen before page allocator is + * initialized. + */ +static struct kho_scratch *kho_scratch; +static unsigned int kho_scratch_cnt; + +/* + * The scratch areas are scaled by default as percent of memory allocated from + * memblock. A user can override the scale with command line parameter: + * + * kho_scratch=N% + * + * It is also possible to explicitly define size for a lowmem, a global and + * per-node scratch areas: + * + * kho_scratch=l[KMG],n[KMG],m[KMG] + * + * The explicit size definition takes precedence over scale definition. + */ +static unsigned int scratch_scale __initdata = 200; +static phys_addr_t scratch_size_global __initdata; +static phys_addr_t scratch_size_pernode __initdata; +static phys_addr_t scratch_size_lowmem __initdata; + +static int __init kho_parse_scratch_size(char *p) +{ + size_t len; + unsigned long sizes[3]; + size_t total_size = 0; + int i; + + if (!p) + return -EINVAL; + + len = strlen(p); + if (!len) + return -EINVAL; + + /* parse nn% */ + if (p[len - 1] == '%') { + /* unsigned int max is 4,294,967,295, 10 chars */ + char s_scale[11] = {}; + int ret = 0; + + if (len > ARRAY_SIZE(s_scale)) + return -EINVAL; + + memcpy(s_scale, p, len - 1); + ret = kstrtouint(s_scale, 10, &scratch_scale); + if (!ret) + pr_notice("scratch scale is %d%%\n", scratch_scale); + return ret; + } + + /* parse ll[KMG],mm[KMG],nn[KMG] */ + for (i = 0; i < ARRAY_SIZE(sizes); i++) { + char *endp = p; + + if (i > 0) { + if (*p != ',') + return -EINVAL; + p += 1; + } + + sizes[i] = memparse(p, &endp); + if (endp == p) + return -EINVAL; + p = endp; + total_size += sizes[i]; + } + + if (!total_size) + return -EINVAL; + + /* The string should be fully consumed by now. */ + if (*p) + return -EINVAL; + + scratch_size_lowmem = sizes[0]; + scratch_size_global = sizes[1]; + scratch_size_pernode = sizes[2]; + scratch_scale = 0; + + pr_notice("scratch areas: lowmem: %lluMiB global: %lluMiB pernode: %lldMiB\n", + (u64)(scratch_size_lowmem >> 20), + (u64)(scratch_size_global >> 20), + (u64)(scratch_size_pernode >> 20)); + + return 0; +} +early_param("kho_scratch", kho_parse_scratch_size); + +static void __init scratch_size_update(void) +{ + phys_addr_t size; + + if (!scratch_scale) + return; + + size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, + NUMA_NO_NODE); + size = size * scratch_scale / 100; + scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES); + + size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, + NUMA_NO_NODE); + size = size * scratch_scale / 100 - scratch_size_lowmem; + scratch_size_global = round_up(size, CMA_MIN_ALIGNMENT_BYTES); +} + +static phys_addr_t __init scratch_size_node(int nid) +{ + phys_addr_t size; + + if (scratch_scale) { + size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, + nid); + size = size * scratch_scale / 100; + } else { + size = scratch_size_pernode; + } + + return round_up(size, CMA_MIN_ALIGNMENT_BYTES); +} + +/** + * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec + * + * With KHO we can preserve arbitrary pages in the system. To ensure we still + * have a large contiguous region of memory when we search the physical address + * space for target memory, let's make sure we always have a large CMA region + * active. This CMA region will only be used for movable pages which are not a + * problem for us during KHO because we can just move them somewhere else. + */ +static void __init kho_reserve_scratch(void) +{ + phys_addr_t addr, size; + int nid, i = 0; + + if (!kho_enable) + return; + + scratch_size_update(); + + /* FIXME: deal with node hot-plug/remove */ + kho_scratch_cnt = num_online_nodes() + 2; + size = kho_scratch_cnt * sizeof(*kho_scratch); + kho_scratch = memblock_alloc(size, PAGE_SIZE); + if (!kho_scratch) + goto err_disable_kho; + + /* + * reserve scratch area in low memory for lowmem allocations in the + * next kernel + */ + size = scratch_size_lowmem; + addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0, + ARCH_LOW_ADDRESS_LIMIT); + if (!addr) + goto err_free_scratch_desc; + + kho_scratch[i].addr = addr; + kho_scratch[i].size = size; + i++; + + /* reserve large contiguous area for allocations without nid */ + size = scratch_size_global; + addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES); + if (!addr) + goto err_free_scratch_areas; + + kho_scratch[i].addr = addr; + kho_scratch[i].size = size; + i++; + + for_each_online_node(nid) { + size = scratch_size_node(nid); + addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES, + 0, MEMBLOCK_ALLOC_ACCESSIBLE, + nid, true); + if (!addr) + goto err_free_scratch_areas; + + kho_scratch[i].addr = addr; + kho_scratch[i].size = size; + i++; + } + + return; + +err_free_scratch_areas: + for (i--; i >= 0; i--) + memblock_phys_free(kho_scratch[i].addr, kho_scratch[i].size); +err_free_scratch_desc: + memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch)); +err_disable_kho: + pr_warn("Failed to reserve scratch area, disabling kexec handover\n"); + kho_enable = false; +} + +struct fdt_debugfs { + struct list_head list; + struct debugfs_blob_wrapper wrapper; + struct dentry *file; +}; + +static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir, + const char *name, const void *fdt) +{ + struct fdt_debugfs *f; + struct dentry *file; + + f = kmalloc(sizeof(*f), GFP_KERNEL); + if (!f) + return -ENOMEM; + + f->wrapper.data = (void *)fdt; + f->wrapper.size = fdt_totalsize(fdt); + + file = debugfs_create_blob(name, 0400, dir, &f->wrapper); + if (IS_ERR(file)) { + kfree(f); + return PTR_ERR(file); + } + + f->file = file; + list_add(&f->list, list); + + return 0; +} + +/** + * kho_add_subtree - record the physical address of a sub FDT in KHO root tree. + * @ser: serialization control object passed by KHO notifiers. + * @name: name of the sub tree. + * @fdt: the sub tree blob. + * + * Creates a new child node named @name in KHO root FDT and records + * the physical address of @fdt. The pages of @fdt must also be preserved + * by KHO for the new kernel to retrieve it after kexec. + * + * A debugfs blob entry is also created at + * ``/sys/kernel/debug/kho/out/sub_fdts/@name``. + * + * Return: 0 on success, error code on failure + */ +int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt) +{ + int err = 0; + u64 phys = (u64)virt_to_phys(fdt); + void *root = page_to_virt(ser->fdt); + + err |= fdt_begin_node(root, name); + err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys)); + err |= fdt_end_node(root); + + if (err) + return err; + + return kho_debugfs_fdt_add(&ser->fdt_list, ser->sub_fdt_dir, name, fdt); +} +EXPORT_SYMBOL_GPL(kho_add_subtree); + +int register_kho_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&kho_out.chain_head, nb); +} +EXPORT_SYMBOL_GPL(register_kho_notifier); + +int unregister_kho_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&kho_out.chain_head, nb); +} +EXPORT_SYMBOL_GPL(unregister_kho_notifier); + +/** + * kho_preserve_folio - preserve a folio across kexec. + * @folio: folio to preserve. + * + * Instructs KHO to preserve the whole folio across kexec. The order + * will be preserved as well. + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_folio(struct folio *folio) +{ + const unsigned long pfn = folio_pfn(folio); + const unsigned int order = folio_order(folio); + struct kho_mem_track *track = &kho_out.ser.track; + + if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order))) + return -EINVAL; + + return __kho_preserve_order(track, pfn, order); +} +EXPORT_SYMBOL_GPL(kho_preserve_folio); + +/** + * kho_preserve_pages - preserve contiguous pages across kexec + * @page: first page in the list. + * @nr_pages: number of pages. + * + * Preserve a contiguous list of order 0 pages. Must be restored using + * kho_restore_pages() to ensure the pages are restored properly as order 0. + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_pages(struct page *page, unsigned int nr_pages) +{ + struct kho_mem_track *track = &kho_out.ser.track; + const unsigned long start_pfn = page_to_pfn(page); + const unsigned long end_pfn = start_pfn + nr_pages; + unsigned long pfn = start_pfn; + unsigned long failed_pfn = 0; + int err = 0; + + if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT, + nr_pages << PAGE_SHIFT))) { + return -EINVAL; + } + + while (pfn < end_pfn) { + const unsigned int order = + min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); + + err = __kho_preserve_order(track, pfn, order); + if (err) { + failed_pfn = pfn; + break; + } + + pfn += 1 << order; + } + + if (err) + __kho_unpreserve(track, start_pfn, failed_pfn); + + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_pages); + +struct kho_vmalloc_hdr { + DECLARE_KHOSER_PTR(next, struct kho_vmalloc_chunk *); +}; + +#define KHO_VMALLOC_SIZE \ + ((PAGE_SIZE - sizeof(struct kho_vmalloc_hdr)) / \ + sizeof(phys_addr_t)) + +struct kho_vmalloc_chunk { + struct kho_vmalloc_hdr hdr; + phys_addr_t phys[KHO_VMALLOC_SIZE]; +}; + +static_assert(sizeof(struct kho_vmalloc_chunk) == PAGE_SIZE); + +/* vmalloc flags KHO supports */ +#define KHO_VMALLOC_SUPPORTED_FLAGS (VM_ALLOC | VM_ALLOW_HUGE_VMAP) + +/* KHO internal flags for vmalloc preservations */ +#define KHO_VMALLOC_ALLOC 0x0001 +#define KHO_VMALLOC_HUGE_VMAP 0x0002 + +static unsigned short vmalloc_flags_to_kho(unsigned int vm_flags) +{ + unsigned short kho_flags = 0; + + if (vm_flags & VM_ALLOC) + kho_flags |= KHO_VMALLOC_ALLOC; + if (vm_flags & VM_ALLOW_HUGE_VMAP) + kho_flags |= KHO_VMALLOC_HUGE_VMAP; + + return kho_flags; +} + +static unsigned int kho_flags_to_vmalloc(unsigned short kho_flags) +{ + unsigned int vm_flags = 0; + + if (kho_flags & KHO_VMALLOC_ALLOC) + vm_flags |= VM_ALLOC; + if (kho_flags & KHO_VMALLOC_HUGE_VMAP) + vm_flags |= VM_ALLOW_HUGE_VMAP; + + return vm_flags; +} + +static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk *cur) +{ + struct kho_vmalloc_chunk *chunk; + int err; + + chunk = (struct kho_vmalloc_chunk *)get_zeroed_page(GFP_KERNEL); + if (!chunk) + return NULL; + + err = kho_preserve_pages(virt_to_page(chunk), 1); + if (err) + goto err_free; + if (cur) + KHOSER_STORE_PTR(cur->hdr.next, chunk); + return chunk; + +err_free: + free_page((unsigned long)chunk); + return NULL; +} + +static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) +{ + struct kho_mem_track *track = &kho_out.ser.track; + unsigned long pfn = PHYS_PFN(virt_to_phys(chunk)); + + __kho_unpreserve(track, pfn, pfn + 1); + + for (int i = 0; chunk->phys[i]; i++) { + pfn = PHYS_PFN(chunk->phys[i]); + __kho_unpreserve(track, pfn, pfn + 1); + } +} + +static void kho_vmalloc_free_chunks(struct kho_vmalloc *kho_vmalloc) +{ + struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(kho_vmalloc->first); + + while (chunk) { + struct kho_vmalloc_chunk *tmp = chunk; + + kho_vmalloc_unpreserve_chunk(chunk); + + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + free_page((unsigned long)tmp); + } +} + +/** + * kho_preserve_vmalloc - preserve memory allocated with vmalloc() across kexec + * @ptr: pointer to the area in vmalloc address space + * @preservation: placeholder for preservation metadata + * + * Instructs KHO to preserve the area in vmalloc address space at @ptr. The + * physical pages mapped at @ptr will be preserved and on successful return + * @preservation will hold the physical address of a structure that describes + * the preservation. + * + * NOTE: The memory allocated with vmalloc_node() variants cannot be reliably + * restored on the same node + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) +{ + struct kho_vmalloc_chunk *chunk; + struct vm_struct *vm = find_vm_area(ptr); + unsigned int order, flags, nr_contig_pages; + unsigned int idx = 0; + int err; + + if (!vm) + return -EINVAL; + + if (vm->flags & ~KHO_VMALLOC_SUPPORTED_FLAGS) + return -EOPNOTSUPP; + + flags = vmalloc_flags_to_kho(vm->flags); + + /* + * Backporting issue: get_vm_area_page_order() is in a large patch, + * this is ugly but simple, as original function is just a wrapper. + */ +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC + order = vm->page_order; +#else + order = 0; +#endif + + chunk = new_vmalloc_chunk(NULL); + if (!chunk) + return -ENOMEM; + KHOSER_STORE_PTR(preservation->first, chunk); + + nr_contig_pages = (1 << order); + for (int i = 0; i < vm->nr_pages; i += nr_contig_pages) { + phys_addr_t phys = page_to_phys(vm->pages[i]); + + err = kho_preserve_pages(vm->pages[i], nr_contig_pages); + if (err) + goto err_free; + + chunk->phys[idx++] = phys; + if (idx == ARRAY_SIZE(chunk->phys)) { + chunk = new_vmalloc_chunk(chunk); + if (!chunk) + goto err_free; + idx = 0; + } + } + + preservation->total_pages = vm->nr_pages; + preservation->flags = flags; + preservation->order = order; + + return 0; + +err_free: + kho_vmalloc_free_chunks(preservation); + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); + +/** + * kho_restore_vmalloc - recreates and populates an area in vmalloc address + * space from the preserved memory. + * @preservation: preservation metadata. + * + * Recreates an area in vmalloc address space and populates it with memory that + * was preserved using kho_preserve_vmalloc(). + * + * Return: pointer to the area in the vmalloc address space, NULL on failure. + */ +void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) +{ + struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(preservation->first); + unsigned int align, order, shift, vm_flags; + unsigned long total_pages, contig_pages; + unsigned long addr, size; + struct vm_struct *area; + struct page **pages; + unsigned int idx = 0; + int err; + + vm_flags = kho_flags_to_vmalloc(preservation->flags); + if (vm_flags & ~KHO_VMALLOC_SUPPORTED_FLAGS) + return NULL; + + total_pages = preservation->total_pages; + pages = kvmalloc_array(total_pages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return NULL; + order = preservation->order; + contig_pages = (1 << order); + shift = PAGE_SHIFT + order; + align = 1 << shift; + + while (chunk) { + struct page *page; + + for (int i = 0; chunk->phys[i]; i++) { + phys_addr_t phys = chunk->phys[i]; + + if (idx + contig_pages > total_pages) + goto err_free_pages_array; + + page = kho_restore_pages(phys, contig_pages); + if (!page) + goto err_free_pages_array; + + for (int j = 0; j < contig_pages; j++) + pages[idx++] = page; + + phys += contig_pages * PAGE_SIZE; + } + + page = kho_restore_pages(virt_to_phys(chunk), 1); + if (!page) + goto err_free_pages_array; + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + __free_page(page); + } + + if (idx != total_pages) + goto err_free_pages_array; + + area = __get_vm_area_node(total_pages * PAGE_SIZE, align, shift, + vm_flags, VMALLOC_START, VMALLOC_END, + NUMA_NO_NODE, GFP_KERNEL, + __builtin_return_address(0)); + if (!area) + goto err_free_pages_array; + + addr = (unsigned long)area->addr; + size = get_vm_area_size(area); + err = vmap_pages_range(addr, addr + size, PAGE_KERNEL, pages, shift); + if (err) + goto err_free_vm_area; + + area->nr_pages = total_pages; + area->pages = pages; + + return area->addr; + +err_free_vm_area: + free_vm_area(area); +err_free_pages_array: + kvfree(pages); + return NULL; +} +EXPORT_SYMBOL_GPL(kho_restore_vmalloc); + +/* Handling for debug/kho/out */ + +static struct dentry *debugfs_root; + +static int kho_out_update_debugfs_fdt(void) +{ + int err = 0; + struct fdt_debugfs *ff, *tmp; + + if (kho_out.finalized) { + err = kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir, + "fdt", page_to_virt(kho_out.ser.fdt)); + } else { + list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) { + debugfs_remove(ff->file); + list_del(&ff->list); + kfree(ff); + } + } + + return err; +} + +static int kho_abort(void) +{ + int err; + unsigned long order; + struct kho_mem_phys *physxa; + + xa_for_each(&kho_out.ser.track.orders, order, physxa) { + struct kho_mem_phys_bits *bits; + unsigned long phys; + + xa_for_each(&physxa->phys_bits, phys, bits) + kfree(bits); + + xa_destroy(&physxa->phys_bits); + kfree(physxa); + } + xa_destroy(&kho_out.ser.track.orders); + + if (kho_out.ser.preserved_mem_map) { + kho_mem_ser_free(kho_out.ser.preserved_mem_map); + kho_out.ser.preserved_mem_map = NULL; + } + + err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT, + NULL); + err = notifier_to_errno(err); + + if (err) + pr_err("Failed to abort KHO finalization: %d\n", err); + + return err; +} + +static int kho_finalize(void) +{ + int err = 0; + u64 *preserved_mem_map; + void *fdt = page_to_virt(kho_out.ser.fdt); + + err |= fdt_create(fdt, PAGE_SIZE); + err |= fdt_finish_reservemap(fdt); + err |= fdt_begin_node(fdt, ""); + err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE); + /** + * Reserve the preserved-memory-map property in the root FDT, so + * that all property definitions will precede subnodes created by + * KHO callers. + */ + err |= fdt_property_placeholder(fdt, PROP_PRESERVED_MEMORY_MAP, + sizeof(*preserved_mem_map), + (void **)&preserved_mem_map); + if (err) + goto abort; + + err = kho_preserve_folio(page_folio(kho_out.ser.fdt)); + if (err) + goto abort; + + err = blocking_notifier_call_chain(&kho_out.chain_head, + KEXEC_KHO_FINALIZE, &kho_out.ser); + err = notifier_to_errno(err); + if (err) + goto abort; + + err = kho_mem_serialize(&kho_out.ser); + if (err) + goto abort; + + *preserved_mem_map = (u64)virt_to_phys(kho_out.ser.preserved_mem_map); + + err |= fdt_end_node(fdt); + err |= fdt_finish(fdt); + +abort: + if (err) { + pr_err("Failed to convert KHO state tree: %d\n", err); + kho_abort(); + } + + return err; +} + +static int kho_out_finalize_get(void *data, u64 *val) +{ + mutex_lock(&kho_out.lock); + *val = kho_out.finalized; + mutex_unlock(&kho_out.lock); + + return 0; +} + +static int kho_out_finalize_set(void *data, u64 _val) +{ + int ret = 0; + bool val = !!_val; + + mutex_lock(&kho_out.lock); + + if (val == kho_out.finalized) { + if (kho_out.finalized) + ret = -EEXIST; + else + ret = -ENOENT; + goto unlock; + } + + if (val) + ret = kho_finalize(); + else + ret = kho_abort(); + + if (ret) + goto unlock; + + kho_out.finalized = val; + ret = kho_out_update_debugfs_fdt(); + +unlock: + mutex_unlock(&kho_out.lock); + return ret; +} + +DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get, + kho_out_finalize_set, "%llu\n"); + +static int scratch_phys_show(struct seq_file *m, void *v) +{ + for (int i = 0; i < kho_scratch_cnt; i++) + seq_printf(m, "0x%llx\n", kho_scratch[i].addr); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(scratch_phys); + +static int scratch_len_show(struct seq_file *m, void *v) +{ + for (int i = 0; i < kho_scratch_cnt; i++) + seq_printf(m, "0x%llx\n", kho_scratch[i].size); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(scratch_len); + +static __init int kho_out_debugfs_init(void) +{ + struct dentry *dir, *f, *sub_fdt_dir; + + dir = debugfs_create_dir("out", debugfs_root); + if (IS_ERR(dir)) + return -ENOMEM; + + sub_fdt_dir = debugfs_create_dir("sub_fdts", dir); + if (IS_ERR(sub_fdt_dir)) + goto err_rmdir; + + f = debugfs_create_file("scratch_phys", 0400, dir, NULL, + &scratch_phys_fops); + if (IS_ERR(f)) + goto err_rmdir; + + f = debugfs_create_file("scratch_len", 0400, dir, NULL, + &scratch_len_fops); + if (IS_ERR(f)) + goto err_rmdir; + + f = debugfs_create_file("finalize", 0600, dir, NULL, + &fops_kho_out_finalize); + if (IS_ERR(f)) + goto err_rmdir; + + kho_out.dir = dir; + kho_out.ser.sub_fdt_dir = sub_fdt_dir; + return 0; + +err_rmdir: + debugfs_remove_recursive(dir); + return -ENOENT; +} + +struct kho_in { + struct dentry *dir; + phys_addr_t fdt_phys; + phys_addr_t scratch_phys; + struct list_head fdt_list; +}; + +static struct kho_in kho_in = { + .fdt_list = LIST_HEAD_INIT(kho_in.fdt_list), +}; + +static const void *kho_get_fdt(void) +{ + return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL; +} + +/** + * is_kho_boot - check if current kernel was booted via KHO-enabled + * kexec + * + * This function checks if the current kernel was loaded through a kexec + * operation with KHO enabled, by verifying that a valid KHO FDT + * was passed. + * + * Note: This function returns reliable results only after + * kho_populate() has been called during early boot. Before that, + * it may return false even if KHO data is present. + * + * Return: true if booted via KHO-enabled kexec, false otherwise + */ +bool is_kho_boot(void) +{ + return !!kho_get_fdt(); +} +EXPORT_SYMBOL_GPL(is_kho_boot); + +/** + * kho_retrieve_subtree - retrieve a preserved sub FDT by its name. + * @name: the name of the sub FDT passed to kho_add_subtree(). + * @phys: if found, the physical address of the sub FDT is stored in @phys. + * + * Retrieve a preserved sub FDT named @name and store its physical + * address in @phys. + * + * Return: 0 on success, error code on failure + */ +int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +{ + const void *fdt = kho_get_fdt(); + const u64 *val; + int offset, len; + + if (!fdt) + return -ENOENT; + + if (!phys) + return -EINVAL; + + offset = fdt_subnode_offset(fdt, 0, name); + if (offset < 0) + return -ENOENT; + + val = fdt_getprop(fdt, offset, PROP_SUB_FDT, &len); + if (!val || len != sizeof(*val)) + return -EINVAL; + + *phys = (phys_addr_t)*val; + + return 0; +} +EXPORT_SYMBOL_GPL(kho_retrieve_subtree); + +/* Handling for debugfs/kho/in */ + +static __init int kho_in_debugfs_init(const void *fdt) +{ + struct dentry *sub_fdt_dir; + int err, child; + + kho_in.dir = debugfs_create_dir("in", debugfs_root); + if (IS_ERR(kho_in.dir)) + return PTR_ERR(kho_in.dir); + + sub_fdt_dir = debugfs_create_dir("sub_fdts", kho_in.dir); + if (IS_ERR(sub_fdt_dir)) { + err = PTR_ERR(sub_fdt_dir); + goto err_rmdir; + } + + err = kho_debugfs_fdt_add(&kho_in.fdt_list, kho_in.dir, "fdt", fdt); + if (err) + goto err_rmdir; + + fdt_for_each_subnode(child, fdt, 0) { + int len = 0; + const char *name = fdt_get_name(fdt, child, NULL); + const u64 *fdt_phys; + + fdt_phys = fdt_getprop(fdt, child, "fdt", &len); + if (!fdt_phys) + continue; + if (len != sizeof(*fdt_phys)) { + pr_warn("node `%s`'s prop `fdt` has invalid length: %d\n", + name, len); + continue; + } + err = kho_debugfs_fdt_add(&kho_in.fdt_list, sub_fdt_dir, name, + phys_to_virt(*fdt_phys)); + if (err) { + pr_warn("failed to add fdt `%s` to debugfs: %d\n", name, + err); + continue; + } + } + + return 0; + +err_rmdir: + debugfs_remove_recursive(kho_in.dir); + return err; +} + +static __init int kho_init(void) +{ + int err = 0; + const void *fdt = kho_get_fdt(); + + if (!kho_enable) + return 0; + + kho_out.ser.fdt = alloc_page(GFP_KERNEL); + if (!kho_out.ser.fdt) { + err = -ENOMEM; + goto err_free_scratch; + } + + debugfs_root = debugfs_create_dir("kho", NULL); + if (IS_ERR(debugfs_root)) { + err = -ENOENT; + goto err_free_fdt; + } + + err = kho_out_debugfs_init(); + if (err) + goto err_free_fdt; + + if (fdt) { + err = kho_in_debugfs_init(fdt); + /* + * Failure to create /sys/kernel/debug/kho/in does not prevent + * reviving state from KHO and setting up KHO for the next + * kexec. + */ + if (err) + pr_err("failed exposing handover FDT in debugfs: %d\n", + err); + + return 0; + } + + for (int i = 0; i < kho_scratch_cnt; i++) { + unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr); + unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; + unsigned long pfn; + + for (pfn = base_pfn; pfn < base_pfn + count; + pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + } + + return 0; + +err_free_fdt: + put_page(kho_out.ser.fdt); + kho_out.ser.fdt = NULL; +err_free_scratch: + for (int i = 0; i < kho_scratch_cnt; i++) { + void *start = __va(kho_scratch[i].addr); + void *end = start + kho_scratch[i].size; + + free_reserved_area(start, end, -1, ""); + } + kho_enable = false; + return err; +} +late_initcall(kho_init); + +static void __init kho_release_scratch(void) +{ + phys_addr_t start, end; + u64 i; + + memmap_init_kho_scratch_pages(); + + /* + * Mark scratch mem as CMA before we return it. That way we + * ensure that no kernel allocations happen on it. That means + * we can reuse it as scratch memory again later. + */ + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, + MEMBLOCK_KHO_SCRATCH, &start, &end, NULL) { + ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start)); + ulong end_pfn = pageblock_align(PFN_UP(end)); + ulong pfn; + + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) + set_pageblock_migratetype(pfn_to_page(pfn), + MIGRATE_CMA); + } +} + +void __init kho_memory_init(void) +{ + struct folio *folio; + + if (kho_in.scratch_phys) { + kho_scratch = phys_to_virt(kho_in.scratch_phys); + kho_release_scratch(); + + kho_mem_deserialize(kho_get_fdt()); + folio = kho_restore_folio(kho_in.fdt_phys); + if (!folio) + pr_warn("failed to restore folio for KHO fdt\n"); + } else { + kho_reserve_scratch(); + } +} + +void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, + phys_addr_t scratch_phys, u64 scratch_len) +{ + void *fdt = NULL; + struct kho_scratch *scratch = NULL; + int err = 0; + unsigned int scratch_cnt = scratch_len / sizeof(*kho_scratch); + + /* Validate the input FDT */ + fdt = early_memremap(fdt_phys, fdt_len); + if (!fdt) { + pr_warn("setup: failed to memremap FDT (0x%llx)\n", fdt_phys); + err = -EFAULT; + goto out; + } + err = fdt_check_header(fdt); + if (err) { + pr_warn("setup: handover FDT (0x%llx) is invalid: %d\n", + fdt_phys, err); + err = -EINVAL; + goto out; + } + err = fdt_node_check_compatible(fdt, 0, KHO_FDT_COMPATIBLE); + if (err) { + pr_warn("setup: handover FDT (0x%llx) is incompatible with '%s': %d\n", + fdt_phys, KHO_FDT_COMPATIBLE, err); + err = -EINVAL; + goto out; + } + + scratch = early_memremap(scratch_phys, scratch_len); + if (!scratch) { + pr_warn("setup: failed to memremap scratch (phys=0x%llx, len=%lld)\n", + scratch_phys, scratch_len); + err = -EFAULT; + goto out; + } + + /* + * We pass a safe contiguous blocks of memory to use for early boot + * purporses from the previous kernel so that we can resize the + * memblock array as needed. + */ + for (int i = 0; i < scratch_cnt; i++) { + struct kho_scratch *area = &scratch[i]; + u64 size = area->size; + + memblock_add(area->addr, size); + err = memblock_mark_kho_scratch(area->addr, size); + if (WARN_ON(err)) { + pr_warn("failed to mark the scratch region 0x%pa+0x%pa: %d", + &area->addr, &size, err); + goto out; + } + pr_debug("Marked 0x%pa+0x%pa as scratch", &area->addr, &size); + } + + memblock_reserve(scratch_phys, scratch_len); + + /* + * Now that we have a viable region of scratch memory, let's tell + * the memblocks allocator to only use that for any allocations. + * That way we ensure that nothing scribbles over in use data while + * we initialize the page tables which we will need to ingest all + * memory reservations from the previous kernel. + */ + memblock_set_kho_scratch_only(); + + kho_in.fdt_phys = fdt_phys; + kho_in.scratch_phys = scratch_phys; + kho_scratch_cnt = scratch_cnt; + pr_info("found kexec handover data. Will skip init for some devices\n"); + +out: + if (fdt) + early_memunmap(fdt, fdt_len); + if (scratch) + early_memunmap(scratch, scratch_len); + if (err) + pr_warn("disabling KHO revival: %d\n", err); +} + +/* Helper functions for kexec_file_load */ + +int kho_fill_kimage(struct kimage *image) +{ + ssize_t scratch_size; + int err = 0; + struct kexec_buf scratch; + + if (!kho_out.finalized) + return 0; + + image->kho.fdt = page_to_phys(kho_out.ser.fdt); + + scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt; + scratch = (struct kexec_buf){ + .image = image, + .buffer = kho_scratch, + .bufsz = scratch_size, + .mem = KEXEC_BUF_MEM_UNKNOWN, + .memsz = scratch_size, + .buf_align = SZ_64K, /* Makes it easier to map */ + .buf_max = ULONG_MAX, + .top_down = true, + }; + err = kexec_add_buffer(&scratch); + if (err) + return err; + image->kho.scratch = &image->segment[image->nr_segments - 1]; + + return 0; +} + +static int kho_walk_scratch(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + int ret = 0; + int i; + + for (i = 0; i < kho_scratch_cnt; i++) { + struct resource res = { + .start = kho_scratch[i].addr, + .end = kho_scratch[i].addr + kho_scratch[i].size - 1, + }; + + /* Try to fit the kimage into our KHO scratch region */ + ret = func(&res, kbuf); + if (ret) + break; + } + + return ret; +} + +int kho_locate_mem_hole(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + int ret; + + if (!kho_enable || kbuf->image->type == KEXEC_TYPE_CRASH) + return 1; + + ret = kho_walk_scratch(kbuf, func); + + return ret == 1 ? 0 : -EADDRNOTAVAIL; +} diff --git a/kernel/kexec_handover_debug.c b/kernel/kexec_handover_debug.c new file mode 100644 index 0000000000000000000000000000000000000000..6efb696f542612fd100ffcac0143a00c3cc6d0dd --- /dev/null +++ b/kernel/kexec_handover_debug.c @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * kexec_handover_debug.c - kexec handover optional debug functionality + * Copyright (C) 2025 Google LLC, Pasha Tatashin + */ + +#define pr_fmt(fmt) "KHO: " fmt + +#include "kexec_handover_internal.h" + +bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{ + phys_addr_t scratch_start, scratch_end; + unsigned int i; + + for (i = 0; i < kho_scratch_cnt; i++) { + scratch_start = kho_scratch[i].addr; + scratch_end = kho_scratch[i].addr + kho_scratch[i].size; + + if (phys < scratch_end && (phys + size) > scratch_start) + return true; + } + + return false; +} diff --git a/kernel/kexec_handover_internal.h b/kernel/kexec_handover_internal.h new file mode 100644 index 0000000000000000000000000000000000000000..05e9720ba7b995ae58dc098a6189b3ae2e1c190b --- /dev/null +++ b/kernel/kexec_handover_internal.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef LINUX_KEXEC_HANDOVER_INTERNAL_H +#define LINUX_KEXEC_HANDOVER_INTERNAL_H + +#include + +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size); +#else +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{ + return false; +} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */ + +#endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */ diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h index 74da1409cd14b5db50c60832eb7445a747ae0447..b437a5ac38c888bd69cf6eb0e218689960ced0b1 100644 --- a/kernel/kexec_internal.h +++ b/kernel/kexec_internal.h @@ -36,4 +36,20 @@ extern size_t kexec_purgatory_size; #else /* CONFIG_KEXEC_FILE */ static inline void kimage_file_post_load_cleanup(struct kimage *image) { } #endif /* CONFIG_KEXEC_FILE */ + +struct kexec_buf; + +#ifdef CONFIG_KEXEC_HANDOVER +int kho_locate_mem_hole(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)); +int kho_fill_kimage(struct kimage *image); +#else +static inline int kho_locate_mem_hole(struct kexec_buf *kbuf, + int (*func)(struct resource *, void *)) +{ + return 1; +} + +static inline int kho_fill_kimage(struct kimage *image) { return 0; } +#endif /* CONFIG_KEXEC_HANDOVER */ #endif /* LINUX_KEXEC_INTERNAL_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 7f159aeb2bfeadcb2e07da13719471a35b1816f5..8e2cd7fd26dff100f4b6772abd56645cb7f4bc51 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -529,6 +529,10 @@ config HAVE_FAST_GUP depends on MMU bool +# Enable memblock support for scratch memory which is needed for kexec handover +config MEMBLOCK_KHO_SCRATCH + bool + # Don't discard allocated memory used to track "memory" and "reserved" memblocks # after early boot, so it can still be used to test for validity of memory. # Also, memblocks are updated with memory hot(un)plug. diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6a8e42c02108dbcb9d9b81d3a70c76b36f005ecf..f35f60383b3c407b20eb47c40258229374c9ccae 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3252,6 +3252,16 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid) } found: + + /* + * Only initialize the head struct page in memmap_init_reserved_pages, + * rest of the struct pages will be initialized by the HugeTLB + * subsystem itself. + * The head struct page is used to get folio information by the HugeTLB + * subsystem like zone id and node id. + */ + memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE), + huge_page_size(h) - PAGE_SIZE); /* Put them into a private list first because mem_map is not up yet */ INIT_LIST_HEAD(&m->list); list_add(&m->list, &huge_boot_pages); @@ -3259,6 +3269,43 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid) return 1; } +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */ +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, + unsigned long start_page_number, + unsigned long end_page_number) +{ + enum zone_type zone = zone_idx(folio_zone(folio)); + int nid = folio_nid(folio); + unsigned long head_pfn = folio_pfn(folio); + unsigned long pfn, end_pfn = head_pfn + end_page_number; + int ret; + + for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { + struct page *page = pfn_to_page(pfn); + + __init_single_page(page, pfn, zone, nid); + prep_compound_tail((struct page *)folio, pfn - head_pfn); + ret = page_ref_freeze(page, 1); + VM_BUG_ON(!ret); + } +} + +static void __init hugetlb_folio_init_vmemmap(struct folio *folio, + struct hstate *h, + unsigned long nr_pages) +{ + int ret; + + /* Prepare folio head */ + __folio_clear_reserved(folio); + __folio_set_head(folio); + ret = page_ref_freeze(&folio->page, 1); + VM_BUG_ON(!ret); + /* Initialize the necessary tail struct pages */ + hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); + prep_compound_head((struct page *)folio, huge_page_order(h)); +} + /* * Put bootmem huge pages into the standard lists after mem_map is up. * Note: This only applies to gigantic (order > MAX_ORDER) pages. @@ -3269,19 +3316,21 @@ static void __init gather_bootmem_prealloc(void) list_for_each_entry(m, &huge_boot_pages, list) { struct page *page = virt_to_page(m); - struct folio *folio = page_folio(page); + struct folio *folio = (void *)page; struct hstate *h = m->hstate; VM_BUG_ON(!hstate_is_gigantic(h)); WARN_ON(folio_ref_count(folio) != 1); - if (prep_compound_gigantic_folio(folio, huge_page_order(h))) { - WARN_ON(folio_test_reserved(folio)); - prep_new_hugetlb_folio(h, folio, folio_nid(folio)); - free_huge_folio(folio); /* add to the hugepage allocator */ - } else { - /* VERY unlikely inflated ref count on a tail page */ - free_gigantic_folio(folio, huge_page_order(h)); - } + + hugetlb_folio_init_vmemmap(folio, h, + HUGETLB_VMEMMAP_RESERVE_PAGES); + prep_new_hugetlb_folio(h, folio, folio_nid(folio)); + /* If HVO fails, initialize all tail struct pages */ + if (!HPageVmemmapOptimized(&folio->page)) + hugetlb_folio_init_tail_vmemmap(folio, + HUGETLB_VMEMMAP_RESERVE_PAGES, + pages_per_huge_page(h)); + free_huge_folio(folio); /* add to the hugepage allocator */ /* * We need to restore the 'stolen' pages to totalram_pages @@ -3292,6 +3341,7 @@ static void __init gather_bootmem_prealloc(void) cond_resched(); } } + static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) { unsigned long i; diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 446e9fc723ecc07ea73776e43a65ec41e94ee53c..dcac302aa368a30b7c0b03773570cf8d7415910f 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -318,9 +318,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, .reuse_addr = reuse, .vmemmap_pages = &vmemmap_pages, }; - int nid = page_to_nid((struct page *)start); - gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | - __GFP_NOWARN; + int nid = page_to_nid((struct page *)reuse); + gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; /* * Allocate a new head vmemmap page to avoid breaking a contiguous @@ -586,7 +585,7 @@ static int __init hugetlb_vmemmap_init(void) const struct hstate *h; /* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */ - BUILD_BUG_ON(__NR_USED_SUBPAGE * sizeof(struct page) > HUGETLB_VMEMMAP_RESERVE_SIZE); + BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES); for_each_hstate(h) { if (hugetlb_vmemmap_optimizable(h)) { diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h index 25bd0e002431402b4eb44099500226c2d00b2ff7..4573899855d7066c259ae3b9db0acf0ac1042934 100644 --- a/mm/hugetlb_vmemmap.h +++ b/mm/hugetlb_vmemmap.h @@ -10,15 +10,16 @@ #define _LINUX_HUGETLB_VMEMMAP_H #include -#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP -int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head); -void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head); - /* * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See * Documentation/vm/vmemmap_dedup.rst. */ #define HUGETLB_VMEMMAP_RESERVE_SIZE PAGE_SIZE +#define HUGETLB_VMEMMAP_RESERVE_PAGES (HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page)) + +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP +int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head); +void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head); static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h) { diff --git a/mm/internal.h b/mm/internal.h index 2a408bafe167e0491b90a1b91dcc16197574ec7f..4e152bebca4d936ead8accd7b896a72c5a7b9298 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1048,6 +1048,8 @@ DECLARE_STATIC_KEY_TRUE(deferred_pages); bool __init deferred_grow_zone(struct zone *zone, unsigned int order); #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ +void init_reserved_page(unsigned long pfn, int nid); + enum mminit_level { MMINIT_WARNING, MMINIT_VERIFY, @@ -1251,6 +1253,12 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma, void free_zone_device_page(struct page *page); int migrate_device_coherent_page(struct page *page); +struct vm_struct *__get_vm_area_node(unsigned long size, + unsigned long align, unsigned long shift, + unsigned long vm_flags, unsigned long start, + unsigned long end, int node, gfp_t gfp_mask, + const void *caller); + /* * mm/gup.c */ @@ -1487,6 +1495,9 @@ struct vma_prepare { struct vm_area_struct *remove2; }; +void __meminit __init_single_page(struct page *page, unsigned long pfn, + unsigned long zone, int nid); + /* shrinker related functions */ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority); diff --git a/mm/memblock.c b/mm/memblock.c index 0695284232f3c617022d4694059c7532e1cf0d48..8243bef0ba0984a2fdb3a0509ad22b194feec6ee 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -106,6 +106,13 @@ unsigned long min_low_pfn; unsigned long max_pfn; unsigned long long max_possible_pfn; +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH +/* When set to true, only allocate from MEMBLOCK_KHO_SCRATCH ranges */ +static bool kho_scratch_only; +#else +#define kho_scratch_only false +#endif + static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock; static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock; #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP @@ -168,6 +175,10 @@ bool __init_memblock memblock_has_mirror(void) static enum memblock_flags __init_memblock choose_memblock_flags(void) { + /* skip non-scratch memory for kho early boot allocations */ + if (kho_scratch_only) + return MEMBLOCK_KHO_SCRATCH; + return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE; } @@ -502,7 +513,7 @@ static int __init_memblock memblock_double_array(struct memblock_type *type, * needn't do it */ if (!use_slab) - BUG_ON(memblock_reserve(addr, new_alloc_size)); + BUG_ON(memblock_reserve_kern(addr, new_alloc_size)); /* Update slab flag */ *in_slab = use_slab; @@ -651,7 +662,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type, #ifdef CONFIG_NUMA WARN_ON(nid != memblock_get_region_node(rgn)); #endif - WARN_ON(flags != rgn->flags); + WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags); nr_new++; if (insert) { if (start_rgn == -1) @@ -910,14 +921,15 @@ int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size) return memblock_remove_range(&memblock.reserved, base, size); } -int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) +int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size, + int nid, enum memblock_flags flags) { phys_addr_t end = base + size - 1; - memblock_dbg("%s: [%pa-%pa] %pS\n", __func__, - &base, &end, (void *)_RET_IP_); + memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__, + &base, &end, nid, flags, (void *)_RET_IP_); - return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0); + return memblock_add_range(&memblock.reserved, base, size, nid, flags); } #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP @@ -932,8 +944,43 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size) } #endif +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH +__init void memblock_set_kho_scratch_only(void) +{ + kho_scratch_only = true; +} + +__init void memblock_clear_kho_scratch_only(void) +{ + kho_scratch_only = false; +} + +__init void memmap_init_kho_scratch_pages(void) +{ + phys_addr_t start, end; + unsigned long pfn; + int nid; + u64 i; + + if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) + return; + + /* + * Initialize struct pages for free scratch memory. + * The struct pages for reserved scratch memory will be set up in + * reserve_bootmem_region() + */ + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, + MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) { + for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++) + init_reserved_page(pfn, nid); + } +} +#endif + /** * memblock_setclr_flag - set or clear flag for a memory region + * @type: memblock type to set/clear flag for * @base: base address of the region * @size: size of the region * @set: set or clear the flag @@ -943,10 +990,9 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size) * * Return: 0 on success, -errno on failure. */ -static int __init_memblock memblock_setclr_flag(phys_addr_t base, - phys_addr_t size, int set, int flag) +static int __init_memblock memblock_setclr_flag(struct memblock_type *type, + phys_addr_t base, phys_addr_t size, int set, int flag) { - struct memblock_type *type = &memblock.memory; int i, ret, start_rgn, end_rgn; ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn); @@ -975,7 +1021,7 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base, */ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size) { - return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG); + return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_HOTPLUG); } /** @@ -987,7 +1033,7 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size) */ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) { - return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG); + return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_HOTPLUG); } /** @@ -1004,7 +1050,7 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) system_has_some_mirror = true; - return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR); + return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_MIRROR); } /** @@ -1024,7 +1070,7 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) */ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size) { - return memblock_setclr_flag(base, size, 1, MEMBLOCK_NOMAP); + return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_NOMAP); } /** @@ -1036,7 +1082,55 @@ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size) */ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) { - return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP); + return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_NOMAP); +} + +/** + * memblock_reserved_mark_noinit - Mark a reserved memory region with flag + * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized + * for this region. + * @base: the base phys addr of the region + * @size: the size of the region + * + * struct pages will not be initialized for reserved memory regions marked with + * %MEMBLOCK_RSRV_NOINIT. + * + * Return: 0 on success, -errno on failure. + */ +int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.reserved, base, size, 1, + MEMBLOCK_RSRV_NOINIT); +} + +/** + * memblock_mark_kho_scratch - Mark a memory region as MEMBLOCK_KHO_SCRATCH. + * @base: the base phys addr of the region + * @size: the size of the region + * + * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH will be considered + * for allocations during early boot with kexec handover. + * + * Return: 0 on success, -errno on failure. + */ +__init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.memory, base, size, 1, + MEMBLOCK_KHO_SCRATCH); +} + +/** + * memblock_clear_kho_scratch - Clear MEMBLOCK_KHO_SCRATCH flag for a + * specified region. + * @base: the base phys addr of the region + * @size: the size of the region + * + * Return: 0 on success, -errno on failure. + */ +__init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size) +{ + return memblock_setclr_flag(&memblock.memory, base, size, 0, + MEMBLOCK_KHO_SCRATCH); } static bool should_skip_region(struct memblock_type *type, @@ -1070,6 +1164,13 @@ static bool should_skip_region(struct memblock_type *type, if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m)) return true; + /* + * In early alloc during kexec handover, we can only consider + * MEMBLOCK_KHO_SCRATCH regions for the allocations + */ + if ((flags & MEMBLOCK_KHO_SCRATCH) && !memblock_is_kho_scratch(m)) + return true; + return false; } @@ -1439,14 +1540,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, again: found = memblock_find_in_range_node(size, align, start, end, nid, flags); - if (found && !memblock_reserve(found, size)) + if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN)) goto done; if (numa_valid_node(nid) && !exact_nid) { found = memblock_find_in_range_node(size, align, start, end, NUMA_NO_NODE, flags); - if (found && !memblock_reserve(found, size)) + if (found && !memblock_reserve_kern(found, size)) goto done; } @@ -1718,6 +1819,28 @@ phys_addr_t __init_memblock memblock_reserved_size(void) return memblock.reserved.total_size; } +phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int nid) +{ + struct memblock_region *r; + phys_addr_t total = 0; + + for_each_reserved_mem_region(r) { + phys_addr_t size = r->size; + + if (r->base > limit) + break; + + if (r->base + r->size > limit) + size = limit - r->base; + + if (nid == memblock_get_region_node(r) || !numa_valid_node(nid)) + if (r->flags & MEMBLOCK_RSRV_KERN) + total += size; + } + + return total; +} + /* lowest address */ phys_addr_t __init_memblock memblock_start_of_DRAM(void) { @@ -2155,16 +2278,21 @@ static void __init memmap_init_reserved_pages(void) if (max_reserved != memblock.reserved.max) goto repeat; - /* initialize struct pages for the reserved regions */ + /* + * initialize struct pages for reserved regions that don't have + * the MEMBLOCK_RSRV_NOINIT flag set + */ for_each_reserved_mem_region(region) { - nid = memblock_get_region_node(region); - start = region->base; - end = start + region->size; + if (!memblock_is_reserved_noinit(region)) { + nid = memblock_get_region_node(region); + start = region->base; + end = start + region->size; - if (!numa_valid_node(nid)) - nid = early_pfn_to_nid(PFN_DOWN(start)); + if (!numa_valid_node(nid)) + nid = early_pfn_to_nid(PFN_DOWN(start)); - reserve_bootmem_region(start, end, nid); + reserve_bootmem_region(start, end, nid); + } } } @@ -2223,6 +2351,7 @@ void __init memblock_free_all(void) free_unused_memmap(); reset_all_zones_managed_pages(); + memblock_clear_kho_scratch_only(); pages = free_low_memory_core_early(); totalram_pages_add(pages); } @@ -2233,6 +2362,9 @@ static const char * const flagname[] = { [ilog2(MEMBLOCK_MIRROR)] = "MIRROR", [ilog2(MEMBLOCK_NOMAP)] = "NOMAP", [ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG", + [ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT", + [ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN", + [ilog2(MEMBLOCK_KHO_SCRATCH)] = "KHO_SCRATCH", }; static int memblock_debug_show(struct seq_file *m, void *private) diff --git a/mm/mm_init.c b/mm/mm_init.c index 27ce281f72d77f87508c386ab70d17433b0dd172..bda0f522f869cb865b57d935e53c4d7ffe2e1068 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "internal.h" #include "slab.h" #include "shuffle.h" @@ -569,7 +570,7 @@ static void __init find_zone_movable_pfns_for_nodes(void) node_states[N_MEMORY] = saved_node_state; } -static void __meminit __init_single_page(struct page *page, unsigned long pfn, +void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { mm_zero_struct_page(page); @@ -711,7 +712,7 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn) return false; } -static void __meminit init_reserved_page(unsigned long pfn, int nid) +static void __meminit __init_reserved_page(unsigned long pfn, int nid) { pg_data_t *pgdat; int zid; @@ -742,11 +743,16 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn) return false; } -static inline void init_reserved_page(unsigned long pfn, int nid) +static inline void __init_reserved_page(unsigned long pfn, int nid) { } #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ +void __meminit init_reserved_page(unsigned long pfn, int nid) +{ + __init_reserved_page(pfn, nid); +} + /* * Initialised pages do not have PageReserved set. This function is * called for each range allocated by the bootmem allocator and @@ -763,7 +769,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start, if (pfn_valid(start_pfn)) { struct page *page = pfn_to_page(start_pfn); - init_reserved_page(start_pfn, nid); + __init_reserved_page(start_pfn, nid); /* Avoid false-positive PageTail() */ INIT_LIST_HEAD(&page->lru); @@ -2789,6 +2795,12 @@ void __init mm_core_init(void) report_meminit(); kmsan_init_shadow(); stack_depot_early_init(); + /* + * KHO memory setup must happen while memblock is still active, but + * as close as possible to buddy initialization + */ + kho_memory_init(); + mem_init(); mem_init_print_info(); kmem_cache_init(); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f5c7e2fa18d6b11cc8c379b53deba619b2387b52..95c8c4a286a5b675a8e4a03d91d92b308b229cfe 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -665,7 +665,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end, * RETURNS: * 0 on success, -errno on failure. */ -static int vmap_pages_range(unsigned long addr, unsigned long end, +int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot, struct page **pages, unsigned int page_shift) { int err; @@ -2487,6 +2487,11 @@ static inline unsigned int vm_area_page_order(struct vm_struct *vm) #endif } +unsigned int get_vm_area_page_order(struct vm_struct *vm) +{ + return vm_area_page_order(vm); +} + static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order) { #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC @@ -2624,7 +2629,7 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm) vm->flags &= ~VM_UNINITIALIZED; } -static struct vm_struct *__get_vm_area_node(unsigned long size, +struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long align, unsigned long shift, unsigned long flags, unsigned long start, unsigned long end, int node, gfp_t gfp_mask, const void *caller) diff --git a/tools/testing/memblock/tests/alloc_api.c b/tools/testing/memblock/tests/alloc_api.c index 68f1a75cd72c440d7a8a5301fc100e451605f7e7..c55f67dd367d05a684eb39568dd36298d7438baa 100644 --- a/tools/testing/memblock/tests/alloc_api.c +++ b/tools/testing/memblock/tests/alloc_api.c @@ -134,7 +134,7 @@ static int alloc_top_down_before_check(void) PREFIX_PUSH(); setup_memblock(); - memblock_reserve(memblock_end_of_DRAM() - total_size, r1_size); + memblock_reserve_kern(memblock_end_of_DRAM() - total_size, r1_size); allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES); @@ -182,7 +182,7 @@ static int alloc_top_down_after_check(void) total_size = r1.size + r2_size; - memblock_reserve(r1.base, r1.size); + memblock_reserve_kern(r1.base, r1.size); allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES); @@ -231,8 +231,8 @@ static int alloc_top_down_second_fit_check(void) total_size = r1.size + r2.size + r3_size; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES); @@ -285,8 +285,8 @@ static int alloc_in_between_generic_check(void) total_size = r1.size + r2.size + r3_size; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES); @@ -422,7 +422,7 @@ static int alloc_limited_space_generic_check(void) setup_memblock(); /* Simulate almost-full memory */ - memblock_reserve(memblock_start_of_DRAM(), reserved_size); + memblock_reserve_kern(memblock_start_of_DRAM(), reserved_size); allocated_ptr = run_memblock_alloc(available_size, SMP_CACHE_BYTES); @@ -608,7 +608,7 @@ static int alloc_bottom_up_before_check(void) PREFIX_PUSH(); setup_memblock(); - memblock_reserve(memblock_start_of_DRAM() + r1_size, r2_size); + memblock_reserve_kern(memblock_start_of_DRAM() + r1_size, r2_size); allocated_ptr = run_memblock_alloc(r1_size, SMP_CACHE_BYTES); @@ -655,7 +655,7 @@ static int alloc_bottom_up_after_check(void) total_size = r1.size + r2_size; - memblock_reserve(r1.base, r1.size); + memblock_reserve_kern(r1.base, r1.size); allocated_ptr = run_memblock_alloc(r2_size, SMP_CACHE_BYTES); @@ -705,8 +705,8 @@ static int alloc_bottom_up_second_fit_check(void) total_size = r1.size + r2.size + r3_size; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc(r3_size, SMP_CACHE_BYTES); diff --git a/tools/testing/memblock/tests/alloc_helpers_api.c b/tools/testing/memblock/tests/alloc_helpers_api.c index 3ef9486da8a09350422634ceaafeb8b687c7757e..e5362cfd2ff303592cc04f3a844cf51473542075 100644 --- a/tools/testing/memblock/tests/alloc_helpers_api.c +++ b/tools/testing/memblock/tests/alloc_helpers_api.c @@ -163,7 +163,7 @@ static int alloc_from_top_down_no_space_above_check(void) min_addr = memblock_end_of_DRAM() - SMP_CACHE_BYTES * 2; /* No space above this address */ - memblock_reserve(min_addr, r2_size); + memblock_reserve_kern(min_addr, r2_size); allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr); @@ -199,7 +199,7 @@ static int alloc_from_top_down_min_addr_cap_check(void) start_addr = (phys_addr_t)memblock_start_of_DRAM(); min_addr = start_addr - SMP_CACHE_BYTES * 3; - memblock_reserve(start_addr + r1_size, MEM_SIZE - r1_size); + memblock_reserve_kern(start_addr + r1_size, MEM_SIZE - r1_size); allocated_ptr = memblock_alloc_from(r1_size, SMP_CACHE_BYTES, min_addr); diff --git a/tools/testing/memblock/tests/alloc_nid_api.c b/tools/testing/memblock/tests/alloc_nid_api.c index 49bb416d34ffcc3bbf5578fec3504bccb24f0719..562e4701b0e0265ce169deb0e426e6dabab84420 100644 --- a/tools/testing/memblock/tests/alloc_nid_api.c +++ b/tools/testing/memblock/tests/alloc_nid_api.c @@ -324,7 +324,7 @@ static int alloc_nid_min_reserved_generic_check(void) min_addr = max_addr - r2_size; reserved_base = min_addr - r1_size; - memblock_reserve(reserved_base, r1_size); + memblock_reserve_kern(reserved_base, r1_size); allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -374,7 +374,7 @@ static int alloc_nid_max_reserved_generic_check(void) max_addr = memblock_end_of_DRAM() - r1_size; min_addr = max_addr - r2_size; - memblock_reserve(max_addr, r1_size); + memblock_reserve_kern(max_addr, r1_size); allocated_ptr = run_memblock_alloc_nid(r2_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -436,8 +436,8 @@ static int alloc_nid_top_down_reserved_with_space_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -499,8 +499,8 @@ static int alloc_nid_reserved_full_merge_generic_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -563,8 +563,8 @@ static int alloc_nid_top_down_reserved_no_space_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr, @@ -909,8 +909,8 @@ static int alloc_nid_bottom_up_reserved_with_space_check(void) min_addr = r2.base + r2.size; max_addr = r1.base; - memblock_reserve(r1.base, r1.size); - memblock_reserve(r2.base, r2.size); + memblock_reserve_kern(r1.base, r1.size); + memblock_reserve_kern(r2.base, r2.size); allocated_ptr = run_memblock_alloc_nid(r3_size, SMP_CACHE_BYTES, min_addr, max_addr,