diff --git a/2025/Global_Memory_Planning_for_LLM.en.md b/2025/Global_Memory_Planning_for_LLM.en.md
index 9582a110d036a3e182dcf98c7b474f016b02d4cb..d81160800a7e7671c784fc2f1f16d84fb3ddec81 100644
--- a/2025/Global_Memory_Planning_for_LLM.en.md
+++ b/2025/Global_Memory_Planning_for_LLM.en.md
@@ -1,4 +1,3 @@
-#
-2025 competition:Global Memory Planning for Large Model Training and Inference
+# 2025 Competition: Global Memory Planning for Large Model Training and Inference
## **Ⅰ. Background**
With the development of AI, the parameter count and supported sequence length of large models have increased significantly. Against this backdrop, the memory required for training and inference of large models has grown accordingly, leading to frequent out-of-memory failures. To address the challenge of insufficient memory, the industry has adopted memory offloading as one of the mainstream solutions. For instance, during large model inference, if the KVCache cannot be fully stored in the memory of NPUs/GPUs, a portion of the KVCache can be offloaded to the host memory and fetched back to the NPU/GPU when needed, thereby alleviating memory pressure. However, memory offloading solutions usually introduce considerable overhead (depending on the size of the data to be offloaded and the available bandwidth). Generally, it is necessary to overlap the memory offloading process with the operator computation process, so that each hides the cost of the other.
@@ -6,11 +6,11 @@ With the development of AI, the parameter count and supported sequence length of
The competition focuses on the entire training/inference process of large models. It abstracts the memory access requirements throughout the full training-inference workflow and derives the optimal timing for memory offloading/loading by solving global memory access problems. This approach not only resolves the out-of-memory issue but also minimizes the impact on end-to-end latency as much as possible. The competition requires participants to meet the basic functional requirements of the problem; those with high completion rates or excellent algorithm performance will win. Additionally, participants are expected to maintain good coding style, and excellent code quality may be considered for extra points.
## **Ⅱ. Problem Description**
-For the training/inference process of large models, the abstraction of global memory access can be expressed using a sequence of 4-tuples as follows:[addr_0, size_0, start_0, time_0], [addr_1, size_1, start_1, time_1], … [addr_n, size_n, start_n, time_n]
+For the training/inference process of large models, the abstraction of global memory access can be expressed using a sequence of 4-tuples as follows: [$addr_0$, $size_0$, $start_0$, $time_0$], [$addr_1$, $size_1$, $start_1$, $time_1$], …, [$addr_n$, $size_n$, $start_n$, $time_n$]
Addr and size define a memory segment, representing the starting address and size of this segment. Start and time define the time period during which this memory segment is accessed, representing the start time of access and the duration of access.
The following points require attention:
1. The same memory segment (or its subset) may be accessed repeatedly, and there may be overlaps between different memory segments.
-2. The memory access times satisfy the condition start_0 ≤ start_1 ≤ … ≤ start_n, which also means that access to different memory segments may occur simultaneously.
+2. The memory access times satisfy the condition $start_0 \le start_1 \le … \le start_n$, which also means that access to different memory segments may occur simultaneously.
3. Each 4-tuple in the memory access sequence is regarded as a computation process. This process can be parallelized with the memory offloading/loading process, and neither affects the execution time of the other.
4. The memory offloading and loading processes cannot be parallelized with each other.
5. In the initial state (before the execution of the first memory sequence), all memory is idle, and no data has been loaded into the memory.
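For concreteness, each 4-tuple above can be modeled as a small record type. This is a sketch for illustration only; the `Access` name is ours, not part of the problem statement:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Access:
    addr: int   # starting virtual address of the segment
    size: int   # size of the segment: it covers [addr, addr + size)
    start: int  # earliest time this access may begin
    time: int   # duration of the access once started
```
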
@@ -22,29 +22,29 @@ For a given memory access sequence, appropriate memory offloading/loading operat
## **Ⅳ. Use Cases**
**Input:**
-The first line inputs three integers L, M, and N. Among them: L represents the total virtual address size of the process; M represents the total capacity of HBM (High-Bandwidth Memory); N represents the length of the memory access sequence.The constraints are 1 ≤ M ≤ L ≤ 100000 and 1 ≤ N ≤ 10000.The next N lines each contain four integers: addrᵢ, sizeᵢ, startᵢ, timeᵢ (with constraints 1 ≤ addrᵢ, sizeᵢ ≤ L and 1 ≤ startᵢ, timeᵢ ≤ 10⁹). addrᵢ denotes the starting virtual address of the memory segment;
-sizeᵢ denotes the size of the virtual address segment;
-startᵢ denotes the earliest accessible time of the memory segment; timeᵢ denotes the duration for which the memory segment is accessed. The input guarantees that startᵢ is monotonically increasing.
+The first line contains three integers $L$, $M$, and $N$. Among them: $L$ represents the total virtual address size of the process; $M$ represents the total capacity of HBM (High-Bandwidth Memory); $N$ represents the length of the memory access sequence. The constraints are $1 \le M \le L \le 100000$ and $1 \le N \le 10000$. The next $N$ lines each contain four integers $addr_i$, $size_i$, $start_i$, $time_i$ (with constraints $1 \le addr_i, size_i \le L$; $addr_i + size_i \le L$; $1 \le start_i, time_i \le 10^9$). $addr_i$ denotes the starting virtual address of the memory segment;
+$size_i$ denotes the size of the virtual address segment;
+$start_i$ denotes the earliest accessible time of the memory segment; $time_i$ denotes the duration for which the memory segment is accessed. The input guarantees that $start_i$ is **monotonically non-decreasing**.
**Output:**
-Output a maximum of 5*N lines. Each line represents an execution operation, which is divided into four types:
-1. Reload Ti Ai SiIndicates a prefetch operation. It means loading the memory with starting virtual address Ai and size Si into HBM at time Ti, which takes a total of 40 * Si time. The constraints are 1 ≤ Ai, Si and Ai + Si ≤ L.
-2. Visit Ti AiIndicates a memory access operation. It means starting the Ai-th memory access request at time Ti, which takes a total of time[Ai] (the access duration of the Ai-th memory segment) time. The constraints are 0 ≤ Ai < N and Ai (the index of memory access requests) must be monotonically increasing.
-3. Offload Ti Ai SiIndicates an offload operation. It means releasing the memory with starting virtual address Ai and size Si from HBM at time Ti, which takes a total of 40 * Si time. The constraints are 1 ≤ Ai, Si and Ai + Si ≤ L.
-4. Fin T Indicates the end of the task. It means the time when all computing tasks are completely finished is T. This operation must be output only once in the last line.
+Output a maximum of $5\times N$ lines. Each line represents an execution operation, which is divided into four types:
+1. *Reload* $T_i$ $A_i$ $S_i$. Indicates a prefetch operation. It means loading the memory with starting virtual address $A_i$ and size $S_i$ into HBM at time $T_i$, which takes a total of $40 \times S_i$ time. The constraints are $1 \le A_i, S_i$ and $A_i + S_i \le L$.
+2. *Visit* $T_i$ $A_i$. Indicates a memory access operation. It means starting the $A_i$-th memory access request at time $T_i$, which takes $time_{A_i}$ (the access duration of the $A_i$-th memory segment). The constraints are $0 \le A_i \lt N$, and $A_i$ (the index of memory access requests) must be monotonically increasing.
+3. *Offload* $T_i$ $A_i$ $S_i$. Indicates an offload operation. It means releasing the memory with starting virtual address $A_i$ and size $S_i$ from HBM at time $T_i$, which takes a total of $40 \times S_i$ time. The constraints are $1 \le A_i, S_i$ and $A_i + S_i \le L$.
+4. *Fin* $T$. Indicates the end of the task. It means the time when all computing tasks are completely finished is $T$. This operation must be output only once in the last line.
Among the operations:
- Operations 1 and 3 are read-write type operations;
- Operation 2 is a computation type operation.
-Read-write type operations and computation type operations can be completed in parallel, but operations of the same type can only be completed in parallel (i.e., read-write operations cannot be parallelized with other read-write operations, and computation operations cannot be parallelized with other computation operations).
+Read-write type operations and computation type operations can proceed in parallel, but operations of the same type must execute serially (i.e., read-write operations cannot be parallelized with other read-write operations, and computation operations cannot be parallelized with other computation operations).
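The two-channel rule above can be sketched as a small timing simulator (hypothetical helper names, not part of the judge): read-write operations serialize on one timeline, *Visit* operations on another, and the two timelines overlap freely.

```python
# Sketch of the timing model: Reload/Offload share one serial channel,
# Visit operations share another; the two channels run in parallel.
IO_COST_PER_UNIT = 40  # a Reload/Offload of size S takes 40 * S time

def simulate(ops, times):
    """ops: list of ("Reload", t, addr, size), ("Offload", t, addr, size),
    or ("Visit", t, i); times[i] is the duration of the i-th access.
    Returns the finish times of the two channels; raises if an operation
    starts while its own channel is still busy."""
    io_free = 0       # earliest time the read-write channel is idle
    compute_free = 0  # earliest time the compute channel is idle
    for op in ops:
        kind, t = op[0], op[1]
        if kind in ("Reload", "Offload"):
            if t < io_free:
                raise ValueError(f"{kind} at {t} overlaps a read-write op")
            io_free = t + IO_COST_PER_UNIT * op[3]
        else:  # Visit
            if t < compute_free:
                raise ValueError(f"Visit at {t} overlaps a computation op")
            compute_free = t + times[op[2]]
    return io_free, compute_free
```

Applied to the Case 1 schedule shown below, the later of the two channel finish times is 12040, which matches `Fin 12040`.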
-**Input Case 1:**
+**Input Case 1:**
200 100 2
0 100 0 30
100 100 50 10
-**Output Case 1:**
+**Output Case 1:**
Reload 0 0 100
Visit 4000 0
Offload 4030 0 100
@@ -52,13 +52,23 @@ Reload 8030 100 100
Visit 12030 1
Fin 12040
-**Input Case 2:**
+
+**Case 1 Explanation:**
+1. At the initial time $T = 0$, no virtual address space resides in the GPU/NPU memory. The memory capacity is $100$.
+2. Since memory access #0 requires the virtual address range $[0, 100)$, a *Reload* operation is performed to load this range into memory. The *Reload* operation loads $100$ units of memory and costs $4000$ units of time. At this point, $T = 4000$, and the data in $[0, 100)$ is now resident in memory.
+3. The earliest start time of memory access #0 is $0 \le T$, so it can begin immediately. Output a *Visit* operation, which takes $30$ units of time. Now $T = 4030$.
+4. Since the memory is full, before performing access #1, an *Offload* operation must be executed. The virtual address range $[0, 100)$ is offloaded, and $T = 8030$ afterward.
+5. The virtual address range $[100, 200)$ is then loaded into memory, which takes $4000$ units of time. Now $T = 12030$.
+6. The earliest start time of memory access #1 is $50 \le T$, so it can begin immediately. Output a *Visit* operation, which takes $10$ units of time. Now $T = 12040$.
+7. All memory access operations have been completed. Output *Fin*. The total elapsed time is $T = 12040$.
+
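The residency bookkeeping in the walkthrough above can also be checked mechanically. Below is a sketch (function and variable names are ours); unit-granularity address sets are adequate at this problem's scale of $L \le 100000$:

```python
def replay(M, ops, requests):
    """Replay a schedule against HBM capacity M. requests[i] is the
    (addr, size, start, time) tuple of the i-th access. Raises if HBM
    overflows or a Visit touches non-resident memory."""
    resident = set()  # virtual addresses currently resident in HBM
    for op in ops:
        if op[0] == "Reload":
            _, t, a, s = op
            resident |= set(range(a, a + s))
            if len(resident) > M:
                raise ValueError("HBM capacity exceeded")
        elif op[0] == "Offload":
            _, t, a, s = op
            resident -= set(range(a, a + s))
        else:  # Visit: the whole requested range must already be resident
            _, t, i = op
            a, s, _, _ = requests[i]
            if not set(range(a, a + s)) <= resident:
                raise ValueError(f"access {i} touches non-resident memory")
    return resident
```

Replaying the Case 1 schedule leaves exactly the range $[100, 200)$ resident, as step 5 describes.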
+**Input Case 2:**
300 200 3
0 100 0 50
100 100 4000 30
150 100 4001 20
-**Output Case 2:**
+**Output Case 2:**
Reload 0 0 100
Visit 4000 0
Reload 4000 100 100
@@ -68,7 +78,12 @@ Reload 10000 200 50
Visit 12000 2
Fin 12020
+**Case 2 Explanation:**
+Note that read-write operations (*Reload*/*Offload*) and computation operations (*Visit*) can proceed in parallel. Therefore, at time $T = 4000$, the *Visit* and *Reload* operations can start simultaneously.
+Since the virtual address ranges accessed by memory access #1 and #2 partially overlap, the *Offload* operation does not need to evict the entire memory region accessed by access #1.
+
+
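A naive baseline that satisfies the output format can be sketched as follows (our own naming, not the reference solution): it runs fully serially, never overlapping transfers with *Visit* operations, and assumes each single request fits in HBM on its own ($size_i \le M$).

```python
def runs(addrs):
    """Group a sorted list of unit addresses into contiguous (start, size) runs."""
    out = []
    for a in addrs:
        if out and out[-1][0] + out[-1][1] == a:
            out[-1][1] += 1
        else:
            out.append([a, 1])
    return [tuple(r) for r in out]

def naive_plan(L, M, requests):
    """requests: list of (addr, size, start, time). Returns the op list,
    ending with ("Fin", t). Valid but deliberately unoptimized."""
    ops, resident, t = [], set(), 0
    for i, (a, s, start, dur) in enumerate(requests):
        need = set(range(a, a + s))
        # if loading the needed range would overflow HBM, evict everything else
        if len(resident | need) > M:
            for ra, rs in runs(sorted(resident - need)):
                ops.append(("Offload", t, ra, rs))
                t += 40 * rs
            resident &= need
        # reload whatever part of the needed range is not yet resident
        for ra, rs in runs(sorted(need - resident)):
            ops.append(("Reload", t, ra, rs))
            t += 40 * rs
        resident |= need
        t = max(t, start)  # honor the earliest start time
        ops.append(("Visit", t, i))
        t += dur
    ops.append(("Fin", t))
    return ops
```

On Input Case 1 this reproduces the sample schedule exactly; a competitive solution would additionally prefetch during *Visit* operations to hide transfer time, as the Case 2 sample output does.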
## **Ⅴ. Evaluation**
-- Basic Requirements (Level 0): The algorithm's output must meet Objective 1 and Objective
+- Basic Requirements (Level 0): The algorithm's output must satisfy Objective 1 and Objective 2 and conform to the output format; the algorithm must complete execution within **10 seconds** and use no more than **1 GB of memory**.
- Advanced Requirements (Level 1): On the premise of meeting the basic requirements, the total time to complete the computation of the entire input sequence (i.e., the end time of the last memory access operation) should be as short as possible.
- Additional Bonus Items: Providing detailed algorithm description documents, etc.
\ No newline at end of file
diff --git a/2025/Global_Memory_Planning_for_LLM.md b/2025/Global_Memory_Planning_for_LLM.md
index ccc579f6ae3238d665acbcdfb9683e2d785b7028..8f68e12bb5bca9837ca22e0f19e247159f5a2e20 100644
--- a/2025/Global_Memory_Planning_for_LLM.md
+++ b/2025/Global_Memory_Planning_for_LLM.md
@@ -7,12 +7,12 @@
## **二、问题描述**
对于大模型训练/推理的过程,对于全局内存访问的抽象可以用以下四元组的序列来表达:
-[addr_0, size_0, start_0, time_0], [addr_1, size_1, start_1, time_1], … [addr_n, size_n, start_n, time_n]
+[$addr_0$, $size_0$, $start_0$, $time_0$], [$addr_1$, $size_1$, $start_1$, $time_1$], … [$addr_n$, $size_n$, $start_n$, $time_n$]
addr和size定义了一段内存,表示这段内存的起始地址和大小。
start和time定义了这段内存被访问的时间段,表示开始访问的起始时间和持续时间。
有以下几点需要注意:
1. 同一段内存(的子集)可能被反复访问,并且不同的内存段之间可能存在交集。
-2. 内存访问时间满足start_0 <= start_1 <= … <= start_n,这也意味着不同段内存的访问可能是同时发生的。
+2. 内存访问时间满足$start_0 \le start_1 \le ... \le start_n$,这也意味着不同段内存的访问可能是同时发生的。
3. 内存访问序列的每个四元组都被认为是一个计算过程,这个过程可以和内存卸载/加载过程并行,并且不影响彼此的执行时间。
4. 内存卸载和加载过程,不能并行。
5. 在初始状态时(第一条内存序列执行前),内存都是空闲的,所有的数据都未加载到内存。
@@ -25,17 +25,17 @@ start和time定义了这段内存被访问的时间段,表示开始访问的
## **四、用例描述**
**输入格式:**
-第一行输入三个整数L,M,N,其中L表示进程总虚拟地址大小,M表示HBM总容量,N表示内存访问序列长度(1<= M <= L <=100000,1 <= N <= 10000)。接下来N行,每行4个整数addri, sizei, starti, timei (1 <= addri, sizei, <= L, 1 <= starti, timei <= 109)。其中addri表示该段内存的起始虚拟地址,sizei表示该段虚拟地址大小,starti表示该段内存的最早可访问时间,timei表示该段内存被访问的持续时间。输入保证starti是单调递增的。
+第一行输入三个整数$L$,$M$,$N$,其中$L$表示进程总虚拟地址大小,$M$表示HBM总容量,$N$表示内存访问序列长度($1 \le M \le L \le 100000$; $1 \le N \le 10000$)。接下来$N$行,每行4个整数$addr_i$, $size_i$, $start_i$, $time_i$ ($1 \le addr_i, size_i \le L$; $addr_i + size_i \le L$; $1 \le start_i, time_i \le 10^9$)描述一次内存访问。其中$addr_i$表示该段内存的起始虚拟地址,$size_i$表示该段虚拟地址大小,$start_i$表示该段内存的最早可访问时间,$time_i$表示该段内存被访问的持续时间。输入保证$start_i$是**单调不减**的。
**输出格式:**
-输出最多5*N行,每一行表示一个执行操作,分为四种类型:
-1. Reload Ti Ai Si,表示预取操作,意为在Ti时刻把起始虚拟地址为Ai,大小为Si的内存载入HBM,总共花费时间40* Si。需要保证1<=Ai, Si,且Ai + Si <= L。
-2. Visit Ti Ai,表示访存操作,意为在Ti时刻开始进行第Ai个内存访问诉求,总共花费时间为timeAi。需要保证0 <= Ai < N,且访存操作的Ai单调递增。
-3. Offload Ti Ai Si,表示卸载操作,意为在Ti时刻把起始虚拟地址为Ai,大小为Si的内存从HBM中释放,总共花费时间40* Si。需要保证1<=Ai, Si,且Ai + Si <= L。
-4. Fin T,表示任务结束,意为所有计算任务全部完成的时间为T。需要保证该操作只在最后一行输出一次。
+输出最多$5 \times N$行,每一行表示一个执行操作,分为四种类型:
+1. *Reload* $T_i$ $A_i$ $S_i$,表示预取操作,意为在$T_i$时刻把起始虚拟地址为$A_i$,大小为$S_i$的内存载入HBM,总共花费时间$40 \times S_i$。需要保证$1 \le A_i, S_i$,且$A_i + S_i \le L$。
+2. *Visit* $T_i$ $A_i$,表示访存操作,意为在$T_i$时刻开始进行第$A_i$个内存访问诉求,总共花费时间为$time_{A_i}$。需要保证$0 \le A_i \lt N$,且访存操作的$A_i$单调递增。
+3. *Offload* $T_i$ $A_i$ $S_i$,表示卸载操作,意为在$T_i$时刻把起始虚拟地址为$A_i$,大小为$S_i$的内存从HBM中释放,总共花费时间$40 \times S_i$。需要保证$1 \le A_i, S_i$,且$A_i + S_i \le L$。
+4. *Fin* $T$,表示任务结束,意为所有计算任务全部完成的时间为$T$。需要保证该操作只在最后一行输出一次。
-其中操作1和操作3为读写类型操作,操作2为计算类型操作。读写类型操作和计算类型操作可以并行完成,但是同种类型的操作只能并行完成。
+其中操作1和操作3为读写类型操作,操作2为计算类型操作。读写类型操作和计算类型操作**可以并行**完成,但是同种类型的操作**只能串行**完成。
**输入示例一:**
200 100 2
@@ -50,6 +50,15 @@ Reload 8030 100 100
Visit 12030 1
Fin 12040
+**示例一说明:**
+1. 初始时间$T=0$,所有虚拟地址空间均不在GPU/NPU内存中,内存容量为$100$。
+2. 由于序号$0$的内存访问需要用到$[0,100)$范围的虚拟地址,所以输出*Reload*操作把它们载入内存。*Reload*操作加载$100$单位内存,花费$4000$单位时间,此时$T=4000$,虚拟地址$[0,100)$的数据已经在内存中。
+3. 序号$0$的内存访问最早开始时间为$0 \le T$,可以直接开始序号$0$的内存访问,输出*Visit*操作,花费$30$个单位的时间,此时$T=4030$。
+4. 由于此时内存已满,执行序号$1$的访问前需要先做*Offload*操作。虚拟地址空间$[0,100)$完成*Offload*,此时$T=8030$。
+5. 把虚拟地址$[100,200)$的数据载入内存,花费$4000$个单位时间,此时$T=12030$。
+6. 序号$1$的内存访问最早开始时间为$50 \le T$,可以直接开始序号$1$的内存访问,输出*Visit*操作,花费$10$个单位的时间,此时$T=12040$。
+7. 所有内存访问操作都已完成,输出*Fin*,花费总时间$T=12040$。
+
**输入示例二:**
300 200 3
0 100 0 50
@@ -66,7 +75,10 @@ Reload 10000 200 50
Visit 12000 2
Fin 12020
+**示例二说明:**
+注意,读写操作和计算操作是可以并行的,因此在时刻$T=4000$时,可以同时开始*Visit*和*Reload*操作。由于内存访问序号1、2的虚拟地址存在重合部分,做*offload*操作时不需要把序号1所访问的内存全部*offload*。
+
## **五、评选标准**
-- 基础要求(Level 0):算法的输出满足目标一和目标二。
+- 基础要求(Level 0):算法的输出满足目标一和目标二,并且符合输出格式要求;算法执行时间不超过10秒,使用内存不超过1GB。
- 进阶要求(Level 1):在满足基础要求的前提下,完成整个输入序列计算的总时间(即最后一次访存操作结束的时间)越短越好。
- 其他加分项:能提供详细的算法说明文档等。
\ No newline at end of file