diff --git a/tutorials/source_en/parallel/msrun_launcher.md b/tutorials/source_en/parallel/msrun_launcher.md
index dc305e8889d83e1ba0d8d9392e78cefa62a82715..e4ab661485dcde5e1fa05d1d72143297cd26f028 100644
--- a/tutorials/source_en/parallel/msrun_launcher.md
+++ b/tutorials/source_en/parallel/msrun_launcher.md
@@ -80,7 +80,7 @@ A parameters list of command line:
 Enable processes binding CPU cores.
 Bool/Dict
 True/False or a device-to-CPU-range dict. Default: False.
-If set to True, msrun will automatically allocates CPU ranges based on device affinity; when manually passing a dict, e.g., {"device0":["0-10"],"device1":["11-20"]}, it assigns CPU range 0-10 to process 0 (device0) and 11-20 to process 1 (device1).
+If set to True, msrun will automatically allocate CPU ranges based on device affinity; if a dictionary is passed manually, cores are bound according to the CPU ranges given in the dictionary. For configuration details, see the **Process-Level Core Binding** section.
 --sim_level
@@ -494,4 +494,69 @@ msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log
 - `p` (print): Prints the value of a variable. For example, `p variable` displays the current value of the variable `variable`.
 - `l` (list): Display the context of the current code.
 - `b` (break): Set a breakpoint, either by specifying a line number or a function name.
-- `h` (help): Display a help message listing all available commands.
\ No newline at end of file
+- `h` (help): Display a help message listing all available commands.
+
+## Process-Level Core Binding
+
+`msrun` supports setting the CPU affinity of a process at startup through the `--bind_core` parameter. Internally, `msrun` calls `taskset -c CPUA-CPUB python XXX.py` to bind the process to the CPU cores in the range `CPUA` to `CPUB` when it starts the Python file. Process-level core binding can derive the binding strategy automatically from the current environment, and it also lets users define a custom strategy.
+
+### 1. Automatic Core Binding (`--bind_core=True`)
+
+- **Function**: Automatically allocate CPU core ranges based on the current environment (CPU resources, NUMA nodes, device affinity), without specifying core numbers manually.
+- **Automated allocation logic**:
+
+    - CPU cores within the affinity pool are used first; if the affinity pool does not have enough cores, cores outside the pool are used.
+    - Automatic core binding relies on system commands (such as `lscpu` and `npu-smi`) to obtain hardware information; if a command fails, the allocation strategy is generated from the available CPU resources only.
+    - The affinity relationship between CPUs and NPUs is obtained in the same way as in the MindSpore interface `mindspore.runtime.set_cpu_affinity`; see [mindspore.runtime.set_cpu_affinity](https://www.mindspore.cn/docs/en/master/api_python/runtime/mindspore.runtime.set_cpu_affinity.html).
+
+### 2. Custom Core Binding
+
+- **Function**: Customize the core binding strategy based on user input parameters.
+- **Format Requirement**: Pass a dictionary in JSON format. In a shell, wrap the `{}` in single quotes (`''`).
+- **Parameter Description**:
+
+    - Each `key` of the dictionary is either `scheduler` (the scheduling process) or `deviceX` (a device process, where `X` is the device number).
+    - Each `value` of the dictionary is a list of CPU core range segments (e.g., `["0-9", "20-29"]`).
+
+- **Example**:
+
+    ```bash
+    --bind_core='{"scheduler":["0-9"], "device0":["10-19"], "device1":["20-29", "40-49"]}'
+    ```
+
+    - Allocate CPU cores 0-9 to the `scheduler` process.
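+    - Allocate CPU cores 10-19 to worker process 0 (corresponding to `device0`).
+    - Allocate CPU cores 20-29 and 40-49 to worker process 1 (corresponding to `device1`).
+
+- **Notes**:
+
+    1. The `deviceX` keys must match the device numbers that the worker processes actually use. For example, if `ASCEND_RT_VISIBLE_DEVICES=6,7` is configured so that process 0 corresponds to `device6` and process 1 corresponds to `device7`, the keys must be `device6` and `device7` for core binding to take effect:
+
+        ```bash
+        --bind_core='{"scheduler":["0-9"], "device6":["10-19"], "device7":["20-29", "40-49"]}'
+        ```
+
+        The scheduler process does not occupy device resources, so it does not participate in device sorting. The order of the keys does not matter (for example, `scheduler` and `device6` in the example above can be swapped).
+    2. If the list of CPU range segments is empty, the affinity setting for that process is skipped. For example:
+
+        ```bash
+        --bind_core='{"scheduler":[], "device0":[], "device1":["20-29", "40-49"]}'
+        ```
+
+        An empty list for `scheduler` or `device0` means core binding is not performed for those processes.
+    3. It is recommended to configure every process explicitly in `--bind_core`, so that each worker process has its own key-value pair. For example, in a single-machine, two-device task where only worker process 1 needs core binding, all processes (including those that do not need core binding) must still be listed:
+
+        ```bash
+        # correct example
+        --bind_core='{"scheduler":[], "device0":[], "device1":["20-29", "40-49"]}'
+
+        # wrong example
+        --bind_core='{"device1":["20-29", "40-49"]}'
+        ```
+
+        In the wrong example, worker process 0 may be mistakenly identified as corresponding to `device1` and thus have core binding skipped. The `scheduler` and worker process 1 will also be skipped because they are not included in the configuration.
+
+### 3. Disabling Core Binding (`--bind_core=False`)
+
+- **Function**: Do not enable process-level core binding.
+- **Default Value**: The default value of the `msrun --bind_core` parameter is `False`.
diff --git a/tutorials/source_zh_cn/parallel/msrun_launcher.md b/tutorials/source_zh_cn/parallel/msrun_launcher.md
index 56a8e1fa7a169417307035d7883905e6e554dbc7..198f21d554ea5dba4c719daaf079b36f73ca5cc3 100644
--- a/tutorials/source_zh_cn/parallel/msrun_launcher.md
+++ b/tutorials/source_zh_cn/parallel/msrun_launcher.md
@@ -80,7 +80,7 @@
 Enable core binding for processes.
 Bool/Dict
 True, False, or a dictionary that assigns CPU range segments to specific devices. Default: False.
-If set to True, CPU range segments are allocated automatically based on environment information and device affinity; if a dictionary such as {"device0":["0-10"],"device1":["11-20"]} is passed manually, process 0 (corresponding to device0) is assigned CPU range 0-10 and process 1 (corresponding to device1) is assigned CPU range 11-20.
+If set to True, CPU range segments are allocated automatically based on environment information and device affinity; if a dictionary is passed manually, cores are bound according to the CPU range segments in that dictionary. For configuration details, see the **Process-Level Core Binding** section.
 --sim_level
@@ -495,3 +495,70 @@ msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log
 - `l` (list): Display the context of the current code.
 - `b` (break): Set a breakpoint, either by line number or by function name.
 - `h` (help): Display help information listing all available commands.
+
+## Process-Level Core Binding
+
+`msrun` supports setting the CPU affinity of a process at startup through the `--bind_core` parameter. Internally, `msrun` calls `taskset -c CPUA-CPUB python XXX.py` to bind the process to the CPU cores in the range `CPUA` to `CPUB` when it starts the Python file. Process-level core binding can derive the binding strategy automatically from the current environment, and it also supports user-defined strategies.
+
+### 1. Automatic Core Binding (--bind_core=True)
+
+- **Function**: Automatically allocate CPU core ranges based on the current environment (CPU resources, NUMA nodes, device affinity), without specifying core numbers manually.
+- **Automatic allocation logic**:
+
+    - CPU cores within the affinity pool are used first; if the affinity pool does not have enough cores, cores outside the affinity pool are used.
+    - Automatic core binding relies on system commands (such as `lscpu` and `npu-smi`) to obtain hardware information; if a command fails, the allocation strategy is generated from the available CPU resources only.
+    - The affinity relationship between CPUs and NPUs is obtained in the same way as in the MindSpore interface `mindspore.runtime.set_cpu_affinity`; see [mindspore.runtime.set_cpu_affinity](https://www.mindspore.cn/docs/zh-CN/master/api_python/runtime/mindspore.runtime.set_cpu_affinity.html).
+
+### 2. Custom Core Binding
+
+- **Function**: Customize the core binding strategy based on user-supplied parameters.
+- **Format requirement**: Pass a dictionary in JSON format. In a shell, wrap the `{}` in single quotes (`''`).
+- **Parameter description**:
+
+    - Each `key` of the dictionary is either `scheduler` (the scheduling process) or `deviceX` (a device process, where `X` is the device number).
+    - Each `value` of the dictionary is a list of CPU core range segments (e.g., `["0-9", "20-29"]`).
+
+- **Example**:
+
+    ```bash
+    --bind_core='{"scheduler":["0-9"], "device0":["10-19"], "device1":["20-29", "40-49"]}'
+    ```
+
+    This means:
+
+    - CPU cores 0-9 are allocated to the `scheduler` process;
+    - CPU cores 10-19 are allocated to worker process 0 (corresponding to `device0`);
+    - CPU cores 20-29 and 40-49 are allocated to worker process 1 (corresponding to `device1`).
+
+- **Notes**:
+
+    1. The process number must match the device number. For example, if `ASCEND_RT_VISIBLE_DEVICES=6,7` is configured so that process 0 corresponds to `device6` and process 1 corresponds to `device7`, the configuration must be written as follows; otherwise the corresponding processes cannot be bound:
+
+        ```bash
+        --bind_core='{"scheduler":["0-9"], "device6":["10-19"], "device7":["20-29", "40-49"]}'
+        ```
+
+        The scheduler process does not occupy device resources, so it does not participate in device sorting, and the order of the keys does not matter (for example, `scheduler` and `device6` in the example above can be swapped).
+    2. If the list of CPU range segments is empty, the affinity setting for that process is skipped. For example:
+
+        ```bash
+        --bind_core='{"scheduler":[], "device0":[], "device1":["20-29", "40-49"]}'
+        ```
+
+        This skips core binding for the `scheduler` process and worker process 0, and allocates CPU cores only to worker process 1 (`device1`).
+    3. It is recommended that the number of worker processes be consistent with the number of key-value pairs in the `--bind_core` dictionary. For example, in a single-machine, two-device task where only worker process 1 needs core binding, all processes (including those that are not bound) must be configured explicitly:
+
+        ```bash
+        # Correct example
+        --bind_core='{"scheduler":[], "device0":[], "device1":["20-29", "40-49"]}'
+
+        # Incorrect example
+        --bind_core='{"device1":["20-29", "40-49"]}'
+        ```
+
+        In the incorrect example, worker process 0 may be mistakenly identified as corresponding to `device1` and thus have core binding skipped; the `scheduler` and worker process 1 are also skipped because they are not in the configuration.
+
+### 3. Disabling Core Binding (--bind_core=False)
+
+- **Function**: Do not enable process-level core binding.
+- **Default value**: The default value of the `msrun --bind_core` parameter is `False`.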
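
As a reading aid for the custom core binding rules in this change, the following Python sketch builds a `--bind_core` option that lists every process explicitly, using an empty list for processes that should not be bound. It is not part of `msrun`: the helper names `build_bind_core_arg` and `taskset_cpu_list` are hypothetical, and joining range segments with commas is only an assumption about how a multi-segment list could be handed to `taskset -c` (which does accept comma-separated CPU lists).

```python
import json
import shlex


def build_bind_core_arg(device_ids, cpu_ranges):
    """Return a shell-quoted --bind_core option that covers every process.

    device_ids: device numbers used by the worker processes, e.g. [6, 7].
    cpu_ranges: mapping such as {"device7": ["20-29", "40-49"]}. Processes
    missing from it get an empty list, which skips binding for them.
    (Hypothetical helper, not an msrun API.)
    """
    keys = ["scheduler"] + [f"device{i}" for i in device_ids]
    unknown = set(cpu_ranges) - set(keys)
    if unknown:
        # A key such as "device0" is useless when the job only sees devices 6 and 7.
        raise ValueError(f"keys do not match any process: {sorted(unknown)}")
    config = {key: list(cpu_ranges.get(key, [])) for key in keys}
    # Compact JSON, then single-quote it so the shell passes the braces through.
    payload = json.dumps(config, separators=(",", ":"))
    return "--bind_core=" + shlex.quote(payload)


def taskset_cpu_list(ranges):
    """Join range segments into a comma-separated CPU list (assumed format)."""
    return ",".join(ranges)


if __name__ == "__main__":
    # Two-device job where only worker process 1 (device1) is bound.
    print(build_bind_core_arg([0, 1], {"device1": ["20-29", "40-49"]}))
    print(taskset_cpu_list(["20-29", "40-49"]))
```

Generating the dictionary from the list of device numbers keeps the keys aligned with the devices that the processes actually use, which is the failure mode described in notes 1 and 3.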