diff --git a/docs/mindformers/docs/source_en/feature/ckpt.md b/docs/mindformers/docs/source_en/feature/ckpt.md
index 24eb0b6a09801927df9db35c044596228b92e206..9758105f5f8f0052925d0f652d1ec7d32263f8f4 100644
--- a/docs/mindformers/docs/source_en/feature/ckpt.md
+++ b/docs/mindformers/docs/source_en/feature/ckpt.md
@@ -267,11 +267,11 @@ bash transform_checkpoint.sh \
 
 #### Multi-Node Multi-Device Training on Physical Machines
 
-Training a large-scale model usually needs a cluster of servers. In the multi-node multi-device scenario, if there is a shared disk between servers, the automatic conversion function can be used. Otherwise, only offline conversion can be used. The following example is a training that uses two servers and 16 GPUs.
+Training a large-scale model usually needs a cluster of servers. In the multi-node multi-device scenario, if a unified shared storage path (such as the NFS-mounted `/worker` directory) is configured between servers, the automatic conversion function can be used. Otherwise, only offline conversion can be used. The following example describes training with two servers and 16 GPUs.
 
-**Scenario 1: A shared disk exists between servers.**
+**Scenario 1: A shared storage path is configured between servers.**
 
-If there is a shared disk between servers, you can use MindSpore Transformers to automatically convert a weight before multi-node multi-device training. Assume that `/data` is the shared disk between the servers and the MindSpore Transformers project code is stored in the `/data/mindformers` directory.
+If a unified shared storage path (such as the NFS-mounted `/worker` directory) is configured between servers, you can use MindSpore Transformers to automatically convert weights before multi-node multi-device training.
 
 - **Single-process conversion**
 
@@ -332,9 +332,9 @@
 16 8 ${ip} ${port} 1 output/msrun_log False 300
 ```
 
-**Scenario 2: No shared disk exists between servers.**
+**Scenario 2: No shared path exists between servers.**
 
-If there is no shared disk between servers, you need to use the offline weight conversion tool to convert the weight. The following steps describe how to perform offline weight conversion and start a multi-node multi-device training task.
+If there is no shared path between servers, you need to use the offline weight conversion tool to convert the weights. The following steps describe how to perform offline weight conversion and start a multi-node multi-device training task.
 
 - **Obtain the distributed policy file.**
 
diff --git a/docs/mindformers/docs/source_en/feature/safetensors.md b/docs/mindformers/docs/source_en/feature/safetensors.md
index 4353963d4e5e0b8cc73eb5220724b91541cfd19c..e817c12a17fd6a219b0c7da238ac80160be585fb 100644
--- a/docs/mindformers/docs/source_en/feature/safetensors.md
+++ b/docs/mindformers/docs/source_en/feature/safetensors.md
@@ -234,11 +234,11 @@ In large cluster scale scenarios, to avoid the online merging process taking too
 
-#### Physical Machine Multi-machcine Multi-card Training
+#### Physical Machine Multi-machine Multi-card Training
 
-Large-scale models usually need to be trained by clusters of multiple servers. Weight slicing conversion needs to rely on the target slicing strategy file after the compilation is completed. In this multi-machine and multi-card scenario, if there is a shared disk between servers and the generated strategy file is in the same directory, you can use the automatic conversion function; if there is no shared disk between servers, you need to manually copy the strategy file and then carry out the conversion function. The following is an example of two servers and 16 cards training.
+Large-scale models usually need to be trained on clusters of multiple servers. Weight slicing conversion relies on the target slicing strategy file generated after compilation. In this multi-machine and multi-card scenario, if a unified shared storage path (such as the NFS-mounted `/worker` directory) is configured between servers and the generated strategy files are in the same directory, you can use the automatic conversion function; if there is no shared path between servers, you need to manually copy the strategy files and then perform the conversion. The following is an example of training with two servers and 16 cards.
 
-**Scenario 1: There are shared disks between servers**
+**Scenario 1: A shared storage path is configured between servers**
 
-In scenarios where there are shared disks between servers, you can use MindSpore Transformers Auto-Weight Conversion feature to automatically perform weight conversion prior to multi-computer, multi-card training. Assuming that `/data` is a shared disk on the server and the project code for MindSpore Transformers is located under the `data/mindformers` path.
+If a unified shared storage path (such as the NFS-mounted `/worker` directory) is configured between servers, you can use the MindSpore Transformers automatic weight conversion feature to convert weights automatically before multi-machine, multi-card training.
 
 **Parameter Configuration:**
 
@@ -279,9 +279,9 @@ Use [mindformers/scripts/msrun_launcher.sh](https://gitee.com/mindspore/mindform
 16 8 ${ip} ${port} 1 output/msrun_log False 300
 ```
 
-**Scenario 2: No shared disks between servers**
+**Scenario 2: No shared path between servers**
 
-In the case where there is no shared disk between servers, you need to perform an offline merge and forward operation on the generated strategy files before enabling the online slicing function. The following steps describe how to perform this operation and start a multi-machine, multi-card training task.
+If there is no shared path between servers, you need to merge and transfer the generated strategy files offline before enabling the online slicing function. The following steps describe how to perform this operation and start a multi-machine, multi-card training task.
 
 **1.Getting Distributed Strategies**
 
@@ -548,6 +548,8 @@ ms.load_distributed_checkpoint(
 - **dst_safetensors_dir** (str) - The save directory for the weights in the save mode scenario.
 - **max_process_num** (int) - Maximum number of processes. Default: 64.
 
+> Note: When loading offline-sliced weights, the distributed strategy of the task must remain unchanged.
+
 ## Weights Format Conversion
 
 ### Converting Ckpt to Safetensors
diff --git a/docs/mindformers/docs/source_zh_cn/feature/ckpt.md b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md
index 45086a55c64c598e0ccbbb4cb216fde064946224..b5b88f30cc1da578f01bd067b9ae39abc7f88b86 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/ckpt.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md
@@ -267,11 +267,11 @@ bash transform_checkpoint.sh \
 
 #### 物理机多机多卡训练
 
-大规模模型通常需要通过多台服务器组成的集群进行训练。在这种多机多卡的场景下,如果服务器之间存在共享盘,则可以使用自动转换功能,否则只能使用离线转换。下面以两台服务器、16卡训练为例进行说明。
+大规模模型通常需要通过多台服务器组成的集群进行训练。在这种多机多卡的场景下,如果服务器之间配置了统一的共享存储路径(如NFS挂载的`/worker`目录),则可以使用自动转换功能,否则只能使用离线转换。下面以两台服务器、16卡训练为例进行说明。
 
-**场景一:服务器之间有共享盘**
+**场景一:服务器之间配置有共享存储路径**
 
-在服务器之间有共享盘的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。假设 `/data` 为服务器的共享盘,且 MindSpore Transformers 的工程代码位于 `/data/mindformers` 路径下。
+在服务器之间配置了统一的共享存储路径(如NFS挂载的`/worker`目录)的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。
 
 - **单进程转换**
 
@@ -332,9 +332,9 @@
 16 8 ${ip} ${port} 1 output/msrun_log False 300
 ```
 
-**场景二:服务器之间无共享盘**
+**场景二:服务器之间无共享路径**
 
-在服务器之间无共享盘的情况下,需要使用离线权重转换工具进行权重转换。以下步骤描述了如何进行离线权重转换,并启动多机多卡训练任务。
+在服务器之间无共享路径的情况下,需要使用离线权重转换工具进行权重转换。以下步骤描述了如何进行离线权重转换,并启动多机多卡训练任务。
 
 - **获取分布式策略文件**
 
diff --git a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
index a6e90303fce50c2adc02ee44a44e26bbe32e0cd8..bf2b08b3f15ca564c82bb97fa164480147447afa 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
@@ -234,11 +234,11 @@ parallel_config: # 配置目标分布式策
 
 #### 物理机多机多卡训练
 
-大规模模型通常需要通过多台服务器组成的集群进行训练。权重切分转换需要依赖编译完成后的目标切分策略文件,在这种多机多卡的场景下,如果服务器之间存在共享盘,生成的策略文件在同一个目录下,则可以使用自动转换功能;如果服务器之间无共享盘,需要手动复制策略文件后再进行转换功能。下面以两台服务器、16卡训练为例进行说明。
+大规模模型通常需要通过多台服务器组成的集群进行训练。权重切分转换需要依赖编译完成后的目标切分策略文件,在这种多机多卡的场景下,如果服务器之间配置了统一的共享存储路径(如NFS挂载的`/worker`目录),生成的策略文件在同一个目录下,则可以使用自动转换功能;如果服务器之间无共享路径,需要手动复制策略文件后再进行转换。下面以两台服务器、16卡训练为例进行说明。
 
-**场景一:服务器之间有共享盘**
+**场景一:服务器之间配置有共享存储路径**
 
-在服务器之间有共享盘的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。假设 `/data` 为服务器的共享盘,且 MindSpore Transformers 的工程代码位于 `/data/mindformers` 路径下。
+在服务器之间配置了统一的共享存储路径(如NFS挂载的`/worker`目录)的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。
 
 **参数配置:**
 
@@ -279,9 +279,9 @@ parallel_config: # 配置16卡分布式策略
 16 8 ${ip} ${port} 1 output/msrun_log False 300
 ```
 
-**场景二:服务器之间无共享盘**
+**场景二:服务器之间无共享路径**
 
-在服务器之间无共享盘的情况下,需要对生成的策略文件进行离线合并和转发操作后再使能在线切分功能。以下步骤描述了如何进行该操作,并启动多机多卡训练任务。
+在服务器之间无共享路径的情况下,需要对生成的策略文件进行离线合并和转发操作后再使能在线切分功能。以下步骤描述了如何进行该操作,并启动多机多卡训练任务。
 
 **1.获取分布式策略**
 
@@ -548,6 +548,8 @@ ms.load_distributed_checkpoint(
 - **dst_safetensors_dir** (str) - 保存模式场景下,权重的保存目录。
 - **max_process_num** (int) - 最大进程数。默认值:64。
 
+> 注:加载离线切分的权重时,任务的分布式策略需要保持不变。
+
 ## 权重格式转换
 
 ### Ckpt转换Safetensors
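
As a companion to the shared-storage wording above, here is a minimal sketch of the per-node parameter configuration such a setup implies. It assumes the usual MindSpore Transformers YAML keys `load_checkpoint`, `auto_trans_ckpt`, and `parallel_config`; the paths under the NFS-mounted `/worker` directory are purely illustrative placeholders, not values taken from the patched pages:

```yaml
# Illustrative sketch only: key names follow the MindSpore Transformers YAML
# style used by these documents; the /worker paths are placeholders.
load_checkpoint: '/worker/checkpoint/model_complete'  # complete weights, visible at the same path on every node
auto_trans_ckpt: True                                 # enable automatic weight conversion before training
parallel_config:                                      # target 16-card slicing strategy
  data_parallel: 2
  model_parallel: 4
  pipeline_stage: 2
```

Every node reads the same configuration, so automatic conversion only works when `/worker` resolves to the same shared storage on all servers; without a shared path, the strategy files must first be merged offline and copied to each server, as both documents describe.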