This tutorial describes how to train the ResNet-50 network using MindSpore data parallelism and automatic parallelism on the GPU hardware platform.
The CIFAR-10 dataset is used as an example. The method of downloading and loading the dataset is the same as that for the Ascend 910 AI processor.
The method of downloading and loading the dataset: https://www.mindspore.cn/tutorials/experts/en/r1.7/parallel/train_ascend.html
OpenMPI-4.0.3: multi-process communication library used by MindSpore.
Download the OpenMPI-4.0.3 source code package openmpi-4.0.3.tar.gz from https://www.open-mpi.org/software/ompi/v4.0/.
For details about how to install OpenMPI, see the official tutorial: https://www.open-mpi.org/faq/?category=building#easy-build.
Password-free login between hosts (required for multi-host training). If multiple hosts are involved in the training, you need to configure password-free login between them. The procedure is as follows:
Run the ssh-keygen -t rsa -P "" command to generate a key.
Run the ssh-copy-id DEVICE-IP command to set the IP address of the host that requires password-free login.
Run the ssh DEVICE-IP command. If you can log in without entering the password, the configuration is successful.
On the GPU hardware platform, MindSpore parallel distributed training uses NCCL for communication.
On the GPU platform, MindSpore does not support the following operations: get_local_rank, get_local_size, get_world_rank_from_group_rank, get_group_rank_from_world_rank, and create_group.
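Because get_local_rank and get_local_size are not available on GPU, a script that needs the per-host rank has to obtain it some other way. The following is a minimal sketch, assuming the processes are launched by Open MPI's mpirun, which exports OMPI_COMM_WORLD_* environment variables to each process. The helper name ompi_rank_info is hypothetical and is not part of MindSpore.

```python
import os


def ompi_rank_info(env=None):
    """Read the rank and size values that Open MPI's mpirun exports.

    Hypothetical helper, not a MindSpore API. `env` defaults to os.environ
    and can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env

    def read(name, default):
        # Fall back to single-process defaults when not launched by mpirun.
        return int(env.get(name, default))

    return {
        "world_rank": read("OMPI_COMM_WORLD_RANK", 0),
        "world_size": read("OMPI_COMM_WORLD_SIZE", 1),
        "local_rank": read("OMPI_COMM_WORLD_LOCAL_RANK", 0),
        "local_size": read("OMPI_COMM_WORLD_LOCAL_SIZE", 1),
    }


if __name__ == "__main__":
    print(ompi_rank_info())
```

When run outside mpirun, the sketch reports a single process with rank 0, so the same script can still be started directly for debugging.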
The sample code for calling NCCL is as follows:
from mindspore import context
from mindspore.communication import init

if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    init("nccl")
    ...
In the preceding code:
mode=context.GRAPH_MODE: sets the running mode to graph mode for distributed training. (The PyNative mode does not support parallel running.)
init("nccl"): enables NCCL communication and completes the distributed training initialization.
On the GPU hardware platform, the network definition is the same as that for the Ascend 910 AI processor.
For details about the definitions of the network, optimizer, and loss function, see https://www.mindspore.cn/tutorials/experts/en/r1.7/parallel/train_ascend.html.
On the GPU hardware platform, MindSpore uses OpenMPI mpirun for distributed training.
The following takes the distributed training script for eight devices as an example to describe how to run the script:
Obtain the running script of the example from:
https://gitee.com/mindspore/docs/blob/r1.7/docs/sample_code/distributed_training/run_gpu.sh
If the script is executed by the root user, the --allow-run-as-root parameter must be added to mpirun.
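The root-user rule can be illustrated with a short Python sketch that assembles the mpirun argument list and adds --allow-run-as-root only when needed. This is an illustration under stated assumptions, not part of run_gpu.sh: build_mpirun_command is a hypothetical helper, and the remaining arguments mirror the eight-device command used in this tutorial.

```python
import os
import shlex


def build_mpirun_command(num_procs, script, as_root=None):
    """Return the mpirun argument list for launching `script` under pytest.

    Hypothetical helper. If `as_root` is None, the current effective user
    is checked; pass True or False to override the check.
    """
    if as_root is None:
        # os.geteuid exists only on Unix-like systems.
        as_root = hasattr(os, "geteuid") and os.geteuid() == 0
    cmd = ["mpirun"]
    if as_root:
        # mpirun refuses to run as root unless this flag is given.
        cmd.append("--allow-run-as-root")
    cmd += ["-n", str(num_procs), "pytest", "-s", "-v", script]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_mpirun_command(8, "./resnet50_distributed_training.py")))
```

The returned list could be passed to subprocess.run, or printed and pasted into a shell script.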
#!/bin/bash
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_gpu.sh DATA_PATH"
echo "For example: bash run_gpu.sh /path/dataset"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
DATA_PATH=$1
export DATA_PATH=${DATA_PATH}
rm -rf device
mkdir device
cp ./resnet50_distributed_training.py ./resnet.py ./device
cd ./device
echo "start training"
mpirun -n 8 pytest -s -v ./resnet50_distributed_training.py > train.log 2>&1 &
The script runs in the background. The log file is saved in the device directory. Training runs for 10 epochs of 234 steps each, and the loss values are saved in train.log. The loss values output by the grep command are as follows:
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
epoch: 1 step: 1, loss is 2.3025854
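To inspect the loss values programmatically instead of with grep, a small parser along the following lines could be used. It is a sketch that assumes every loss line has exactly the "epoch: N step: M, loss is X" shape shown above. The names LOSS_LINE and parse_losses are hypothetical.

```python
import re

# Matches lines such as "epoch: 1 step: 1, loss is 2.3025854".
LOSS_LINE = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([-+0-9.eE]+)")


def parse_losses(lines):
    """Return (epoch, step, loss) tuples for every matching log line."""
    records = []
    for line in lines:
        match = LOSS_LINE.search(line)
        if match:
            epoch, step, loss = match.groups()
            records.append((int(epoch), int(step), float(loss)))
    return records


if __name__ == "__main__":
    sample = [
        "epoch: 1 step: 1, loss is 2.3025854",
        "some unrelated log line",
    ]
    print(parse_losses(sample))
```

To apply it to a real run, pass an open file object for train.log in the device directory as lines. With eight processes writing to the same log, each step appears once per process.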
If multiple hosts are involved in the training, you need to set the multi-host configuration in the mpirun command. You can use the -H option in the mpirun command. For example, mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 python hello.py indicates that eight processes are started on each of the hosts whose IP addresses are DEVICE1_IP and DEVICE2_IP. Alternatively, you can create a hostfile similar to the following and pass its path to the --hostfile option of mpirun. Each line in the hostfile is in the format of [hostname] slots=[slotnum], where hostname can be an IP address or a host name.
DEVICE1 slots=8
DEVICE2 slots=8
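Since the value given to -n should match the slots declared in the hostfile, it can help to check the file before launching. The following sketch parses the [hostname] slots=[slotnum] format described above and sums the slots. The names parse_hostfile and total_slots are hypothetical, and the sketch deliberately rejects lines without slots= instead of guessing a default.

```python
def parse_hostfile(text):
    """Parse hostfile text into a list of (hostname, slots) tuples."""
    hosts = []
    for raw in text.splitlines():
        # Drop "#" comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        slots = None
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        if slots is None:
            # Assumption of this sketch: every host must declare its slots.
            raise ValueError("missing slots= on line: " + raw)
        hosts.append((fields[0], slots))
    return hosts


def total_slots(hosts):
    """Sum the slots over all hosts, i.e. the process count for -n."""
    return sum(slots for _, slots in hosts)


if __name__ == "__main__":
    hosts = parse_hostfile("DEVICE1 slots=8\nDEVICE2 slots=8\n")
    print(hosts, total_slots(hosts))
```

For the two-host example above, the total is 16, which matches the -n 16 value used in the multi-host command.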
The following is the execution script of the 16-device two-host cluster. The variables DATA_PATH and HOSTFILE need to be passed in, indicating the dataset path and hostfile path. For details about more mpirun options, see the OpenMPI official website.
#!/bin/bash
DATA_PATH=$1
HOSTFILE=$2
rm -rf device
mkdir device
cp ./resnet50_distributed_training.py ./resnet.py ./device
cd ./device
echo "start training"
mpirun -n 16 --hostfile $HOSTFILE -x DATA_PATH=$DATA_PATH -x PATH -mca pml ob1 pytest -s -v ./resnet50_distributed_training.py > train.log 2>&1 &
When running on GPU, the model parameters can be saved and loaded by referring to Distributed Training Model Parameters Saving and Loading.