ModelArts is a one-stop AI development platform provided by HUAWEI CLOUD. It integrates the Ascend AI Processor resource pool. Developers can experience MindSpore on this platform.
ResNet-50 is used as an example to describe how to use MindSpore to complete a training task on ModelArts.
Create an account, configure ModelArts, and create an Object Storage Service (OBS) bucket by referring to the "Preparations" section of the ModelArts tutorial at https://support.huaweicloud.com/wtsnew-modelarts/index.html.
You can apply to join the beta testing program of the ModelArts Ascend Compute Service.
ModelArts uses OBS to store data. Therefore, before starting a training job, you need to upload the data to OBS. The CIFAR-10 dataset in binary format is used as an example.
Download and decompress the CIFAR-10 dataset.
Download the CIFAR-10 dataset from http://www.cs.toronto.edu/~kriz/cifar.html. Among the three dataset versions provided on the page, select the CIFAR-10 binary version.
Create an OBS bucket (for example, ms-dataset), create a data directory (for example, cifar-10) in the bucket, and upload the CIFAR-10 data to the data directory according to the following structure.
└─Object storage/ms-dataset/cifar-10
├─train
│ data_batch_1.bin
│ data_batch_2.bin
│ data_batch_3.bin
│ data_batch_4.bin
│ data_batch_5.bin
│
└─eval
test_batch.bin
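The upload can be done through the OBS console. Alternatively, if you work in an environment where the MoXing library is available (for example, a ModelArts notebook), a minimal sketch of the same upload is shown below; the local source path is hypothetical, and the bucket and directory names are the examples above:
import moxing as mox
# copy the local CIFAR-10 binaries (hypothetical local path) to the OBS data directory created above
mox.file.copy_parallel(src_url='/home/ma-user/work/cifar-10', dst_url='s3://ms-dataset/cifar-10')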
Create an OBS bucket (for example, resnet50-train), create a code directory (for example, resnet50_cifar10_train) in the bucket, and upload all scripts in the following directory to the code directory:
The scripts at https://gitee.com/mindspore/docs/tree/r1.5/docs/sample_code/sample_for_cloud/ use ResNet-50 to train the CIFAR-10 dataset and validate the accuracy after training is complete.
The scripts can be used for training on ModelArts with either the 1*Ascend or the 8*Ascend specification. Note that the script version must match the MindSpore version selected in "Creating a Training Job." For example, if you use the scripts provided for MindSpore 1.1, select MindSpore 1.1 when creating the training job.
To facilitate subsequent training job creation, you need to create a training output directory and a log output directory. The directory structure created in this example is as follows:
└─Object storage/resnet50-train
├─resnet50_cifar10_train
│ dataset.py
│ resnet.py
│ resnet50_train.py
│
├─output
└─log
The scripts provided in the "Preparing for Script Execution" section can run on ModelArts as-is. If you only want to experience training CIFAR-10 with ResNet-50, skip this section. If you need to run custom MindSpore scripts or other MindSpore sample code on ModelArts, perform the following simple adaptations on the MindSpore code:
Set data_url and train_url. They are necessary for running the script on ModelArts, corresponding to the data storage path (an OBS path) and the training output path (an OBS path), respectively.
import argparse
parser = argparse.ArgumentParser(description='ResNet-50 train.')
parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
ModelArts allows you to pass arguments to the configuration options in the script. For details, see "Creating a Training Job."
parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
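Note that the full scripts below parse arguments with parse_known_args() rather than parse_args(); presumably this is so that any arguments ModelArts passes beyond the ones you define are collected instead of raising a parsing error:
args_opt, unknown = parser.parse_known_args()
# args_opt.epoch_size holds the value passed on the job creation page, or the default of 90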
MindSpore does not provide APIs for directly accessing OBS data, so you need to use the APIs provided by MoXing to interact with OBS. ModelArts training scripts are executed in containers, and the /cache directory is generally used to store container data. HUAWEI CLOUD MoXing provides various APIs (see https://github.com/huaweicloud/ModelArts-Lab/tree/master/docs/moxing_api_doc); in this example, only the copy_parallel API is used.
Download the data stored in OBS to an execution container.
import moxing as mox
mox.file.copy_parallel(src_url='s3://dataset_url/', dst_url='/cache/data_path')
Upload the training output from the container to OBS.
import moxing as mox
mox.file.copy_parallel(src_url='/cache/output_path', dst_url='s3://output_url/')
To run the scripts in the 8*Ascend environment, you need to adapt the dataset creation code and the local data path, and configure a distributed policy. By reading the environment variables DEVICE_ID and RANK_SIZE, you can build training scripts that work in both the 1*Ascend and 8*Ascend environments.
Adapt a local path.
import os
device_num = int(os.getenv('RANK_SIZE'))
device_id = int(os.getenv('DEVICE_ID'))
# define local data path
local_data_path = '/cache/data'
if device_num > 1:
    # define distributed local data path
    local_data_path = os.path.join(local_data_path, str(device_id))
Adapt datasets.
import os
import mindspore.dataset as ds
device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))
if device_num == 1:
    # create train data for the 1*Ascend case
    dataset = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
else:
    # shard the train data across devices for the 8*Ascend case
    dataset = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                                num_shards=device_num, shard_id=device_id)
Configure a distributed policy.
import os
from mindspore import context
from mindspore.context import ParallelMode
device_num = int(os.getenv('RANK_SIZE'))
if device_num > 1:
    context.set_auto_parallel_context(device_num=device_num,
                                      parallel_mode=ParallelMode.DATA_PARALLEL,
                                      gradients_mean=True)
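Note that setting the auto-parallel context alone does not start multi-device communication. The complete sample scripts also initialize the communication service (HCCL on Ascend) with init(); a minimal sketch of the combined setup, following the pattern in the sample scripts:
import os
from mindspore import context
from mindspore.communication.management import init
from mindspore.context import ParallelMode
device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))
context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=device_id)
if device_num > 1:
    context.set_auto_parallel_context(device_num=device_num,
                                      parallel_mode=ParallelMode.DATA_PARALLEL,
                                      gradients_mean=True)
    # initialize HCCL communication across the devices
    init()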
Perform simple adaptation on the MindSpore script based on the preceding three points. The following pseudocode is used as an example:
Original MindSpore script:
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset as ds

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        dataset = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        dataset = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                                    num_shards=device_num, shard_id=device_id)
    return dataset

def resnet50_train(args):
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_dataset = create_dataset(args.local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--local_data_path', required=True, default=None, help='Location of data.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()
    resnet50_train(args_opt)
Adapted MindSpore script:
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset as ds
# adapt to cloud: used for downloading data
import moxing as mox

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        dataset = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        dataset = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                                    num_shards=device_num, shard_id=device_id)
    return dataset

def resnet50_train(args):
    # adapt to cloud: define local data path
    local_data_path = '/cache/data'
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # adapt to cloud: define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))
    # adapt to cloud: download data from OBS to the local container path
    print('Download data.')
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_path)
    train_dataset = create_dataset(local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    # adapt to cloud: get OBS data path
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    # adapt to cloud: get OBS training output path
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()
    resnet50_train(args_opt)
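The adapted script above downloads the dataset but does not yet use train_url. To persist checkpoints or other results, write them to a local container path during training and copy that path to OBS at the end, mirroring the upload snippet shown earlier (local_output_path is a hypothetical container directory):
import moxing as mox
# adapt to cloud: upload training outputs from the container to the OBS output path
local_output_path = '/cache/output'
mox.file.copy_parallel(src_url=local_output_path, dst_url=args_opt.train_url)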
Create a training job to run the MindSpore script. The following provides step-by-step instructions for creating a training job on ModelArts.
Click Console on the HUAWEI CLOUD ModelArts home page at https://www.huaweicloud.com/product/modelarts.html.
The ModelArts tutorial at https://support.huaweicloud.com/engineers-modelarts/modelarts_23_0238.html shows how to create a training job using a common framework.
The training scripts and data in this tutorial are used as an example to describe how to configure the settings on the training job creation page.
Algorithm Source: Click Frameworks, and then select Ascend-Powered-Engine and the required MindSpore version (Mindspore-0.5-python3.7-aarch64 is used as an example here; use the scripts corresponding to the selected version).
Code Directory: Select the code directory created in an OBS bucket, and set Startup File to the startup script in the code directory.
Data Source: Click Data Storage Path and enter the CIFAR-10 dataset path in OBS.
Argument: Set data_url and train_url to the values of Data Storage Path and Training Output Path, respectively. Click the add icon to pass values to other arguments in the script, for example, epoch_size.
Resource Pool: Click Public Resource Pool > Ascend.
Specification: Select Ascend: 1 * Ascend 910 CPU: 24-core 96 GiB or Ascend: 8 * Ascend 910 CPU: 192-core 768 GiB, which indicate the single-node single-device and single-node eight-device specifications, respectively.
You can view run logs on the Training Jobs page.
The 8*Ascend specification is used to execute the ResNet-50 training job: the total number of epochs is 92, the accuracy is about 92%, and about 12,000 images are trained per second.
The 1*Ascend specification is used to execute the ResNet-50 training job: the total number of epochs is 92, the accuracy is about 95%, and about 1,800 images are trained per second.
If you specify a log path when creating a training job, you can download log files from OBS and view them.
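For example, assuming the log directory created earlier in the resnet50-train bucket and an environment where MoXing is available, the logs can be copied from OBS to a local path for inspection:
import moxing as mox
# copy the job logs from the OBS log directory to a local directory
mox.file.copy_parallel(src_url='s3://resnet50-train/log', dst_url='./log')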