During network migration, you are advised to debug in PyNative mode first: it supports step-by-step debugging, and log printing is user-friendly. After debugging is complete, switch to graph mode, which delivers better execution performance and can also expose problems at network compilation time, such as gradient truncation caused by third-party operators.
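As a minimal sketch, the mode switch is a single context setting (the exact context keys may vary across MindSpore versions):

```python
import mindspore as ms

# Debug in PyNative mode first; it supports step-by-step execution and printing.
ms.set_context(mode=ms.PYNATIVE_MODE)

# After debugging, switch to graph mode for better execution performance.
# ms.set_context(mode=ms.GRAPH_MODE)
```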
The accuracy debugging process is as follows:
This step checks all parameters and the number of trainable parameters, as well as the shapes of all parameters.
In MindSpore, `Parameter` is used for both trainable and untrainable parameters.
```python
from mindspore import nn

class msNet(nn.Cell):
    def __init__(self):
        super(msNet, self).__init__()
        self.fc = nn.Dense(1, 1, weight_init='normal')

    def construct(self, x):
        output = self.fc(x)
        return output

msnet = msNet()
# Obtain all parameters.
all_parameter = []
for item in msnet.get_parameters():
    all_parameter.append(item)
    print(item.name, item.data.shape)
print(f"all parameter numbers: {len(all_parameter)}")

# Obtain trainable parameters.
trainable_params = msnet.trainable_params()
for item in trainable_params:
    print(item.name, item.data.shape)
print(f"trainable parameter numbers: {len(trainable_params)}")
```
```text
fc.weight (1, 1)
fc.bias (1,)
all parameter numbers: 2
fc.weight (1, 1)
fc.bias (1,)
trainable parameter numbers: 2
```
In PyTorch, `Parameter` is used for trainable parameters, while parameters with `requires_grad=False` or `buffer` entries are used for untrainable parameters.
```python
from torch import nn

class ptNet(nn.Module):
    def __init__(self):
        super(ptNet, self).__init__()
        self.fc = nn.Linear(1, 1)

    def forward(self, x):  # PyTorch modules define forward, not construct
        output = self.fc(x)
        return output

ptnet = ptNet()
all_parameter = []
trainable_params = []
# Obtain network parameters.
for name, item in ptnet.named_parameters():
    if item.requires_grad:
        trainable_params.append(item)
    all_parameter.append(item)
    print(name, item.shape)
for name, buffer in ptnet.named_buffers():
    all_parameter.append(buffer)
    print(name, buffer.shape)
print(f"all parameter numbers: {len(all_parameter)}")
print(f"trainable parameter numbers: {len(trainable_params)}")
```
```text
fc.weight torch.Size([1, 1])
fc.bias torch.Size([1])
all parameter numbers: 2
trainable parameter numbers: 2
```
The parameters of MindSpore and PyTorch are similar except for BatchNorm. Note that MindSpore has no parameter corresponding to `num_batches_tracked`; the `global_step` of the optimizer can be used in its place.
| MindSpore | PyTorch |
| --- | --- |
| gamma | weight |
| beta | bias |
| moving_mean | running_mean |
| moving_variance | running_var |
| - | num_batches_tracked |
The implementation of a model algorithm is framework-independent. The trained parameters can therefore be converted into a MindSpore checkpoint file and loaded into the network for inference verification.
For details about the model verification process, see ResNet Network Migration.
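As an illustration of the BatchNorm name mapping above, the following framework-free sketch renames entries of a PyTorch-style state dict into MindSpore-style parameter names and drops `num_batches_tracked`. The `PT_TO_MS` table and `convert_bn_names` helper are hypothetical names; a real converter must apply the renaming only to parameters that actually belong to BatchNorm layers, wrap the values in `ms.Tensor`, and finish by calling `ms.save_checkpoint` on the resulting list.

```python
# Hypothetical helper: rename PyTorch BatchNorm entries to MindSpore names.
# In a real converter, apply this mapping only to BatchNorm layers.
PT_TO_MS = {
    'weight': 'gamma',
    'bias': 'beta',
    'running_mean': 'moving_mean',
    'running_var': 'moving_variance',
}

def convert_bn_names(pt_state_dict):
    """Return a list of {'name': ..., 'data': ...} dicts in MindSpore style."""
    ms_params = []
    for name, value in pt_state_dict.items():
        prefix, _, leaf = name.rpartition('.')
        if leaf == 'num_batches_tracked':
            continue  # no MindSpore counterpart; global_step plays this role
        leaf = PT_TO_MS.get(leaf, leaf)
        ms_params.append({'name': f"{prefix}.{leaf}" if prefix else leaf,
                          'data': value})
    return ms_params

# Toy state dict standing in for torch_net.state_dict().
pt_state = {'bn1.weight': 1.0, 'bn1.bias': 0.0, 'bn1.running_mean': 0.5,
            'bn1.running_var': 2.0, 'bn1.num_batches_tracked': 10}
for p in convert_bn_names(pt_state):
    print(p['name'])

# The converted list could then be saved with:
# import mindspore as ms
# ms.save_checkpoint(convert_bn_names(pt_state), 'converted.ckpt')
```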
After confirming that the model structures are the same, you are advised to perform inference verification again. Besides the model itself, the complete inference process also involves the dataset and metrics. When inference results are inconsistent, use the control-variable method to narrow down the fault step by step.
For details about the inference verification process, see ResNet Network Migration.
After the inference verification is complete, the basic model, data processing, and metrics calculation are normal. If the training accuracy is still abnormal, how do we locate the fault?
```python
import mindspore as ms
from mindspore import nn

# Model
loss_scale_manager = ms.FixedLossScaleManager(drop_overflow_update=False)  # Static loss scale
# loss_scale_manager = ms.DynamicLossScaleManager()  # Dynamic loss scale

# 1. General process
loss = nn.MSELoss()
opt = nn.Adam(params=msnet.trainable_params(), learning_rate=0.01)
model = ms.Model(network=msnet, loss_fn=loss, optimizer=opt, loss_scale_manager=loss_scale_manager)

# 2. Self-packaged forward network and loss function
msnet.to_float(ms.float16)
loss.to_float(ms.float32)
net_with_loss = nn.WithLossCell(msnet, loss)
# It is recommended that loss_fn be used for the mixed precision of the model.
# Otherwise, the loss part is computed in float16, which may overflow.
model = ms.Model(network=net_with_loss, optimizer=opt)

# 3. Self-packaged training process
scale_sense = nn.FixedLossScaleUpdateCell(1)  # (config.loss_scale) Static loss scale
# scale_sense = nn.DynamicLossScaleUpdateCell(loss_scale_value=config.loss_scale,
#                                             scale_factor=2, scale_window=1000)  # Dynamic loss scale
train_net = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer=opt, scale_sense=scale_sense)
model = ms.Model(network=train_net)
```
```python
import numpy as np
from mindspore import dataset as ds

def get_data(num, w=2.0, b=3.0):
    for _ in range(num):
        x = np.random.uniform(-10.0, 10.0)
        noise = np.random.normal(0, 1)
        y = x * w + b + noise
        yield np.array([x]).astype(np.float32), np.array([y]).astype(np.float32)

def create_dataset(num_data, batch_size=16, repeat_size=1):
    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data', 'label'])
    input_data = input_data.batch(batch_size, drop_remainder=True)
    input_data = input_data.repeat(repeat_size)
    return input_data

train_net.set_train()
dataset = create_dataset(1600)
iterator = dataset.create_tuple_iterator()
for i, data in enumerate(iterator):
    loss, overflow, scaling_sens = train_net(*data)
    print("step: {}, loss: {}, overflow: {}, scale: {}".format(i, loss, overflow, scaling_sens))
```
```text
step: 0, loss: 138.42825, overflow: False, scale: 1.0
step: 1, loss: 118.172104, overflow: False, scale: 1.0
step: 2, loss: 159.14542, overflow: False, scale: 1.0
step: 3, loss: 150.65671, overflow: False, scale: 1.0
...
step: 97, loss: 69.513245, overflow: False, scale: 1.0
step: 98, loss: 51.903114, overflow: False, scale: 1.0
step: 99, loss: 42.250656, overflow: False, scale: 1.0
```
```python
import mindspore as ms

ms.set_seed(1)  # Fixes the random seeds of MindSpore, NumPy, and the dataset module.
                # The random seed of an individual API must be set in that API's attributes.
```
For details about more accuracy debugging policies, see Accuracy Debugging.
The performance tuning directions are as follows:
For details, see ResNet Network Migration.
Some networks are large or contain many flow-control statements, which makes compilation slow in graph mode. During performance tuning, distinguish graph compilation from network execution. This section describes the tuning policies for the network execution phase. If graph compilation is slow, try incremental operator build or report the problem to the MindSpore community.
If a single operator takes a long time and the performance of the same operator varies greatly in different shapes or data types, the problem is caused by the operator performance. The solution is as follows:
If you find an operator with poor performance, you are advised to contact MindSpore community for feedback. We will optimize the operator in time after confirming that the problem is caused by poor performance.
Generally, MindSpore in static graph mode is much faster than in PyNative mode. It is recommended that training and inference be performed in static graph mode.
You only need to set `dataset_sink_mode=True` in `model.train`. Note that this option is `True` by default. When data sinking is enabled, the network result is returned only once per epoch. You are advised to set it to `False` during debugging.
Mixed precision training accelerates the training of deep neural networks by mixing the single-precision and half-precision floating-point formats without compromising network accuracy. It speeds up computation, reduces memory usage and memory access, and makes it possible to train a larger model or batch size on specific hardware.
For details, see Mixed Precision Tutorial.
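To make the overflow/underflow motivation behind loss scaling concrete, the following framework-free sketch round-trips a tiny gradient value through IEEE half precision using Python's `struct` module (`'e'` format). It illustrates the numerics only and is not MindSpore API:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8                       # a tiny gradient value
print(to_fp16(grad))              # 0.0: the value underflows in float16

scale = 1024.0                    # loss scale factor
scaled = to_fp16(grad * scale)    # representable in float16 after scaling
print(scaled)                     # a small nonzero value
print(scaled / scale)             # unscaled in float32, close to 1e-8 again
```

This is exactly why the gradients are computed on a scaled-up loss and divided back by the scale afterwards: values that would vanish in float16 survive the backward pass.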
Graph kernel fusion is a network performance optimization technology unique to MindSpore. It automatically analyzes and optimizes the logic of existing network computational graphs, simplifies and replaces computational graphs, splits and fuses operators, and builds operators for the target hardware to improve the computing resource utilization of devices and optimize overall network performance. Compared with traditional optimization technologies, graph kernel fusion has unique advantages, such as joint optimization of multiple operators across boundaries, cross-layer collaboration with operator compilation, and real-time compilation of operators based on Polyhedral techniques. The entire optimization process is completed automatically once the corresponding configuration is enabled, without requiring extra awareness from network developers, so that users can focus on implementing the network algorithm.
Graph kernel fusion applies to scenarios with high requirements on network execution time, in particular where basic operators are combined into customized combination operators: the basic operators are automatically fused to improve the performance of the customized combination operators.
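As a sketch, graph kernel fusion is typically switched on through the context. The flag below exists in MindSpore 1.x; the exact switch may differ in other versions, so treat this as an assumption to verify against your release:

```python
import mindspore as ms

# Enable graph kernel fusion (Ascend/GPU backends; the flag name is
# version-dependent and may differ in newer MindSpore releases).
ms.set_context(mode=ms.GRAPH_MODE, enable_graph_kernel=True)
```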
If there are too many conversion operators (TransData and Cast operators) and the conversion takes a long time, analyze the necessity of the manually added Cast operator. If the accuracy is not affected, delete the redundant Cast and TransData operators.
If there are too many conversion operators automatically generated by MindSpore, the MindSpore framework may not be fully optimized for some special cases. In this case, contact MindSpore community for feedback.
In dynamic shape scenarios, graphs must be rebuilt continuously, which may cause a long end-to-end training time. You are advised to avoid dynamic shapes.
During distributed training, after forward propagation and gradient calculation are complete in a training step, each machine starts AllReduce gradient synchronization. The AllReduce time is mainly affected by the number of weights and the number of machines: for a more complex network on a larger cluster, the AllReduce gradient update takes longer. In this case, you can split AllReduce into segments to reduce the time consumption.
In normal cases, AllReduce gradient synchronization waits until all backward operators have executed, that is, all gradients are calculated first and then synchronized across machines at once. With AllReduce segmentation, the gradients of one subset of weights are synchronized as soon as they are calculated, so gradient synchronization overlaps with the gradient calculation of the remaining operators, hiding part of the AllReduce time. The shard strategy is usually found by manual trial (more than two shards are supported). Take ResNet-50 as an example: the network has 160 weights, and [85, 160] indicates that gradient synchronization is triggered immediately after the gradients of weights 0 to 85 are calculated, and again after the gradients of weights 86 to 160 are calculated. The network is split into two shards, so gradient synchronization is performed twice. The sample code is as follows:
```python
import os
import mindspore as ms
from mindspore.communication import init

device_id = int(os.getenv('DEVICE_ID', '0'))
rank_size = int(os.getenv('RANK_SIZE', '1'))
rank_id = int(os.getenv('RANK_ID', '0'))

# init context
ms.set_context(mode=ms.GRAPH_MODE, device_target='Ascend', device_id=device_id)
if rank_size > 1:
    ms.set_auto_parallel_context(device_num=rank_size, parallel_mode=ms.ParallelMode.DATA_PARALLEL,
                                 gradients_mean=True)
    ms.set_auto_parallel_context(all_reduce_fusion_config=[85, 160])
    init()
```
For details, see Cluster Performance Profiling.
Single-step performance jitter and a data queue that stays empty for a period of time are both caused by poor performance of data preprocessing: data cannot be produced fast enough to keep up with single-step iteration. The two symptoms usually appear in pairs.
When data processing is slow, the queue, which starts out full, is gradually drained, and the training process begins to wait for it to be refilled; as soon as new data arrives, the network runs another step. With no buffered data left in the queue, jitter in data processing performance is reflected directly in single-step time, which is why single-step performance jitter appears as well.
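The pairing of the two symptoms can be illustrated with a toy, framework-free simulation of the host-side data queue (all rates below are made-up illustration values, not MindSpore measurements): while the queue still holds batches, every step is fast; once the producer falls behind, slow waiting steps appear, which is the step-time jitter described above.

```python
def step_times(produce_rate, queue_cap, n_steps, step_cost=1.0):
    """Toy model: the queue starts full; each training step consumes one
    batch, while the producer refills `produce_rate` batches per unit of
    step time. An empty queue forces the step to wait for the producer."""
    queue = float(queue_cap)
    times = []
    for _ in range(n_steps):
        if queue >= 1.0:
            queue -= 1.0
            t = step_cost                                 # batch ready: fast step
        else:
            t = step_cost + (1.0 - queue) / produce_rate  # wait for the producer
            queue = 0.0
        queue = min(queue + produce_rate * t, float(queue_cap))
        times.append(t)
    return times

# Producer at half the consumption speed: fast steps until the queue
# drains, then slow (waiting) steps start to alternate with fast ones.
print(step_times(produce_rate=0.5, queue_cap=4, n_steps=10))
```

Raising `produce_rate` to at least 1.0 (e.g. by increasing the parallelism of data preprocessing) removes the slow steps entirely, mirroring the real fix.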
For details about data performance problems, see Data Preparation Performance Analysis of MindInsight, which describes common data performance problems and their solutions.
For more performance debugging methods, see Performance Tuning.