Constructing MindSpore Network
This chapter introduces the basic modules needed for training and inference in MindSpore scripting: datasets, network models and loss functions, optimizers, the training process, and the inference process. It also covers some techniques commonly used in network migration, such as network writing specifications, training and inference process templates, and dynamic shape mitigation strategies.
Network Training Principle
The basic principle of network training is shown in the figure above.
The training process of the whole network consists of 5 modules:
- dataset: used to obtain data, containing the input of the network and the labels. MindSpore provides a basic common dataset processing interface, and also supports constructing datasets with Python iterators.
- network: the network model implementation, typically encapsulated with Cell. Declare the required modules and operators in __init__, and implement graph construction in construct.
- loss: the loss function, used to measure the degree of difference between the predicted value and the true value. In deep learning, model training is the process of continuously iterating to shrink the loss value. Defining a good loss function helps the loss converge faster and achieve better precision. MindSpore provides many common loss functions, and you can of course define and implement your own.
- Automatic gradient derivation: generally, the network and loss are encapsulated together as a forward network, and the forward network is given to the automatic gradient derivation module for gradient calculation. MindSpore provides an automatic gradient derivation interface, which shields the user from a large number of derivation details and procedures and greatly lowers the barrier to using the framework. When you need a customized gradient, MindSpore also provides an interface to implement the gradient calculation freely.
- Optimizer: used to calculate and update network parameters during model training. MindSpore provides a number of general-purpose optimizers for users to choose from, and also supports user-defined optimizers. A minimal sketch wiring these five modules together follows this list.
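The sketch below wires the five modules together in the classic Cell-wrapper style. The toy data, the stand-in Dense network, and the hyperparameters are illustrative assumptions, not part of the original text:

import numpy as np
import mindspore.dataset as ds
from mindspore import nn

# dataset: toy in-memory data, assumed for illustration only
data = np.random.randn(100, 32).astype(np.float32)
label = np.random.randint(0, 10, size=(100,)).astype(np.int32)
dataset = ds.NumpySlicesDataset({"data": data, "label": label}, shuffle=False).batch(16)

# network, loss and optimizer (nn.Dense stands in for a real model Cell)
net = nn.Dense(32, 10)
loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# automatic gradient derivation: encapsulate network and loss as the forward
# network, then wrap it with TrainOneStepCell, which computes the gradients
# and lets the optimizer update the parameters in a single step
net_with_loss = nn.WithLossCell(net, loss_fn)
train_net = nn.TrainOneStepCell(net_with_loss, optimizer)
train_net.set_train()

for d, l in dataset.create_tuple_iterator():
    loss = train_net(d, l)  # one forward/backward/update step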
Principles of Network Inference
The basic principles of network inference are shown in the figure above.
The inference process of the whole network consists of 3 modules:
- dataset: used to obtain data, including the input of the network and the labels. Since the entire dataset needs to be inferred during the inference process, the batch size is recommended to be set to 1. If the batch size is not 1, add drop_remainder=False when batching. In addition, the inference process is fixed: loading the same parameters always produces the same inference results, and the inference process should not contain random data augmentation.
- network: the network model implementation, generally encapsulated with Cell. The network structure during inference is generally the same as during training. Note that the Cell is tagged with set_train(False) for inference and set_train(True) for training, just like PyTorch's model.eval() (model evaluation mode) and model.train() (model training mode).
- metrics: when the training task is over, evaluation metrics (Metrics) and evaluation functions are used to assess whether the model works well. Commonly used evaluation metrics include Confusion Matrix, Accuracy, Precision, and Recall. The mindspore.nn module provides the common evaluation functions, and users can also define their own evaluation metrics as needed. A customized Metric needs to inherit the nn.Metric parent class and reimplement its clear, update, and eval methods; a sketch follows this list.
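As a minimal sketch of a customized metric and the inference loop, assuming net and dataset are defined as in the training sketch above:

from mindspore import nn

class MyAccuracy(nn.Metric):
    """Minimal custom accuracy metric: clear/update/eval must be reimplemented."""
    def __init__(self):
        super().__init__()
        self.clear()

    def clear(self):
        self._correct = 0
        self._total = 0

    def update(self, *inputs):
        preds = self._convert_data(inputs[0]).argmax(axis=1)  # logits -> class ids
        labels = self._convert_data(inputs[1])
        self._correct += (preds == labels).sum()
        self._total += labels.shape[0]

    def eval(self):
        return self._correct / self._total

net.set_train(False)  # inference mode, analogous to model.eval() in PyTorch
metric = MyAccuracy()
for d, l in dataset.create_tuple_iterator():
    metric.update(net(d), l)
print("accuracy:", metric.eval())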
Constructing Network
.. toctree::
   :maxdepth: 1

   dataset
   model_and_loss
   learning_rate_and_optimizer
   training_and_gradient
   training_and_evaluation_procession
Note
When doing network migration, we recommend validating inference first, as soon as the network script is complete. This has several benefits:
- Compared with training, the inference process is fixed and can be compared against the reference implementation.
- Compared with training, inference takes relatively little time, enabling rapid verification of the correctness of the network structure and the inference process.
- Trained results have to be validated through the inference process, so the correctness of inference must be ensured first before the training can be proven valid.
Considerations for MindSpore Network Authoring
During MindSpore network implementation, there are some problem-prone areas. When you encounter problems, please prioritize troubleshooting the following situations:
- MindSpore operators used in data processing. Data processing usually runs in multiple threads/processes, so there are limitations on using MindSpore operators in this scenario. It is recommended to use third-party implementations as an alternative in data processing, such as numpy, opencv, pandas, or PIL (a sketch follows at the end of this section).
- Control flow. For details, refer to Flow Control Statements. Compilation in graph mode can be slow when multiple layers of conditional control statements are called.
- Slicing operation. When slicing a Tensor, note whether the subscript of the slice is a variable. When it is a variable, there are restrictions; please refer to network body and loss building for dynamic shape mitigation.
- Customized mixed precision conflicts with amp_level in Model, so don't set amp_level in Model if you use customized mixed precision.
- In the Ascend environment, Conv, Sort, and TopK can only be float16; add a loss scale to avoid overflow (a sketch follows at the end of this section).
- In the Ascend environment, operators with the stride property, such as Conv and Pooling, have rules about the length of the stride, which need to be mitigated.
- In a distributed environment, a seed must be added to ensure that the initialized parameters on each card are consistent (a sketch follows at the end of this section).
- When using a list of Cell or a list of Parameter in the network, convert the list to CellList, SequentialCell, or ParameterTuple in __init__.
# Define the required layers for graph construction in __init__, and don't write it like this:
self.layer = [nn.Conv2d(1, 3, 3), nn.BatchNorm2d(3), nn.ReLU()]
# Encapsulate them as CellList or SequentialCell instead:
self.layer = nn.CellList([nn.Conv2d(1, 3, 3), nn.BatchNorm2d(3), nn.ReLU()])
# Or
self.layer = nn.SequentialCell([nn.Conv2d(1, 3, 3), nn.BatchNorm2d(3), nn.ReLU()])
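As a minimal sketch of the first point above, using numpy instead of MindSpore operators inside the data pipeline (the normalize function and toy data are illustrative assumptions):

import numpy as np
import mindspore.dataset as ds

def normalize(img):
    # plain numpy; safe inside the multi-threaded/multi-process data pipeline
    return ((img - img.mean()) / (img.std() + 1e-6)).astype(np.float32)

images = np.random.rand(10, 28, 28).astype(np.float32)
dataset = ds.NumpySlicesDataset({"image": images}, shuffle=False)
dataset = dataset.map(operations=normalize, input_columns=["image"])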
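For the float16/overflow point, a sketch of attaching a fixed loss scale through Model, reusing net, loss_fn, and optimizer from the earlier training sketch; the scale value 1024 is an arbitrary assumption, and API locations can vary slightly across MindSpore versions:

from mindspore import FixedLossScaleManager
from mindspore.train import Model

# a fixed loss scale keeps float16 gradients from underflowing or overflowing
loss_scale_manager = FixedLossScaleManager(loss_scale=1024.0, drop_overflow_update=False)
model = Model(net, loss_fn=loss_fn, optimizer=optimizer,
              loss_scale_manager=loss_scale_manager, amp_level="O3")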
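And for the distributed-seed point, a minimal sketch (the seed value is arbitrary; init() assumes a distributed launch environment):

import mindspore as ms
from mindspore.communication import init

ms.set_seed(1)  # same global seed on every card -> consistent parameter initialization
init()          # initialize the distributed communication backend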