Deep learning technologies are used in a growing number of applications on mobile and edge devices. Take mobile phones as an example: to provide user-friendly and intelligent services, deep learning functions are integrated into operating systems and applications. However, these functions involve training or inference and rely on large models and weight files. The original weight file of AlexNet already exceeds 200 MB, and newer models are developing toward more complex structures with even more parameters. Because the hardware resources of a mobile or edge device are limited, models need to be simplified, and quantization technology is used to solve this problem.
Quantization is the process of approximating the floating-point weights of a model, which take continuous values (or a large number of possible discrete values), or the tensor data flowing through the model, with a limited (relatively small) number of discrete fixed-point values (usually INT8), at a relatively low loss of inference accuracy. In other words, 32-bit floating-point data is approximately represented with fewer bits, while the input and output of the model remain floating-point. In this way, the model size and memory usage are reduced, model inference is accelerated, and power consumption is lowered.
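As a concrete illustration, the following is a minimal NumPy sketch of the asymmetric affine scheme commonly used for INT8 quantization (the scale and zero-point formulation described in reference [1]). It is not MindSpore API; the helper names are chosen for this example only.

```python
import numpy as np

def quant_params(x, num_bits=8):
    # Map the observed range [x.min(), x.max()] onto the integer
    # range [0, 2^num_bits - 1].
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)  # guard against constant tensors
    zero_point = int(round(qmin - x.min() / scale))         # integer that represents 0.0
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.5, 0.0, 0.73, 2.1], dtype=np.float32)
scale, zp = quant_params(x)
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
print(x_hat)  # close to x, but carries a small quantization error
```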
As described above, low-precision data types such as FP16, INT8, and INT4 occupy less space than FP32. Replacing a high-precision type with a low-precision type greatly reduces storage space and transmission time; for example, 60 million weights occupy about 240 MB in FP32 but only about 60 MB in INT8. Low-bit computing also delivers higher performance: compared with FP32, INT8 achieves a three-fold or even higher speedup, and for the same computation INT8 has an obvious advantage in power consumption.
Currently, there are two types of quantization solutions in the industry: quantization aware training and post-training quantization.
A fake quantization node is a node inserted into the network during quantization aware training. It is used to find the distribution of the network data and to feed the accuracy loss back into training. Its specific functions are as follows:

- Find the distribution of the network data, that is, the minimum and maximum values of the parameters and activations to be quantized.
- Simulate the accuracy loss of quantizing to low-bit values, apply the loss to the network, and propagate it to the loss function so that the optimizer can reduce it during training.
MindSpore quantization aware training replaces high-precision data with low-precision data to simplify the model during training. This process inevitably causes an accuracy loss; therefore, fake quantization nodes are used to simulate the loss, and backpropagation is used to learn to reduce it. MindSpore adopts the solution in reference [1] for the quantization of weights and data.
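The following minimal sketch (plain NumPy, not MindSpore API) shows what the forward pass of such a fake quantization node does. In practice the backward pass typically uses the straight-through estimator, which treats the rounding operation as the identity when propagating gradients.

```python
import numpy as np

def fake_quant_forward(x, num_bits=8):
    # Observe the data distribution (min/max), quantize to num_bits
    # integers, then immediately dequantize. The output stays floating
    # point but now carries the rounding error of low-bit quantization,
    # so the loss function "sees" that error and training can adapt to it.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)
```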
Quantization aware training specifications
| Specification | Description |
| --- | --- |
| Hardware | Supports hardware platforms based on the GPU or Ascend 910 AI processor. |
| Network | Supports networks such as LeNet and ResNet50. For details, see https://gitee.com/mindspore/mindspore/tree/r0.6/model_zoo. |
| Algorithm | Supports symmetric and asymmetric quantization algorithms in MindSpore fake quantization training. |
| Solution | Supports 4-, 7-, and 8-bit quantization solutions. |
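To make the distinction between the symmetric and asymmetric algorithms in the table concrete, the following NumPy fragment (illustrative only, not MindSpore API) shows how each family derives its parameters for 8-bit quantization:

```python
import numpy as np

x = np.array([-0.4, 0.1, 1.2], dtype=np.float32)

# Symmetric: the range is forced to be symmetric around zero, so the
# zero point is fixed at 0 and only a scale is needed (signed range [-127, 127]).
scale_sym = np.abs(x).max() / 127.0

# Asymmetric: the exact [min, max] range is mapped onto [0, 255], which
# requires both a scale and a zero point (as in the earlier sketch).
scale_asym = (x.max() - x.min()) / 255.0
zero_point = int(round(-x.min() / scale_asym))
```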
The procedure for training a quantization aware model is the same as that for common training. After the network is defined and the model is generated, additional operations need to be performed. The complete process is as follows:

1. Process data and load the dataset.
2. Define the network.
3. Define the fusion network.
4. Define the optimizer and loss function.
5. Train the fusion network to generate a fusion model.
6. Convert the fusion model into a quantization network.
7. Perform quantization training to generate a quantization model.
Compared with common training, quantization aware training requires the additional steps 3, 6, and 7 in the preceding process. The terms involved are defined as follows:
- Fusion network: network obtained after the specified operators (`nn.Conv2dBnAct` and `nn.DenseBnAct`) are used for replacement.
- Fusion model: model in the checkpoint format generated by training the fusion network.
- Quantization network: network obtained after fake quantization nodes are inserted into the fusion model by the conversion API (`convert_quant_network`).
- Quantization model: model in the checkpoint format obtained by training the quantization network.
Next, the LeNet network is used as an example to describe steps 3 and 6.
You can obtain the complete executable sample code at https://gitee.com/mindspore/mindspore/tree/r0.6/model_zoo/lenet_quant.
Define a fusion network and replace the specified operators.
- Use the `nn.Conv2dBnAct` operator to replace the three operators `nn.Conv2d`, `nn.BatchNorm2d`, and `nn.ReLU` in the original network model.
- Use the `nn.DenseBnAct` operator to replace the three operators `nn.Dense`, `nn.BatchNorm2d`, and `nn.ReLU` in the original network model.

Even if the `nn.Dense` and `nn.Conv2d` operators are not followed by `nn.BatchNorm2d` and `nn.ReLU`, the preceding two replacements must still be performed as required.
The definition of the original network model is as follows:
```python
import mindspore.nn as nn

class LeNet5(nn.Cell):
    def __init__(self, num_class=10):
        super(LeNet5, self).__init__()
        self.num_class = num_class
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        self.bn1 = nn.BatchNorm2d(6)
        self.act1 = nn.ReLU()
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.bn2 = nn.BatchNorm2d(16)
        self.act2 = nn.ReLU()
        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
        self.act3 = nn.ReLU()
        self.fc3 = nn.Dense(84, self.num_class)
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.act1(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.act2(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.act3(x)
        x = self.fc2(x)
        x = self.act3(x)
        x = self.fc3(x)
        return x
```
The following shows the fusion network after operators are replaced:
```python
import mindspore.nn as nn

class LeNet5(nn.Cell):
    def __init__(self, num_class=10):
        super(LeNet5, self).__init__()
        self.num_class = num_class
        self.conv1 = nn.Conv2dBnAct(1, 6, kernel_size=5, batchnorm=True, activation='relu')
        self.conv2 = nn.Conv2dBnAct(6, 16, kernel_size=5, batchnorm=True, activation='relu')
        self.fc1 = nn.DenseBnAct(16 * 5 * 5, 120, activation='relu')
        self.fc2 = nn.DenseBnAct(120, 84, activation='relu')
        self.fc3 = nn.DenseBnAct(84, self.num_class)
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.conv1(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x
```
Use the `convert_quant_network` API to automatically insert fake quantization nodes into the fusion model and convert it into a quantization network.
```python
from mindspore.train.quant import quant as qat

# quant_delay: number of steps to train at full precision before quantization starts
# bn_fold: whether to fold batch normalization into the preceding convolution
# freeze_bn: step after which the batch normalization statistics are frozen
# weight_bits / act_bits: bit widths used for weights and activations
net = qat.convert_quant_network(net, quant_delay=0, bn_fold=False, freeze_bn=10000,
                                weight_bits=8, act_bits=8)
```
The preceding describes quantization aware training from scratch. A more common case is converting an existing model file into a quantization model. The model file and training script obtained through common network training can be reused for quantization aware training. To use a checkpoint file for retraining, perform the following steps:
1. Process data and load datasets.
2. Define a network.
3. Define a fusion network.
4. Define an optimizer and loss function.
5. Load a model file and retrain the model: load the existing model file and retrain it based on the fusion network to generate a fusion model. For details, see <https://www.mindspore.cn/tutorial/en/r0.6/use/saving_and_loading_model_parameters.html#id6>.
6. Generate a quantization network.
7. Perform quantization training.
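As a hedged sketch of steps 5 and 6, assuming the fusion network `LeNet5` defined earlier and a checkpoint file named `lenet.ckpt` (a placeholder name) produced by common training:

```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.train.quant import quant as qat

net = LeNet5(num_class=10)                  # the fusion network defined above
param_dict = load_checkpoint("lenet.ckpt")  # checkpoint from common training (placeholder name)
load_param_into_net(net, param_dict)        # step 5: load parameters into the fusion network
# ... retrain the fusion network here to generate the fusion model ...

# Step 6: insert fake quantization nodes to obtain the quantization network.
net = qat.convert_quant_network(net, quant_delay=0, bn_fold=False, freeze_bn=10000,
                                weight_bits=8, act_bits=8)
# Step 7: train `net` again to generate the quantization model.
```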
Inference using a quantization model is the same as common model inference. It can be performed by directly using the checkpoint file or by converting the checkpoint file into a common model format (such as ONNX or GEIR).
For details, see https://www.mindspore.cn/tutorial/en/r0.6/use/multi_platform_inference.html.
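For reference, the generic export flow for a trained network looks like the following sketch; the input shape assumes the 32 x 32 LeNet input, the file name is a placeholder, and `net` is assumed to already hold trained parameters:

```python
import numpy as np
from mindspore import Tensor
from mindspore.train.serialization import export

# Build a dummy input with the network's expected shape (assumption: LeNet on 32x32 images).
input_data = Tensor(np.ones([1, 1, 32, 32], dtype=np.float32))
export(net, input_data, file_name="lenet.onnx", file_format="ONNX")
```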
To use a checkpoint file obtained after quantization aware training for inference, convert the checkpoint file into a common model format such as ONNX and then perform inference. (This conversion function is coming soon.)
[1] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2704-2713.
[2] Krishnamoorthi R. Quantizing deep convolutional networks for efficient inference: A whitepaper[J]. arXiv preprint arXiv:1806.08342, 2018.