Post-training quantization is a technique for reducing model size, useful for deploying models on the web and in storage-limited environments such as mobile devices. TensorFlow.js's converter module supports reducing the numeric precision of weights to 16-bit and 8-bit integers after model training is complete, which leads to approximately 50% and 75% reductions in model size, respectively.
The following figure provides an intuitive understanding of the degree to which weight values are discretized under the 16- and 8-bit quantization regimes. The figure is based on a zoomed-in view of a sinusoidal wave.
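The discretization shown in the figure can be sketched as min-max (affine) quantization: the range of the weight values is mapped onto a fixed number of integer levels. The following is a minimal illustration in plain JavaScript; the function names are ours, not part of the TensorFlow.js API, and the converter's exact scheme may differ in detail.

```javascript
// Minimal sketch of 8-bit min-max (affine) weight quantization, the
// general idea behind the converter's 8-bit mode. Illustrative only.
function quantize8bit(weights) {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  // Map [min, max] onto the 256 representable uint8 levels.
  const scale = (max - min) / 255 || 1;  // avoid 0 for constant weights
  const quantized = Uint8Array.from(
      weights, (w) => Math.round((w - min) / scale));
  return {quantized, scale, min};
}

function dequantize({quantized, scale, min}) {
  return Array.from(quantized, (q) => q * scale + min);
}

const weights = [-0.51, -0.02, 0.13, 0.48];
const q = quantize8bit(weights);
const recovered = dequantize(q);
// Each recovered weight is within half a quantization step of the original.
recovered.forEach((r, i) => {
  if (Math.abs(r - weights[i]) > q.scale / 2 + 1e-12) {
    throw new Error('round-trip error too large');
  }
});
```

Sixteen-bit quantization works the same way but with 65536 levels instead of 256, which is why its effect on accuracy is usually negligible.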
This example focuses on how such quantization of weights affects the model's prediction accuracy.
This demo on quantization consists of four examples:

- housing: a multi-layer regressor
- MNIST: a convnet
- Fashion MNIST: a convnet
- MobileNetV2
In the first three demos, quantizing the weights to 16 or 8 bits does not have any significant effect on the accuracy. In the MobileNetV2 demo, however, quantizing the weights to 8 bits leads to a significant deterioration in accuracy, as measured by the top-1 and top-5 accuracies. See example results in the table below:
Dataset and Model | Original (no-quantization) | 16-bit quantization | 8-bit quantization |
---|---|---|---|
housing: multi-layer regressor | MAE=0.311984 | MAE=0.311983 | MAE=0.312780 |
MNIST: convnet | accuracy=0.9952 | accuracy=0.9952 | accuracy=0.9952 |
Fashion MNIST: convnet | accuracy=0.922 | accuracy=0.922 | accuracy=0.9211 |
MobileNetV2 | top-1 accuracy=0.618; top-5 accuracy=0.788 | top-1 accuracy=0.624; top-5 accuracy=0.789 | top-1 accuracy=0.280; top-5 accuracy=0.490 |
MAE stands for mean absolute error (lower is better).

These results demonstrate that the same quantization technique can have very different effects on different problems.
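For reference, the MAE metric reported in the housing row can be computed as follows (a minimal sketch in plain JavaScript):

```javascript
// Mean absolute error: average of |yTrue - yPred| over all examples.
function meanAbsoluteError(yTrue, yPred) {
  return yTrue.reduce((sum, y, i) => sum + Math.abs(y - yPred[i]), 0) /
      yTrue.length;
}

console.log(meanAbsoluteError([1, 2, 3], [1.5, 2, 2]));  // → 0.5
```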
An additional factor affecting the over-the-wire size of models under quantization is the gzip ratio. This factor should be taken into account because gzip is widely used to transmit large files over the web.
Most non-quantized models (i.e.,
models with 32-bit float weights) are not very compressible, due to
the noise-like variation in their weight parameters, which contain
few repeating patterns. The same is true for models with weights
quantized at the 16-bit precision. However, when models are quantized
at the 8-bit precision, there is usually a significant increase in the
gzip compression ratio. The yarn quantize-and-evalute*
commands in
this example (see sections below) not only evaluates accuracy, but also
calculates the gzip compression ratio of model files under different
levels of quantization. The table below summarizes the compression ratios
from the four models covered by this example (higher is better):
gzip compression ratio = (total size of the model.json and weight files) / (size of the gzipped tarball)
Model | Original (no-quantization) | 16-bit quantization | 8-bit quantization |
---|---|---|---|
housing: multi-layer regressor | 1.121 | 1.161 | 1.388 |
MNIST: convnet | 1.082 | 1.037 | 1.184 |
Fashion MNIST: convnet | 1.078 | 1.048 | 1.229 |
MobileNetV2 | 1.085 | 1.063 | 1.271 |
In preparation, do:
yarn
To train and save the model from scratch, do:
yarn train-housing
If you are running on a CUDA-compatible Linux system, you can train using the GPU:
yarn train-housing --gpu
To quantize the model saved in the yarn train-housing step and evaluate the effects on the model's test accuracy, do:
yarn quantize-and-evaluate-housing
In preparation, do:
yarn
To train and save the model from scratch, do:
yarn train-mnist
or with CUDA acceleration:
yarn train-mnist --gpu
To quantize the model saved in the yarn train-mnist step and evaluate the effects on the model's test accuracy, do:
yarn quantize-and-evaluate-mnist
The command also calculates the gzip compression ratio of the model's saved artifacts under the three different levels of quantization (no quantization, 16-bit, and 8-bit).
In preparation, do:
yarn
To train and save the model from scratch, do:
yarn train-fashion-mnist
or with CUDA acceleration:
yarn train-fashion-mnist --gpu
To quantize the model saved in the yarn train-fashion-mnist step and evaluate the effects on the model's test accuracy, do:
yarn quantize-and-evaluate-fashion-mnist
Unlike the previous three demos, the MobileNetV2 demo doesn't involve a model training step. Instead, the model is loaded as a Keras application and converted to the TensorFlow.js format for quantization and evaluation.
The non-quantized and quantized versions of MobileNetV2 are evaluated on a sample of 1000 images from the ImageNet dataset. The image files are downloaded from the hosted location on the web. This subset is based on the sampling done by https://github.com/ajschumacher/imagen.
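The top-1 and top-5 accuracies reported above count a prediction as correct if the true label is among the model's k highest-scoring classes. A minimal sketch of the metric in plain JavaScript (function and variable names are ours, not the example's code):

```javascript
// Top-k accuracy: fraction of examples whose true label appears among
// the k classes with the highest predicted scores.
function topKAccuracy(scoresBatch, labels, k) {
  let hits = 0;
  scoresBatch.forEach((scores, i) => {
    const topK = scores
        .map((score, classIdx) => [score, classIdx])
        .sort((a, b) => b[0] - a[0])
        .slice(0, k)
        .map(([, classIdx]) => classIdx);
    if (topK.includes(labels[i])) hits++;
  });
  return hits / labels.length;
}

// Two examples, three classes: the first true label ranks 2nd by score,
// the second ranks 1st.
const scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]];
const labels = [2, 0];
console.log(topKAccuracy(scores, labels, 1));  // → 0.5
console.log(topKAccuracy(scores, labels, 2));  // → 1
```

The gap between the 8-bit model's top-1 and top-5 numbers shows that after aggressive quantization the correct class often still scores highly, just no longer first.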
All these steps can be performed with a single command:
yarn quantize-and-evaluate-MobileNetV2