Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques, including quantization and sparsity, to compress models. It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized, quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated by Model Optimizer is ready for deployment in downstream inference frameworks like TensorRT-LLM or TensorRT. Further integrations are planned for NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on NVIDIA NIM.
Model Optimizer is available for free to all developers on NVIDIA PyPI. This repository shares examples and GPU-optimized recipes, and collects feedback from the community.
pip install "nvidia-modelopt[all]~=0.15.0" --extra-index-url https://pypi.nvidia.com
See the installation guide for more fine-grained control over the installation.
After installing the NVIDIA Container Toolkit, please run the following commands to build the Model Optimizer example Docker container.
# Build the Docker image
docker/build.sh
# Obtain and start the basic docker image environment.
# The default built docker image is docker.io/library/modelopt_examples:latest
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash
# Check installation
python -c "import modelopt"
Alternatively, for PyTorch, you can use the NVIDIA NGC PyTorch container, which ships with Model Optimizer pre-installed starting from the 24.06 release. Make sure to update Model Optimizer to the latest version if it is not already.
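For example, the following commands start an NGC PyTorch container and update Model Optimizer inside it; the image tag shown is illustrative, so adjust it to the release you use.
# Start the NGC PyTorch container (24.06 and later ship with Model Optimizer pre-installed)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.06-py3
# Inside the container, update Model Optimizer to the latest release
pip install -U "nvidia-modelopt[all]" --extra-index-url https://pypi.nvidia.com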
Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.
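As a minimal PTQ sketch (the config choice is one of several built-in options, and model and calib_dataloader are placeholders for your own model and calibration data):
import modelopt.torch.quantization as mtq

# Choose a quantization format config, e.g. INT8 with SmoothQuant;
# mtq.FP8_DEFAULT_CFG and mtq.INT4_AWQ_CFG are other built-in options.
config = mtq.INT8_SMOOTHQUANT_CFG

# The forward loop runs calibration data through the model so that
# activation ranges can be collected; calib_dataloader is a placeholder.
def forward_loop(model):
    for batch in calib_dataloader:
        model(batch)

# Quantize the model in place and return it, ready for QAT or export.
model = mtq.quantize(model, config, forward_loop)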
Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides the Python API mts.sparsify() to apply weight sparsity to a given model. mts.sparsify() supports the NVIDIA 2:4 sparsity pattern and various sparsification methods, such as NVIDIA ASP and SparseGPT.
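A minimal sketch of how this might look; the mode string and config keys below are assumptions for illustration, so consult the API documentation for the exact signature and supported values:
import modelopt.torch.sparsity as mts

# Sparsify the model weights with the NVIDIA 2:4 pattern using SparseGPT.
# "sparsegpt", the config keys, and calib_dataloader are assumed names here;
# check the Model Optimizer docs for the exact arguments.
model = mts.sparsify(
    model,
    mode="sparsegpt",
    config={"data_loader": calib_dataloader, "collect_func": lambda batch: batch},
)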
Please find the benchmarks here.
Please see the Model Optimizer changelog here.