NVIDIA TensorRT Model Optimizer

A Library to Quantize and Compress Deep Learning Models for Optimized Inference on GPUs

Examples | Benchmark Results | Documentation

Model Optimizer Overview

Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques, including quantization and sparsity, to compress models. It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized, quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated by Model Optimizer is ready for deployment in downstream inference frameworks like TensorRT-LLM or TensorRT. Further integrations are planned for NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on NVIDIA NIM.
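
For a flavor of the Python API, here is a minimal PTQ sketch on a toy model. The layer sizes, calibration data, and chosen config are illustrative only; mtq.quantize and the named format configs are the documented entry points, but verify them against the documentation for your installed version.

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-in for a real network; any torch.nn.Module works.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
calib_data = [torch.randn(8, 128) for _ in range(16)]

def forward_loop(m):
    # Feed representative data through the model so ModelOpt can
    # calibrate the quantizer ranges for weights and activations.
    with torch.no_grad():
        for batch in calib_data:
            m(batch)

# Insert quantizers per the chosen format config and run calibration.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# The quantized model can then be exported (e.g. via modelopt.torch.export
# for supported LLMs) into a checkpoint consumable by TensorRT-LLM.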

Model Optimizer is available for free to all developers on NVIDIA PyPI. This repository is for sharing examples and GPU-optimized recipes, as well as for collecting feedback from the community.

Installation

PIP

pip install "nvidia-modelopt[all]~=0.15.0" --extra-index-url https://pypi.nvidia.com

See the installation guide for more fine-grained control over the installation.

Docker

After installing the NVIDIA Container Toolkit, run the following commands to build the Model Optimizer example Docker container.

# Build the Docker image
docker/build.sh

# Obtain and start the basic docker image environment.
# The default built docker image is docker.io/library/modelopt_examples:latest
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash

# Check installation
python -c "import modelopt"

Alternatively, for PyTorch you can use the NVIDIA NGC PyTorch container, which comes with Model Optimizer pre-installed starting from the 24.06 release. Make sure to update Model Optimizer to the latest version if it is not already.

Techniques

Quantization

Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization through easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.
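
As a sketch of how PTQ extends to QAT: mtq.quantize returns a regular torch.nn.Module with fake-quantization ops in the graph, so QAT is ordinary fine-tuning afterwards. The toy model, data, and loss below are placeholders, and the training step assumes the straight-through gradient behavior described in the docs.

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
data = [torch.randn(8, 128) for _ in range(16)]

def forward_loop(m):
    with torch.no_grad():
        for batch in data:
            m(batch)

# PTQ first: insert fake-quant ops and calibrate their ranges.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# QAT: fine-tune as usual; the quantizers stay in the graph, so the
# weights adapt to the quantization noise during training.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for batch in data:
    loss = model(batch).square().mean()  # placeholder loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()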

Sparsity

Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides the Python API mts.sparsify() to apply weight sparsity to a given model. mts.sparsify() supports the NVIDIA 2:4 sparsity pattern and various sparsification methods, such as NVIDIA ASP and SparseGPT.
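
A minimal sketch of applying 2:4 weight sparsity on a toy model. The mode names follow the ModelOpt docs as recalled ("sparse_magnitude" for ASP-style magnitude pruning, "sparsegpt" for SparseGPT, the latter additionally requiring calibration data in its config), so verify them against your installed version.

import torch.nn as nn
import modelopt.torch.sparsity as mts

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))

# Prune each weight tensor to the NVIDIA 2:4 pattern (at most 2 nonzeros
# in every group of 4 elements) using magnitude-based selection.
model = mts.sparsify(model, mode="sparse_magnitude")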

Examples

  • PTQ for LLMs covers how to use post-training quantization (PTQ) and export to TensorRT-LLM for deployment of popular pre-trained models from frameworks like Hugging Face, NVIDIA NeMo, and Megatron-LM.
  • PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export it to ONNX, and deploy it with TensorRT. The Diffusers example in this repo is complementary to the demoDiffusion example in the TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
  • QAT for LLMs demonstrates the recipe and workflow for quantization-aware training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or 4-bit on the NVIDIA Blackwell platform).
  • Sparsity for LLMs shows how to perform post-training sparsification and sparsity-aware fine-tuning on a pre-trained Hugging Face model.
  • ONNX PTQ shows how to quantize ONNX models in INT4 or INT8 mode; a minimal sketch follows this list. The examples also include the deployment of quantized ONNX models with TensorRT.
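
A hedged sketch of the Python entry point for ONNX PTQ. The module path modelopt.onnx.quantization.quantize matches the docs, but the exact parameter names, the placeholder file paths, and the input shape below are assumptions to verify against your installed version (e.g. via python -m modelopt.onnx.quantization --help).

import numpy as np
from modelopt.onnx.quantization import quantize

# Random calibration batch shaped like the model input; use real data
# in practice so the quantizer ranges are meaningful.
calib = np.random.rand(32, 3, 224, 224).astype(np.float32)

quantize(
    onnx_path="model.onnx",        # FP32 source model (placeholder path)
    quantize_mode="int8",          # "int4" is the other mode noted above
    calibration_data=calib,
    output_path="model.quant.onnx",
)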

Support Matrix

Please refer to the support matrix in the Model Optimizer documentation.
Benchmark

Please find the benchmarks here.

Release Notes

Please see the Model Optimizer changelog here.

SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: MIT

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
