TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Architecture   |   Results   |   Examples   |   Documentation


TensorRT-LLM Overview

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
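To make the idea of Tensor Parallelism concrete, here is a minimal sketch in plain Python. It is not TensorRT-LLM code: the helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical, and the "ranks" are simulated in a single process. It only shows that splitting a weight matrix by columns and concatenating the partial outputs gives the same result as the full multiplication.

```python
# Toy illustration of column-wise tensor parallelism (single process, no GPUs).

def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, world_size):
    """Give each simulated rank a contiguous slice of the columns of w."""
    n = len(w[0])
    assert n % world_size == 0, "columns must divide evenly across ranks"
    step = n // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]

def column_parallel_matmul(x, w, world_size):
    """Each rank multiplies x by its shard; concatenation plays the role of a gather."""
    out = []
    for shard in split_columns(w, world_size):
        out.extend(matmul(x, shard))
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, world_size=2)
print(full)     # [11.0, 14.0, 17.0, 20.0]
print(sharded)  # same values, computed shard by shard
```

In a real multi-GPU setup each shard would live on a different device and the concatenation would be a collective communication step; this sketch leaves that part out.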

The Python API of TensorRT-LLM is designed to look similar to the PyTorch API. It provides users with a functional module containing functions like einsum, softmax, matmul or view. The layers module bundles useful building blocks to assemble LLMs, such as an Attention block, an MLP or the entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. See below for a list of supported models.

To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
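As a rough illustration of what INT8 weight-only quantization means, the sketch below quantizes a small weight matrix to integers in [-127, 127] with one scale per column and then dequantizes it. This is an assumption-laden toy in plain Python, not the kernels or the exact scheme used by TensorRT-LLM; the function names are hypothetical.

```python
# Symmetric per-column INT8 quantization of a weight matrix (toy example).

def quantize_int8_per_column(w):
    """Return (integer weights, per-column scales) for a list-of-rows matrix."""
    cols = len(w[0])
    scales = []
    for j in range(cols):
        max_abs = max(abs(row[j]) for row in w)
        # Avoid a zero scale for an all-zero column.
        scales.append(max_abs / 127.0 if max_abs > 0 else 1.0)
    q = [[max(-127, min(127, round(row[j] / scales[j]))) for j in range(cols)]
         for row in w]
    return q, scales

def dequantize(q, scales):
    """Recover approximate floating-point weights from integers and scales."""
    return [[value * scales[j] for j, value in enumerate(row)] for row in q]

w = [[0.5, -1.27],
     [-0.25, 0.635]]
q, scales = quantize_int8_per_column(w)
w_hat = dequantize(q, scales)
max_err = max(abs(a - b) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb))
print(q, scales, max_err)
```

With round-to-nearest, the reconstruction error of each weight is bounded by half of its column scale, which is why a per-column (rather than per-tensor) scale tends to keep the error small.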

For a more detailed presentation of the software architecture and the key concepts used in TensorRT-LLM, we recommend reading the architecture documentation.

Installation

After installing the NVIDIA Container Toolkit, x86_64 users can install TensorRT-LLM by running the following commands.

# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

# Check installation
python3 -c "import tensorrt_llm"
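The import check above can be complemented by a version lookup that does not import the package. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `tensorrt_llm` is taken from the pip command above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("tensorrt_llm")
print("tensorrt_llm:", version if version else "not installed")
```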

Developers who need the best performance, have debugging needs, or use the aarch64 architecture should refer to the instructions for building from source code.

For Windows installation, see the Windows instructions.

Quick Start

Please be sure to complete the installation steps before proceeding with the following steps.

To create a TensorRT engine for an existing model, there are 3 steps:

  1. Download pre-trained weights,
  2. Build a fully-optimized engine of the model,
  3. Deploy the engine, in other words, run the fully-optimized model.

The following sections show how to use TensorRT-LLM to run the BLOOM-560m model.

0. In the BLOOM folder

Inside the Docker container, you have to install the requirements:

pip install -r examples/bloom/requirements.txt
git lfs install

1. Download the model weights from HuggingFace

From the BLOOM example folder, you must download the weights of the model.

cd examples/bloom
rm -rf ./bloom/560M
mkdir -p ./bloom/560M && git clone https://huggingface.co/bigscience/bloom-560m ./bloom/560M

2. Build the engine

# Single GPU on BLOOM 560M
python convert_checkpoint.py --model_dir ./bloom/560M/ \
                --dtype float16 \
                --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/
# If trtllm-build is not found, add it to PATH, for example: export PATH=/usr/local/bin:$PATH
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/ \
                --gemm_plugin float16 \
                --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/

See the BLOOM example for more details and options regarding the trtllm-build command.
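If you prefer to drive the two build commands from a script, the following sketch assembles them as argument lists. It uses only the flags shown above; the helper names `bloom_build_commands` and `run_all` are hypothetical, and the default is a dry run that only prints the commands.

```python
import shlex
import subprocess

def bloom_build_commands(model_dir, ckpt_dir, engine_dir, dtype="float16"):
    """Return the checkpoint-conversion and engine-build commands as argument lists."""
    convert = ["python", "convert_checkpoint.py",
               "--model_dir", model_dir,
               "--dtype", dtype,
               "--output_dir", ckpt_dir]
    build = ["trtllm-build",
             "--checkpoint_dir", ckpt_dir,
             "--gemm_plugin", dtype,
             "--output_dir", engine_dir]
    return [convert, build]

def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the command fails

commands = bloom_build_commands(
    "./bloom/560M/",
    "./bloom/560M/trt_ckpt/fp16/1-gpu/",
    "./bloom/560M/trt_engines/fp16/1-gpu/",
)
run_all(commands)  # dry run: prints the two commands without executing them
```

Passing `dry_run=False` from inside the examples/bloom folder would run the same two steps as the shell commands above.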

3. Run

The ../summarize.py script can be used to summarize articles from the CNN/DailyMail dataset:

python ../summarize.py --test_trt_llm \
                       --hf_model_dir ./bloom/560M/ \
                       --data_type fp16 \
                       --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/

More details about the script and how to run the BLOOM model can be found in the example folder. Many more models than BLOOM are implemented in TensorRT-LLM. They can be found in the examples directory.

Beyond local execution, you can also use the NVIDIA Triton Inference Server to create a production-ready deployment of your LLM as described in this blog.

Support Matrix

TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide a list of supported GPU architectures as well as important features implemented in TensorRT-LLM.

Devices

TensorRT-LLM supports the following architectures:

TensorRT-LLM is expected to work on all GPUs based on the Volta, Turing, Ampere, Hopper, and Ada Lovelace architectures. Certain limitations may apply.

Precision

Various numerical precisions are supported in TensorRT-LLM. Support for some of those numerical features requires specific architectures:

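| Architecture        | FP32 | FP16 | BF16 | FP8 | INT8  | INT4  |
|---------------------|------|------|------|-----|-------|-------|
| Volta (SM70)        | Y    | Y    | N    | N   | Y (1) | Y (2) |
| Turing (SM75)       | Y    | Y    | N    | N   | Y (1) | Y (2) |
| Ampere (SM80, SM86) | Y    | Y    | Y    | N   | Y     | Y (3) |
| Ada-Lovelace (SM89) | Y    | Y    | Y    | Y   | Y     | Y     |
| Hopper (SM90)       | Y    | Y    | Y    | Y   | Y     | Y     |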

(1) INT8 SmoothQuant is not supported on SM70 and SM75.
(2) INT4 AWQ and GPTQ are not supported on SM < 80.
(3) INT4 AWQ and GPTQ with FP8 activations require SM >= 89.
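The precision table above can be encoded as a small lookup, which is handy when a script has to pick a data type for a given GPU. This is only a sketch: the names `PRECISIONS`, `SUPPORT` and `is_supported` are hypothetical, and the caveats in footnotes (1) to (3) are not modeled.

```python
# Y/N entries copied from the precision table; footnote caveats are NOT modeled.
PRECISIONS = ("FP32", "FP16", "BF16", "FP8", "INT8", "INT4")

SUPPORT = {
    70: "YYNNYY",  # Volta
    75: "YYNNYY",  # Turing
    80: "YYYNYY",  # Ampere
    86: "YYYNYY",  # Ampere
    89: "YYYYYY",  # Ada Lovelace
    90: "YYYYYY",  # Hopper
}

def is_supported(sm, precision):
    """Return True if the table lists the precision as supported on this SM version.

    Unknown SM versions return False; an unknown precision name raises ValueError.
    """
    row = SUPPORT.get(sm)
    if row is None:
        return False
    return row[PRECISIONS.index(precision)] == "Y"

print(is_supported(80, "BF16"))  # True
print(is_supported(86, "FP8"))   # False
```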

In this release of TensorRT-LLM, the support for FP8 and quantized data types (INT8 or INT4) is not implemented for all the models. See the precision document and the examples folder for additional details.

Key Features

TensorRT-LLM contains examples that implement the following features.

  • Multi-head Attention (MHA)
  • Multi-query Attention (MQA)
  • Group-query Attention (GQA)
  • In-flight Batching
  • Paged KV Cache for the Attention
  • Tensor Parallelism
  • Pipeline Parallelism
  • INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
  • SmoothQuant
  • GPTQ
  • AWQ
  • FP8
  • Greedy-search
  • Beam-search
  • RoPE

In this release of TensorRT-LLM, some of the features are not enabled for all the models listed in the examples folder.
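One practical difference between MHA, GQA and MQA in the list above is the size of the KV cache, since K and V are stored once per key/value head. The sketch below works through that arithmetic for a hypothetical model (32 layers, 32 query heads, head size 128, FP16 cache); the numbers and function names are illustrative assumptions, not measurements of any model shipped with TensorRT-LLM.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: K and V, for every layer and every KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head to the KV head it shares (consecutive heads form a group)."""
    assert num_q_heads % num_kv_heads == 0
    return q_head // (num_q_heads // num_kv_heads)

layers, q_heads, head_dim = 32, 32, 128
mha = kv_cache_bytes_per_token(layers, num_kv_heads=32, head_dim=head_dim)
gqa = kv_cache_bytes_per_token(layers, num_kv_heads=8, head_dim=head_dim)
mqa = kv_cache_bytes_per_token(layers, num_kv_heads=1, head_dim=head_dim)
print(mha // 1024, gqa // 1024, mqa // 1024)  # 512 128 16 (KiB per token)

# With 32 query heads and 8 KV heads, query heads 0-3 share KV head 0, and so on.
print([kv_head_for_query_head(h, 32, 8) for h in (0, 3, 4, 31)])  # [0, 0, 1, 7]
```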

Models

The list of supported models is:

Note: Encoder-Decoder provides general encoder-decoder functionality that supports many encoder-decoder models such as the T5 family, BART family, Whisper family, NMT family, etc. We unroll the exact model names in the list above to help users find specific models more easily.

The list of supported multi-modal models is:

Note: Multi-modal provides general multi-modal functionality that supports many multi-modal architectures such as the BLIP family, LLaVA family, etc. We unroll the exact model names in the list above to help users find specific models more easily.

Performance

Please refer to the performance page for performance numbers. That page contains measured numbers for four variants of popular models (GPT-J, LLAMA-7B, LLAMA-70B, Falcon-180B), measured on the H100, L40S and A100 GPU(s).

Advanced Topics

Quantization

This document describes the different quantization methods implemented in TensorRT-LLM and contains a support matrix for the different models.

In-flight Batching

TensorRT-LLM supports in-flight batching of requests (also known as continuous batching or iteration-level batching). It's a technique that aims at reducing wait times in queues, eliminating the need for padding requests and allowing for higher GPU utilization.
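The benefit of in-flight batching can be seen in a toy simulation that counts decoding steps. In the sketch below each request needs a given number of output tokens, one token per step; static batching waits for the whole batch to finish, while in-flight batching refills a slot as soon as a request completes. This is a simplified model with hypothetical function names, not the TensorRT-LLM batch manager.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def inflight_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue at every step (lengths must be >= 1)."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request produces one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps

lengths = [8, 2, 2, 2, 8, 2]
print(static_batching_steps(lengths, batch_size=2))    # 18
print(inflight_batching_steps(lengths, batch_size=2))  # 14
```

In this example the short requests no longer wait behind the long ones, so the same work finishes in 14 steps instead of 18.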

Attention

TensorRT-LLM implements several variants of the Attention mechanism that appears in most Large Language Models. This document summarizes those implementations and how they are optimized in TensorRT-LLM.

Graph Rewriting

TensorRT-LLM uses a declarative approach to define neural networks and contains techniques to optimize the underlying graph. For more details, please refer to the graph rewriting documentation.

Benchmark

TensorRT-LLM provides C++ and Python tools to perform benchmarking. Note, however, that it is recommended to use the C++ version.

Troubleshooting

  • If you encounter accuracy issues in the generated text, you may want to increase the internal precision in the attention layer. To do so, pass --context_fmha_fp32_acc enable to trtllm-build.

  • It's recommended to add the options --shm-size=1g --ulimit memlock=-1 to the docker or nvidia-docker run command. Otherwise, you may see NCCL errors when running multi-GPU inference. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#errors for details.

  • When building models, memory-related issues such as

[09/23/2023-03:13:00] [TRT] [E] 9: GPTLMHeadModel/layers/0/attention/qkv/PLUGIN_V2_Gemm_0: could not find any supported formats consistent with input/output data types
[09/23/2023-03:13:00] [TRT] [E] 9: [pluginV2Builder.cpp::reportPluginError::24] Error Code 9: Internal Error (GPTLMHeadModel/layers/0/attention/qkv/PLUGIN_V2_Gemm_0: could not find any supported formats consistent with input/output data types)

may happen. One possible solution is to reduce the amount of memory needed by reducing the maximum batch size, input and output lengths. Another option is to enable plugins, for example: --gpt_attention_plugin.

  • MPI + Slurm

TensorRT-LLM is an MPI-aware package that uses mpi4py. If you are running scripts in a Slurm environment, you might encounter interference such as:

--------------------------------------------------------------------------
PMI2_Init failed to initialize.  Return code: 14
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------

As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm node, prefix your commands with mpirun -n 1 to run TensorRT-LLM in a dedicated MPI environment, not the one provided by your Slurm allocation.

For example: mpirun -n 1 python3 examples/run.py ...
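The rule of thumb above can be automated in a launcher script. The sketch below prefixes a command with mpirun -n 1 when it detects a Slurm allocation through the SLURM_JOB_ID environment variable; the helper name `wrap_for_slurm` is hypothetical, and using SLURM_JOB_ID as the detection signal is an assumption of this example.

```python
import os

def wrap_for_slurm(cmd, env=None):
    """Prefix cmd with 'mpirun -n 1' when running inside a Slurm allocation."""
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" in env and cmd[:1] != ["mpirun"]:
        return ["mpirun", "-n", "1"] + list(cmd)
    return list(cmd)

cmd = ["python3", "examples/run.py"]
print(wrap_for_slurm(cmd, env={"SLURM_JOB_ID": "42"}))  # prefixed with mpirun -n 1
print(wrap_for_slurm(cmd, env={}))                      # unchanged outside Slurm
```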

Release notes

  • TensorRT-LLM requires TensorRT 9.2 and the 23.12 containers.

Change Log

Version 0.8.0

  • Model Support
    • Phi-1.5/2.0
    • Mamba support (see examples/mamba/README.md)
      • The support is limited to beam width = 1 and single-node single-GPU
    • Nougat support (see examples/multimodal/README.md#nougat)
    • Qwen-VL support (see examples/qwenvl/README.md)
    • RoBERTa support, thanks to the contribution from @erenup
    • Skywork model support
    • Add example for multimodal models (BLIP with OPT or T5, LLaVA)
  • Features
    • Chunked context support (see docs/source/gpt_attention.md#chunked-context)
    • LoRA support for C++ runtime (see docs/source/lora.md)
    • Medusa decoding support (see examples/medusa/README.md)
      • The support is limited to the Python runtime on Ampere or newer GPUs with FP16 and BF16 precision, and the temperature parameter of the sampling configuration should be 0
    • StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
    • Support for batch manager to return logits from context and/or generation phases
      • Include support in the Triton backend
    • Support AWQ and GPTQ for QWEN
    • Support ReduceScatter plugin
    • Support for combining repetition_penalty and presence_penalty #274
    • Support for frequency_penalty #275
    • OOTB functionality support:
      • Baichuan
      • InternLM
      • Qwen
      • BART
    • LLaMA
      • Support enabling INT4-AWQ along with FP8 KV Cache
      • Support BF16 for weight-only plugin
    • Baichuan
      • P-tuning support
      • INT4-AWQ and INT4-GPTQ support
    • Decoder iteration-level profiling improvements
    • Add masked_select and cumsum functions for modeling
    • Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
    • Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
    • Support FP16 fMHA on NVIDIA V100 GPU
  • API
    • Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
    • [BREAKING CHANGES] Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/checkpoint.md)
    • [BREAKING CHANGES] Deprecate LayerNorm and RMSNorm plugins and remove the corresponding build parameters
    • [BREAKING CHANGES] Remove optional parameter maxNumSequences for GPT manager
  • Bug fixes
    • Fix the abnormal first token when --gather_all_token_logits is enabled #639
    • Fix LLaMA with LoRA enabled build failure #673
    • Fix InternLM SmoothQuant build failure #705
    • Fix Bloom int8_kv_cache functionality #741
    • Fix crash in gptManagerBenchmark #649
    • Fix Blip2 build error #695
    • Add pickle support for InferenceRequest #701
    • Fix Mixtral-8x7b build failure with custom_all_reduce #825
    • Fix INT8 GEMM shape #935
    • Minor bug fixes
  • Performance
    • [BREAKING CHANGES] Increase default freeGpuMemoryFraction parameter from 0.85 to 0.9 for higher throughput
    • [BREAKING CHANGES] Disable enable_trt_overlap argument for GPT manager by default
    • Performance optimization of beam search kernel
    • Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
    • Custom AllReduce plugins performance optimization
    • Top-P sampling performance optimization
    • LoRA performance optimization
    • Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
    • Integrate XQA kernels for GPT-J (beamWidth=4)
  • Documentation
    • Batch manager arguments documentation updates
    • Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
    • Add documentation for Falcon AWQ support (See examples/falcon/README.md)
    • Update to the docs/source/checkpoint.md documentation
    • Update AWQ INT4 weight only quantization documentation for GPT-J
    • Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
    • Refine TensorRT-LLM backend README structure #133
    • Typo fix #739

For the historical change log, please see CHANGELOG.md.

Known Issues

  • On Windows, running the context FMHA plugin with FP16 accumulation on LLaMA, Mistral and Phi models suffers from poor accuracy, and the resulting inference output may be garbled. As a workaround, enable FP32 accumulation when building the models by passing the options --context_fmha disable --context_fmha_fp32_acc enable to the trtllm-build command. This should be fixed in the next version.

  • The hang reported in issue #149 has not been reproduced by the TensorRT-LLM team. If it is caused by a bug in TensorRT-LLM, that bug may still be present in this release.

Report Issues

You can use GitHub issues to report issues with TensorRT-LLM.

License

TensorRT-LLM is distributed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
