OpenAI Triton Plugin in TensorRT-LLM

This document describes how to build and run a custom plugin leveraging OpenAI Triton in TensorRT-LLM. The workflow can be summarized as follows.

  1. Implement a kernel using Triton in Python.
  2. Compile that kernel using Triton AoT (Ahead-of-Time) compilation tool to generate C files.
  3. Implement a custom TensorRT-LLM plugin to execute the compiled kernel.
  4. Build the TensorRT engine.
  5. Execute the engine with TensorRT.

In this example, we show how to create a TensorRT-LLM plugin to wrap a Fused Attention kernel implemented in OpenAI Triton. As a prerequisite, it is necessary to have the TensorRT-LLM C++ runtime library. The instructions to build that library can be found here.

1. Triton AoT Preparation

OpenAI Triton offers an Ahead-of-Time (AoT) compilation tool to generate C files that wrap the compiled GPU kernel. To use the AoT feature, you need a Triton version newer than commit d0c35b3; this example has been tested with commit b43c28f.

git clone https://github.com/openai/triton
cd triton/python/
pip install cmake && pip install -e .
cd -

For AoT compilation, it is necessary to provide a kernel signature and specify the values of tl.constexpr parameters in a comma-separated format. Details can be found in the compile.py file in the Triton project.

Here are example AoT compilation commands for the Fused Attention kernel.

# Kernel for data type=float16, BLOCK_M=128, BLOCK_DMODEL=64, BLOCK_N=128
mkdir -p aot/fp16
python triton/python/triton/tools/compile.py \
    fmha_triton.py \
    -n fused_attention_kernel \
    -o aot/fp16/fmha_kernel_d64_fp16 \
    --out-name fmha_d64_fp16 \
    -w 4 \
    -s "*fp16:16, *fp32:16, *fp32:16, *fp16:16, *fp16:16, *fp16:16, fp32, i32, i32, i32, 128, 64, 128" \
    -g "(seq_len + 127) / 128, batch_size * num_heads, 1"
# Kernel for data type=float32, BLOCK_M=64, BLOCK_DMODEL=64, BLOCK_N=64
mkdir -p aot/fp32
python triton/python/triton/tools/compile.py \
    fmha_triton.py \
    -n fused_attention_kernel \
    -o aot/fp32/fmha_kernel_d64_fp32 \
    --out-name fmha_d64_fp32 \
    -w 4 \
    -s "*fp32:16, *fp32:16, *fp32:16, *fp32:16, *fp32:16, *fp32:16, fp32, i32, i32, i32, 64, 64, 64" \
    -g "(seq_len + 63) / 64, batch_size * num_heads, 1"

# Link generated headers and create dispatchers.
python triton/python/triton/tools/link.py aot/fp16/*.h -o aot/fmha_kernel_fp16
python triton/python/triton/tools/link.py aot/fp32/*.h -o aot/fmha_kernel_fp32

The tool will generate .c and .h files to launch the GPU kernel. Note that it is necessary to specify the kernel name using the --out-name option; it defines the dispatcher names for the different data types. The above invocations will generate aot/fmha_kernel_{fp16|fp32}.{c|h} files that contain three functions:

  • the load_fmha_d64_{fp16|fp32} function to load the code of the GPU kernel,
  • the fmha_d64_{fp16|fp32} function to launch the kernel,
  • the unload_fmha_d64_{fp16|fp32} function to unload the GPU kernel.
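
To make the roles of these three functions concrete, here is a minimal standalone sketch that calls them directly through the CUDA driver API, outside of any TensorRT plugin. The dispatcher signature and grid expression are taken from this example (see also the generated fp16 dispatcher shown in the Known Issues section); the L/M buffer shapes and the sm_scale value are illustrative assumptions, so check fmha_triton.py for the exact expectations. Compile it together with the .c files produced by compile.py and link.py.

// Minimal sketch (not the plugin code): driving the AoT-generated fp16 kernel.
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>
#include "aot/fmha_kernel_fp16.h"   // generated by link.py in the step above

int main(void) {
    const int batch_size = 8, num_heads = 32, head_size = 64, seq_len = 512;
    const size_t qkv_elems  = (size_t)batch_size * num_heads * seq_len * head_size;
    const size_t stat_elems = (size_t)batch_size * num_heads * seq_len;  // assumed L/M shape

    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr Q, K, V, Out, L, M;
    cuMemAlloc(&Q,   qkv_elems * sizeof(uint16_t));   // fp16 tensors
    cuMemAlloc(&K,   qkv_elems * sizeof(uint16_t));
    cuMemAlloc(&V,   qkv_elems * sizeof(uint16_t));
    cuMemAlloc(&Out, qkv_elems * sizeof(uint16_t));
    cuMemAlloc(&L,   stat_elems * sizeof(float));     // fp32 softmax statistics
    cuMemAlloc(&M,   stat_elems * sizeof(float));

    load_fmha_d64_fp16();                             // load the compiled GPU code

    // Grid matches the -g expression used at AoT-compilation time:
    // "(seq_len + 127) / 128, batch_size * num_heads, 1".
    unsigned int gX = (seq_len + 127) / 128;
    unsigned int gY = batch_size * num_heads;
    float sm_scale = 0.125f;                          // illustrative: 1/sqrt(head_size)
    CUresult err = fmha_d64_fp16(NULL, gX, gY, 1,
                                 Out, L, M, Q, K, V, sm_scale, seq_len);
    if (err != CUDA_SUCCESS) fprintf(stderr, "kernel launch failed: %d\n", (int)err);

    unload_fmha_d64_fp16();                           // unload the GPU kernel
    cuMemFree(Q); cuMemFree(K); cuMemFree(V);
    cuMemFree(Out); cuMemFree(L); cuMemFree(M);
    cuCtxDestroy(ctx);
    return err == CUDA_SUCCESS ? 0 : 1;
}

The TritonFlashAttentionPlugin described in the next section wraps the same three calls inside the TensorRT plugin lifecycle.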

2. Implement a Custom TensorRT Plugin

This section describes how to implement a custom plugin for TensorRT-LLM that executes the Triton kernel created in the previous section. We provide an example plugin implementation:

  • TritonFlashAttentionPlugin(.cpp, .h): TensorRT plugin.
  • plugin.py: Python wrapper.

TritonFlashAttentionPlugin is a TensorRT plugin that integrates a Triton kernel generated with the AoT compiler. The initialize and terminate functions show how to initialize and terminate the TensorRT plugin. The enqueue member function shows how to call the generated Triton kernel on the GPU. Note that the name of the Triton kernel depends on the function's signature: different types or specializations lead to different kernel names. Thus, if you change the -s <signature> option during AoT compilation, you also have to update the file names in CMakeLists.txt to match the names generated by the AoT compiler.

To build a shared library for the custom Triton plugin, run:

mkdir -p build && cd build
cmake .. && make
cd ..

As mentioned in the previous section, it is necessary to have the TensorRT-LLM C++ runtime library. If you want to specify the library paths explicitly, run:

cmake -DTRT_LIB_DIR=</path/to/trt_lib> -DTRT_INCLUDE_DIR=</path/to/trt_headers> -DTRT_LLM_LIB_DIR=</path/to/trt_llm_lib> -DTRT_LLM_INCLUDE_DIR=</path/to/trt_llm_headers> ..

If the build is successful, you should be able to find a shared library for the custom plugin at build/libtrt_llm_custom_plugins.so.

A Python wrapper of the Fused Multihead Attention (FMHA) operator and the corresponding TensorRT-LLM layer are implemented in plugin.py. They are similar to other TensorRT-LLM operators and layers implemented in functional.py and layers, respectively. The FMHA operator uses the custom plugin that wraps the functions generated from the Triton kernel.

3. Build and Run the TensorRT Engine

We are now ready to build and run the TensorRT engine that uses the Triton kernel. Here are the two commands to build and run the engine:

python build.py --num_heads 32 --head_size 64 --max_batch_size 8 --max_seq_len 512 --dtype float16
python run.py --num_heads 32 --head_size 64 --batch_size 8 --seq_len 512 --log_level verbose --benchmark

4. Known Issues

1. A generated dispatcher might silently skip kernel execution due to a missing fallback branch.

The kernel dispatcher generated by link.py is missing a fallback branch, so it can return without executing any kernel and without reporting an error. For instance, in our example, the generated dispatcher looks like this:

CUresult fmha_d64_fp16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len){
  if ((Out % 16 == 0) && (L % 16 == 0) && (M % 16 == 0) && (Q % 16 == 0) && (K % 16 == 0) && (V % 16 == 0))
    return fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(stream, gX, gY, gZ, Out, L, M, Q, K, V, sm_scale, seq_len);
}

It is recommended to manually update the functions generated by link.py so that they return an appropriate error code when no specialization matches, for example as shown below.
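
One possible manual fix is to keep the generated branch and add a fallback return; CUDA_ERROR_INVALID_VALUE is one reasonable choice, so that an unmatched argument combination surfaces as an error instead of falling off the end of the function:

CUresult fmha_d64_fp16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len){
  if ((Out % 16 == 0) && (L % 16 == 0) && (M % 16 == 0) && (Q % 16 == 0) && (K % 16 == 0) && (V % 16 == 0))
    return fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(stream, gX, gY, gZ, Out, L, M, Q, K, V, sm_scale, seq_len);
  return CUDA_ERROR_INVALID_VALUE;   // no specialization matches the given arguments
}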

2. The shared memory required by a generated kernel may exceed the hardware limit.

The AoT compiler does not check the shared memory size limit at compilation time, which can lead to out-of-resource errors at runtime. It is therefore helpful to verify whether the dynamic shared memory size required by a generated kernel exceeds the hardware limit. You can find the number in the cuLaunchKernel call in the generated .c file. For instance, the shared memory size is 114690 bytes in our example.

CUresult fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len) {
    if (fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func == NULL)
       load_fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67();
    void *args[8] = { &Out, &L, &M, &Q, &K, &V, &sm_scale, &seq_len };
    // TODO: shared memory
    if(gX * gY * gZ > 0)
      return cuLaunchKernel(fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func, gX, gY, gZ, 4 * 32, 1, 1, 114690, stream, args, NULL);
}

If the required size exceeds the device limit, the issue may be resolved by reducing the block sizes (for example, BLOCK_M and BLOCK_N) used at AoT-compilation time.
