This document describes how to build and run a custom plugin leveraging OpenAI Triton in TensorRT-LLM. The workflow can be summarized as follows: compile the Triton kernel ahead of time, wrap the generated launcher in a TensorRT-LLM plugin, and build and run a TensorRT engine that uses that plugin.
In this example, we show how to create a TensorRT-LLM plugin that wraps a Fused Attention kernel implemented in OpenAI Triton. As a prerequisite, it is necessary to have the TensorRT-LLM C++ runtime library. The instructions to build that library can be found here.
OpenAI Triton offers an Ahead-of-Time (AoT) compilation tool to generate C files that wrap compiled GPU kernels. To use the AoT feature, you need a Triton version newer than the d0c35b3 commit; this example has been tested with the b43c28f commit.
git clone https://github.com/openai/triton
cd triton/python/
pip install cmake && pip install -e .
cd -
For AoT compilation, it is necessary to provide a kernel signature and specify the values of tl.constexpr parameters in a comma-separated format.
Details can be found in the compile.py file in the Triton project.
Here are example AoT compilation commands for the Fused Attention kernel.
# Kernel for data type=float16, BLOCK_M=128, BLOCK_DMODEL=64, BLOCK_N=128
mkdir -p aot/fp16
python triton/python/triton/tools/compile.py \
fmha_triton.py \
-n fused_attention_kernel \
-o aot/fp16/fmha_kernel_d64_fp16 \
--out-name fmha_d64_fp16 \
-w 4 \
-s "*fp16:16, *fp32:16, *fp32:16, *fp16:16, *fp16:16, *fp16:16, fp32, i32, i32, i32, 128, 64, 128" \
-g "(seq_len + 127) / 128, batch_size * num_heads, 1"
# Kernel for data type=float32, BLOCK_M=64, BLOCK_DMODEL=64, BLOCK_N=64
mkdir -p aot/fp32
python triton/python/triton/tools/compile.py \
fmha_triton.py \
-n fused_attention_kernel \
-o aot/fp32/fmha_kernel_d64_fp32 \
--out-name fmha_d64_fp32 \
-w 4 \
-s "*fp32:16, *fp32:16, *fp32:16, *fp32:16, *fp32:16, *fp32:16, fp32, i32, i32, i32, 64, 64, 64" \
-g "(seq_len + 63) / 64, batch_size * num_heads, 1"
# Link generated headers and create dispatchers.
python triton/python/triton/tools/link.py aot/fp16/*.h -o aot/fmha_kernel_fp16
python triton/python/triton/tools/link.py aot/fp32/*.h -o aot/fmha_kernel_fp32
The tool will generate .c and .h files to launch the GPU kernel.
Note that it is necessary to specify the kernel name using the --out-name option; it defines the dispatcher names for the different data types.
The above invocations will generate aot/fmha_kernel_{fp16|fp32}.{c|h} files that contain three functions:
- load_fmha_d64_{fp16|fp32}: loads the code of the GPU kernel,
- fmha_d64_{fp16|fp32}: launches the kernel,
- unload_fmha_d64_{fp16|fp32}: unloads the GPU kernel.

This section describes how to implement a custom plugin for TensorRT-LLM to execute the Triton kernel created in the previous section. We provide an example of a plugin implementation.
TritonFlashAttentionPlugin is a TensorRT plugin that integrates a Triton kernel generated with the AoT compiler. The initialize and terminate functions show how to initialize and terminate the TensorRT plugin. The enqueue member function shows how to call the generated Triton kernel on the GPU.
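For illustration, here is a minimal sketch (not the actual plugin source) of how the three generated functions could be used from C++, assuming the fp16 kernel built above. The helper names and the header path are hypothetical; the grid dimensions mirror the -g option passed to compile.py.
#include <cuda.h>

// Header produced by link.py in this example (path assumed from the commands above).
#include "aot/fmha_kernel_fp16.h"

// Called once, e.g. from the plugin's initialize(): loads the compiled kernel.
void initializeFmhaKernels()
{
    load_fmha_d64_fp16();
}

// Called from the plugin's terminate(): unloads the kernel.
void terminateFmhaKernels()
{
    unload_fmha_d64_fp16();
}

// Called from the plugin's enqueue(): launches the kernel on the given stream.
// The grid mirrors the -g option used at AoT compile time:
// "(seq_len + 127) / 128, batch_size * num_heads, 1".
CUresult launchFmha(CUstream stream, CUdeviceptr out, CUdeviceptr l, CUdeviceptr m,
                    CUdeviceptr q, CUdeviceptr k, CUdeviceptr v, float smScale,
                    int batchSize, int numHeads, int seqLen)
{
    unsigned int gridX = (seqLen + 127) / 128;
    unsigned int gridY = batchSize * numHeads;
    return fmha_d64_fp16(stream, gridX, gridY, /*gZ=*/1,
                         out, l, m, q, k, v, smScale, seqLen);
}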
Note that the name of the Triton kernel depends on the function's signature, meaning that different types or specializations lead to different kernel names. Thus, if you change the -s <signature> option during AoT compilation, you also have to update the file names in CMakeLists.txt to match the names generated by the AoT compiler.
To build a shared library for the custom Triton plugin, run:
mkdir -p build && cd build
cmake .. && make
cd ..
As mentioned in the previous section, it is necessary to have the TensorRT-LLM C++ runtime library. If you want to specify the library paths explicitly, run:
cmake -DTRT_LIB_DIR=</path/to/trt_lib> -DTRT_INCLUDE_DIR=</path/to/trt_headers> -DTRT_LLM_LIB_DIR=</path/to/trt_llm_lib> -DTRT_LLM_INCLUDE_DIR=</path/to/trt_llm_headers> ..
If the build is successful, you should be able to find a shared library for the custom plugin at build/libtrt_llm_custom_plugins.so.
A Python wrapper of the Fused Multihead Attention (FMHA) operator and the corresponding TensorRT-LLM layer are implemented in plugin.py. It is similar to other TensorRT-LLM operators and layers implemented in functional.py and layers, respectively. That FMHA operator uses the custom plugin that wraps the functions generated from the Triton kernel.
We are now ready to build and run the TensorRT engine that uses the Triton kernel. Here are the two commands to build and run the engine:
python build.py --num_heads 32 --head_size 64 --max_batch_size 8 --max_seq_len 512 --dtype float16
python run.py --num_heads 32 --head_size 64 --batch_size 8 --seq_len 512 --log_level verbose --benchmark
The kernel dispatcher generated by link.py has a missing branch, which can result in the function returning without executing any kernel. For instance, in our example, the generated dispatcher looks like this:
CUresult fmha_d64_fp16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len){
  if ((Out % 16 == 0) && (L % 16 == 0) && (M % 16 == 0) && (Q % 16 == 0) && (K % 16 == 0) && (V % 16 == 0))
    return fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(stream, gX, gY, gZ, Out, L, M, Q, K, V, sm_scale, seq_len);
}
It is recommended to manually update the functions generated by link.py so that the missing branch returns an explicit error, enabling proper error handling.
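As a minimal sketch, the missing branch could be patched like this (based on the fp16 dispatcher shown above); returning CUDA_ERROR_INVALID_VALUE is just one possible choice:
CUresult fmha_d64_fp16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len){
  if ((Out % 16 == 0) && (L % 16 == 0) && (M % 16 == 0) && (Q % 16 == 0) && (K % 16 == 0) && (V % 16 == 0))
    return fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(stream, gX, gY, gZ, Out, L, M, Q, K, V, sm_scale, seq_len);
  // No compiled specialization matches the given pointer alignments:
  // report an explicit error instead of falling through without a return value.
  return CUDA_ERROR_INVALID_VALUE;
}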
The AoT compiler does not check the shared memory size limit at compilation time, which could potentially lead to out-of-resource errors at runtime. It is therefore helpful to verify whether the dynamic shared memory size required by a generated kernel exceeds the hardware limit. You can find the number in the cuLaunchKernel call in the generated .c file. For instance, the shared memory size is 114690 bytes in our example.
CUresult fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr Out, CUdeviceptr L, CUdeviceptr M, CUdeviceptr Q, CUdeviceptr K, CUdeviceptr V, float sm_scale, int32_t seq_len) {
  if (fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func == NULL)
    load_fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67();
  void *args[8] = { &Out, &L, &M, &Q, &K, &V, &sm_scale, &seq_len };
  // TODO: shared memory
  if (gX * gY * gZ > 0)
    return cuLaunchKernel(fmha_d64_fp16_0eb6b090_0d1d2d3d4d5d67_func, gX, gY, gZ, 4 * 32, 1, 1, 114690, stream, args, NULL);
}
It may be resolved by reducing the block size (for example, BLOCK_M or BLOCK_N in the AoT compilation commands above).
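If you want to check this programmatically, a small sketch like the following (a hypothetical helper using the CUDA driver API) compares the requested dynamic shared memory size against the device limit:
#include <cuda.h>
#include <cstdio>

// Hypothetical helper: returns true if requestedBytes (e.g. the 114690 bytes
// taken from the cuLaunchKernel call above) fits into the device's opt-in
// shared memory limit per block.
bool fitsSharedMemory(int deviceOrdinal, int requestedBytes)
{
    CUdevice device;
    int maxOptInSharedMem = 0;
    cuInit(0);
    cuDeviceGet(&device, deviceOrdinal);
    cuDeviceGetAttribute(&maxOptInSharedMem,
                         CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
                         device);
    std::printf("requested %d bytes, device limit %d bytes\n",
                requestedBytes, maxOptInSharedMem);
    return requestedBytes <= maxOptInSharedMem;
}
Note that, in general, launching a kernel with more than 48 KB of dynamic shared memory also requires opting in through cuFuncSetAttribute with CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES.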