The XQA kernel provides an optimization for multi-query attention (MQA) and grouped-query attention (GQA) during the generation phase, and also optimizes beam search. By using Tensor Cores for acceleration and reducing data loading and conversion, it delivers increased throughput within the same latency budget. The increased throughput allows serving a greater number of user requests while providing the same experience.

The support matrix and usage flags are described in docs/source/gpt_attention.
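As a reminder of what GQA does during generation (and therefore what the XQA kernel accelerates), the NumPy sketch below shows several query heads attending against a single shared K/V head, so each decoded token reads far fewer cached K/V values than standard multi-head attention would. The shapes and variable names are illustrative assumptions; this is not the XQA kernel or a TensorRT-LLM API.

```python
# Minimal GQA decode-step sketch (illustrative only, not the XQA kernel).
import numpy as np

num_q_heads, num_kv_heads, head_dim, past_len = 8, 2, 64, 512
group = num_q_heads // num_kv_heads  # 4 query heads share each K/V head

q = np.random.randn(num_q_heads, head_dim)             # queries for one new token
k = np.random.randn(num_kv_heads, past_len, head_dim)  # cached keys (shared)
v = np.random.randn(num_kv_heads, past_len, head_dim)  # cached values (shared)

out = np.empty_like(q)
for h in range(num_q_heads):
    kv = h // group                                # map query head -> shared KV head
    scores = q[h] @ k[kv].T / np.sqrt(head_dim)    # (past_len,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out[h] = probs @ v[kv]                         # (head_dim,)

print(out.shape)  # (8, 64); only 2 KV heads were loaded instead of 8
```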
Increased throughput: the throughput-latency curves below show that enabling the XQA optimization increases throughput. Higher throughput equates to serving more users, and the TPOT curve on the Y-axis flattens out once XQA is enabled.
Preliminary measured performance, subject to change. TPOT: lower is better. FP8, 8x H100 GPUs, single engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a.
H200: up to 2.4x throughput with XQA
| Model | GPUs | Input Length | Output Length | Throughput w/o XQA (tok/s/GPU) | Throughput w/ XQA (tok/s/GPU) | Speedup |
|---|---|---|---|---|---|---|
| Llama-70B | 1 | 128 | 2048 | 1,227 | 2,941 | 2.4x |
| Llama-70B | 8 | 128 | 2048 | 13,232 | 25,300 | 1.9x |
These improvements will be published in the main branch soon, and will be included in the v0.8 releases.
For more information about H200, please see the H200 announcement blog.
Throughput is calculated as output tokens per second per GPU: `out_tps = output_seqlen * batch_size / total_latency / tp`
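As a rough illustration of how this formula is applied, the snippet below plugs in placeholder values; the numbers are not measurements from this blog.

```python
# Placeholder inputs, chosen only to demonstrate the formula.
output_seqlen = 2048   # OSL: tokens generated per request
batch_size = 64        # concurrent requests
total_latency = 40.0   # end-to-end generation time in seconds
tp = 8                 # tensor-parallel degree (number of GPUs)

out_tps = output_seqlen * batch_size / total_latency / tp
print(f"{out_tps:,.1f} output tokens / s / GPU")  # 409.6
```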
Glossary: DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel