Benchmark for Python Runtime

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, on a single node with multiple GPUs, or on multiple nodes with multiple GPUs.

Overview

The benchmark implementation and entrypoint can be found in benchmarks/python/benchmark.py. There are some other helper scripts in the same directory.

Usage

Please use the help option for detailed usage.

python benchmark.py -h

1. Single GPU benchmark

Take GPT-350M as an example:

python benchmark.py \
    -m gpt_350m \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"

Expected outputs:

[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 1 input_length 60 output_length 20 gpu_peak_mem(gb) 4.2 build_time(s) 25.67 tokens_per_sec 483.54 percentile95(ms) 41.537 percentile99(ms) 42.102 latency(ms) 41.362 compute_cap sm80
[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 8 input_length 60 output_length 20 gpu_peak_mem(gb) 4.28 build_time(s) 25.67 tokens_per_sec 3477.28 percentile95(ms) 46.129 percentile99(ms) 46.276 latency(ms) 46.013 compute_cap sm80
[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 64 input_length 60 output_length 20 gpu_peak_mem(gb) 4.8 build_time(s) 25.67 tokens_per_sec 19698.07 percentile95(ms) 65.739 percentile99(ms) 65.906 latency(ms) 64.981 compute_cap sm80
...

Please note that the expected outputs are for reference only; the specific performance numbers depend on the GPU you are using.
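
When sweeping several batch sizes and sequence lengths, it can be convenient to save the output and tabulate the [BENCHMARK] lines. The sketch below is only one way to do this with standard shell tools; benchmark.log is a placeholder file name, not something the script creates for you.

python benchmark.py \
    -m gpt_350m \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20" | tee benchmark.log

# Pull out batch size, sequence lengths, and throughput from each result line.
grep '^\[BENCHMARK\]' benchmark.log | awk '{
    for (i = 1; i <= NF; i++)
        if ($i == "batch_size" || $i == "input_length" ||
            $i == "output_length" || $i == "tokens_per_sec")
            printf "%s=%s ", $i, $(i + 1);
    print ""
}'

Each printed line then contains one configuration, e.g. batch_size=1 input_length=60 output_length=20 tokens_per_sec=483.54, which is easy to compare across runs.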

2. Multi-GPU benchmark

Take GPT-175B as an example:

mpirun -n 8 python benchmark.py \
    -m gpt_175b \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"
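
For the multi-node case mentioned in the overview, the same script can be launched across hosts with your MPI launcher. The following is only a sketch: node1 and node2 are placeholder hostnames with 8 GPUs each, and the exact host/slot options depend on the MPI distribution you use.

mpirun -H node1:8,node2:8 -n 16 python benchmark.py \
    -m gpt_175b \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"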