llm-inference-benchmark

LLM Inference benchmark

Inference frameworks

| Framework | Producibility**** | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
| OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers (int8, int4, gptq), vLLM (awq/squeezellm), TensorRT | No |
| vLLM* | High | Yes | Yes | Yes | No | No | Yes (with Ray) | vLLM | No |
| Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
| TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETQ/vLLM/ExLlama/ExLlamaV2 | No |
| ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
| FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |
  • *vLLM/TGI can also serve as a backend.
  • **Multi Models: Capable of loading multiple models simultaneously.
  • ***TGI does not support chat mode; manual parsing of the prompt is required.
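
Most of the frameworks above expose an OpenAI-compatible API, so a single client script can be reused across them when benchmarking. A minimal sketch, assuming a local server on port 8000 and a model registered as `Yi-6B-Chat` (both are placeholder assumptions, not part of this repo):

```python
# Sketch: querying any of the OpenAI-compatible servers listed above.
# base_url, api_key and model name are assumptions -- adjust to the server you run.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Yi-6B-Chat",  # model name as registered by the server (placeholder)
    messages=[{"role": "user", "content": "Explain the KV cache in one sentence."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```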

Inference backends

| Backend | Device | Compatibility** | PEFT Adapters* | Quantisation | Batching | Distributed Inference | Streaming |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transformers | GPU | High | Yes | bitsandbytes (int8/int4), AutoGPTQ (gptq), AutoAWQ (awq) | Yes | accelerate | Yes |
| vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
| ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
| TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
| Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
| CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
| TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
| llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
| lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
| Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |
  • *PEFT Adapters: supports loading separate PEFT adapters (mostly LoRA).
  • **Compatibility: High: Compatible with most models; Medium: Compatible with some models; Low: Compatible with few models.
  • ***llama.cpp's Python binding: llama-cpp-python.
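
For the Transformers backend, the "Quantisation" and "PEFT Adapters" columns roughly correspond to the load pattern below. This is a generic sketch; the model id and adapter id are placeholders, not artifacts of this benchmark:

```python
# Sketch: load a 4-bit (bitsandbytes) quantised base model and attach a
# separate LoRA adapter via peft. Model/adapter ids are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "01-ai/Yi-6B-Chat"        # placeholder base model
adapter_id = "your-org/your-lora"   # placeholder PEFT (LoRA) adapter

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True for int8
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=quant_config,
    device_map="auto",   # placement handled by accelerate
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the adapter
```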

Benchmark

Hardware:

  • GPU: 1x NVIDIA RTX4090 24GB
  • CPU: Intel Core i9-13900K
  • Memory: 96GB

Software:

  • VM: WSL2 on Windows 11
  • Guest OS: Ubuntu 22.04
  • NVIDIA Driver Version: 536.67
  • CUDA Version: 12.2
  • PyTorch: 2.1.1

Model:

  • Yi-6B-Chat

Data:

  • Prompt Length: 512 tokens (with some random characters prepended to avoid prompt caching).
  • Max Tokens: 200.
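
A sketch of how prompts matching this description could be built: pad a base question to roughly 512 tokens and prepend a short random string so that prefix/prompt caches cannot be reused between requests. This is an illustration, not the exact prompt-building code used for the numbers below:

```python
# Sketch: build a ~512-token prompt with a random prefix to defeat prompt caching.
import random
import string

def make_prompt(tokenizer, base_text: str, target_len: int = 512) -> str:
    noise = "".join(random.choices(string.ascii_letters + string.digits, k=16))
    prompt = f"[{noise}] {base_text}"
    # Repeat the base text until the tokenised prompt reaches the target length.
    while len(tokenizer.encode(prompt)) < target_len:
        prompt += " " + base_text
    return prompt
```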

Backend Benchmark

No Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| text-generation-webui Transformer | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
| text-generation-webui Transformer with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
| text-generation-webui ExllamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
| OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
| TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
| vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
| TensorRT | - | - | - | - | - |
| CTranslate2* | - | - | - | - | - |
| lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |
  • @4 / @1: the batch size of the run, i.e. @4 means a batch size of 4.
  • TPS: Tokens Per Second.
  • QPS: Queries Per Second.
  • FTL: First Token Latency, measured in milliseconds. Applicable only in stream mode.
  • *Encountered an error using CTranslate2 to convert Yi-6B-Chat; see the issue for details.
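
These metrics can be measured against a streaming, OpenAI-compatible endpoint roughly as follows. This is an illustrative sketch (the endpoint, model name, and one-token-per-chunk assumption are simplifications), not the script that produced the numbers above:

```python
# Sketch: measure per-request TPS and FTL from a streaming response.
# QPS is completed requests divided by total wall-clock time across all workers.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_once(prompt: str, model: str = "Yi-6B-Chat"):
    start = time.perf_counter()
    first_token_ms = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_ms is None:
                first_token_ms = (time.perf_counter() - start) * 1000  # FTL in ms
            n_tokens += 1  # assumes roughly one token per streamed chunk
    tps = n_tokens / (time.perf_counter() - start)
    return tps, first_token_ms
```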

8Bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| TGI eetq 8bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
| TGI GPTQ 8bit | - | - | - | - | - |
| OpenLLM PyTorch AutoGPTQ 8bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |
  • bitsandbytes is very slow (int8: 6.8 tokens/s), so it is not benchmarked.
  • eetq 8-bit does not require a specially prepared (pre-quantised) model.
  • TGI GPTQ 8-bit failed to load: Server error: module 'triton.compiler' has no attribute 'OutOfResources'
    • TGI's GPTQ path uses the ExLlama or Triton backend.
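
Unlike eetq, GPTQ needs a checkpoint that was quantised beforehand (or an on-the-fly calibration pass). A hedged sketch of the latter via the `transformers` GPTQ integration, with a placeholder model id; this is not how TGI itself loads GPTQ weights:

```python
# Sketch: quantising a model to GPTQ 8-bit with the transformers/auto-gptq
# integration. The model id and calibration dataset are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "01-ai/Yi-6B-Chat"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,  # triggers calibration + quantisation on load
    device_map="auto",
)
```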

4Bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| TGI AWQ 4bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
| vLLM AWQ 4bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
| text-generation-webui llama-cpp GGUF 4bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |
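
For reference, the vLLM AWQ 4-bit run corresponds to loading an AWQ-quantised checkpoint with `quantization="awq"`. A minimal sketch with a placeholder model id (any AWQ-quantised variant of the benchmark model):

```python
# Sketch: offline generation with vLLM and an AWQ 4-bit checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Yi-6B-Chat-AWQ", quantization="awq")  # placeholder id
params = SamplingParams(max_tokens=200)
outputs = llm.generate(["Explain the KV cache in one sentence."], params)
print(outputs[0].outputs[0].text)
```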