| Quick Start | DeepSeek Deployment Guide | Qwen3 Deployment Guide | Changelog |
fastllm is a high-performance LLM inference library implemented in C++ with no dependency on a backend framework (e.g., PyTorch).
It enables hybrid CPU/GPU inference of MoE models, achieving 20+ tokens/s for DeepSeek R1 671B INT4 inference on a single consumer-grade GPU (e.g., an RTX 4090).
Deployment discussion QQ group: 831641348 (a WeChat group is also available via QR code)
On Linux, you can try installing directly with pip:
pip install ftllm -U
(Note: due to PyPI package-size limits, the package does not bundle the CUDA dependencies; installing CUDA 12+ manually is recommended.)
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
sudo sh cuda_12.8.1_570.124.06_linux.run
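After the installer finishes, make sure nvcc is on your PATH, since the build commands below locate it with $(which nvcc). The path here assumes the installer's default location of /usr/local/cuda:
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version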
If pip installation fails or you have special requirements, you can build from source:
This project is built with CMake. It requires gcc and g++ (7.5+ tested, 9.4+ recommended), make, and cmake (3.23+ recommended) to be installed beforehand.
GPU compilation requires a CUDA environment (9.2+ is tested). Use the newest CUDA version possible.
Compilation commands:
bash install.sh -DUSE_CUDA=ON -D CMAKE_CUDA_COMPILER=$(which nvcc) # GPU version
# bash install.sh -DUSE_CUDA=ON -DCUDA_ARCH=89 -D CMAKE_CUDA_COMPILER=$(which nvcc) # Specify CUDA arch (e.g. 89 for RTX 4090)
# bash install.sh # CPU-only version
For compilation instructions on other platforms, please refer to the documentation.
If you run into problems during compilation, see the FAQ document.
Taking the Qwen/Qwen3-0.6B model as an example:
ftllm run Qwen/Qwen3-0.6B      # interactive chat in the terminal
ftllm webui Qwen/Qwen3-0.6B    # launch a web UI
ftllm server Qwen/Qwen3-0.6B   # launch an API server
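Once the server is running, you can send it a chat request. The sketch below assumes the server exposes an OpenAI-style /v1/chat/completions endpoint and was started with --port 8080; adjust the port and model name to your setup:
ftllm server Qwen/Qwen3-0.6B --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}]}'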
You can also launch a locally downloaded Hugging Face model. Assuming the local model path is /mnt/Qwen/Qwen3-0.6B/, use the following command (the same applies to webui and server):
ftllm run /mnt/Qwen/Qwen3-0.6B/
If you can't remember the exact model name, you can input an approximate name (matching is not guaranteed).
For example:
ftllm run qwen2-7b-awq
ftllm run deepseek-v3-0324-int4
If you don't want to use the default cache directory, you can set it with the --cache_dir parameter, for example:
ftllm run deepseek-v3-0324-int4 --cache_dir /mnt/
Or you can set it via the environment variable FASTLLM_CACHEDIR. For example, on Linux:
export FASTLLM_CACHEDIR=/mnt/
The following are common parameters when running the ftllm module:
- -t or --threads: number of CPU threads to use, e.g. -t 27
- --dtype: data type to load the model as, int4 or another supported type, e.g. --dtype int4
- --device: compute device, cpu, cuda, or numa, e.g. --device cpu or --device cuda
- --moe_device: device used for the MoE (expert) layers, cpu, cuda, or numa, e.g. --moe_device cpu
- --moe_experts: number of experts activated per MoE layer; defaults to the model's own setting, and using fewer experts is faster but may reduce quality, e.g. --moe_experts 6
- --port: port for the API server to listen on, e.g. --port 8080
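These options can be combined. As an illustrative (not tuned) example of hybrid inference for a large MoE model, keeping the expert layers on the CPU and the rest on the GPU:
ftllm server deepseek-v3-0324-int4 --device cuda --moe_device cpu -t 27 --port 8080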
Please read Arguments for Demos for further information.
Use the following command to download a model locally:
ftllm download deepseek-ai/DeepSeek-R1
If a model is loaded with online quantization (e.g., --dtype int4), the weights are re-quantized every time the model is loaded, which can be slow.
ftllm export is a tool for exporting and converting model weights. It supports converting model weights to different data types, so a quantized copy can be saved once and reloaded quickly. Below is how to use it:
ftllm export <model_path> -o <output_path> --dtype <data_type> -t <threads>
ftllm export /mnt/DeepSeek-V3 -o /mnt/DeepSeek-V3-INT4 --dtype int4 -t 16
You can specify --moe_dtype to use mixed precision for a MoE model, for example:
ftllm export /mnt/DeepSeek-V3 -o /mnt/DeepSeek-V3-FP16INT4 --dtype float16 --moe_dtype int4 -t 16
The exported model can be used in the same way as the original model; the --dtype parameter is ignored when loading an exported model. For example:
ftllm run /mnt/DeepSeek-V3-INT4/
Fastllm supports original Hugging Face models, AWQ models, and FASTLLM-format models. Please refer to Supported Models for older models.