# ServerlessLLM
**Repository Path**: underdogs/ServerlessLLM
## Basic Information
- **Project Name**: ServerlessLLM
- **Description**: No description available
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-31
- **Last Updated**: 2026-01-31
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**ServerlessLLM**: Load models 10x faster. Serve 10 models with 1 GPU.

[Docs](https://serverlessllm.github.io/docs) • [Quick Start](#quick-start-90-seconds) • [OSDI'24 Paper](https://www.usenix.org/conference/osdi24/presentation/fu)
---
## Performance
**ServerlessLLM loads models 6-10x faster than SafeTensors**, enabling true serverless deployment where multiple models efficiently share GPU resources.
| Model | Scenario | SafeTensors | ServerlessLLM | Speedup |
|---|---|---|---|---|
| Qwen/Qwen3-32B | Random | 20.6s | 3.2s | 6.40x |
| Qwen/Qwen3-32B | Cached | 12.5s | 1.3s | 9.95x |
| DeepSeek-R1-Distill-Qwen-32B | Random | 19.1s | 3.2s | 5.93x |
| DeepSeek-R1-Distill-Qwen-32B | Cached | 10.2s | 1.2s | 8.58x |
| Llama-3.1-8B-Instruct | Random | 4.4s | 0.7s | 6.54x |
*Results obtained on NVIDIA H100 GPUs with NVMe SSD. "Random" simulates serverless multi-model serving; "Cached" shows repeated loading of the same model.*
## What is ServerlessLLM?
ServerlessLLM is a fast, low-cost system for deploying multiple AI models on shared GPUs, with three core innovations:
1. **Ultra-Fast Checkpoint Loading**: Custom storage format with O_DIRECT I/O loads models 6-10x faster than state-of-the-art checkpoint loaders
2. **GPU Multiplexing**: Multiple models share GPUs with fast switching and intelligent scheduling
3. **Unified Inference + Fine-Tuning**: Seamlessly integrates LLM serving with LoRA fine-tuning on shared resources
**Result:** Serve 10 models on 1 GPU, fine-tune on-demand, and serve a base model + 100s of LoRA adapters.
---
## Quick Start (90 Seconds)
### Start ServerlessLLM Cluster
> **Don't have Docker?** Jump to [Use the Fast Loader in Your Code](#use-the-fast-loader-in-your-code) for a Docker-free example.
```bash
# Download the docker-compose.yml file
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml
# Set model storage location
export MODEL_FOLDER=/path/to/models
# Launch cluster (head node + worker with GPU)
docker compose up -d
# Wait for the cluster to be ready
docker logs -f sllm_head
```
### Deploy a Model
```bash
docker exec sllm_head /opt/conda/envs/head/bin/sllm deploy --model Qwen/Qwen3-0.6B --backend transformers
```
### Query the Model
```bash
curl http://127.0.0.1:8343/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "What is ServerlessLLM?"}],
"temperature": 0.7
}'
```
**That's it!** Your model is now serving requests with an OpenAI-compatible API.
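Because the API is OpenAI-compatible, you can also query it from Python. Below is a minimal sketch using the official `openai` client package, assuming the cluster above is running on `127.0.0.1:8343`; the API key is a placeholder, since the curl example above sends no credentials.

```python
# Hedged sketch: query the ServerlessLLM endpoint with the `openai` Python client.
# Assumes `pip install openai` and that Qwen/Qwen3-0.6B was deployed as shown above.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8343/v1",
    api_key="EMPTY",  # placeholder; the curl example above sends no credentials
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is ServerlessLLM?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```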
---
## Use the Fast Loader in Your Code
Use ServerlessLLM Store standalone to speed up torch-based model loading.
### Install
```bash
pip install serverless-llm-store
```
### Convert a Model
```bash
sllm-store save --model Qwen/Qwen3-0.6B --backend transformers
```
### Start the Store Server
```bash
# Start the store server first
sllm-store start --storage-path ./models --mem-pool-size 4GB
```
### Load it 6-10x Faster in Your Python Code
```python
from sllm_store.transformers import load_model
from transformers import AutoTokenizer

# Load the model (6-10x faster than from_pretrained!)
model = load_model(
    "Qwen/Qwen3-0.6B",
    device_map="auto",
    torch_dtype="float16"
)

# Use it like a normal PyTorch/Transformers model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("What is ServerlessLLM?", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
**How it works:**
- Custom binary format optimized for sequential reads
- O_DIRECT I/O bypassing OS page cache
- Pinned memory pool for DMA-accelerated GPU transfers
- Parallel multi-threaded loading
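To check the speedup on your own hardware, you can time the two loading paths side by side. This is a minimal sketch, assuming the store server above is running, the checkpoint was converted with `sllm-store save`, and `transformers` plus `accelerate` are installed; absolute numbers depend on your SSD and GPU.

```python
import time

import torch
from transformers import AutoModelForCausalLM

from sllm_store.transformers import load_model

MODEL = "Qwen/Qwen3-0.6B"

# Baseline: the standard Hugging Face loading path.
t0 = time.perf_counter()
baseline = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
print(f"from_pretrained: {time.perf_counter() - t0:.2f}s")

# Free GPU memory before the second load.
del baseline
torch.cuda.empty_cache()

# ServerlessLLM Store path: needs the converted checkpoint and a running store server.
t0 = time.perf_counter()
model = load_model(MODEL, device_map="auto", torch_dtype="float16")
print(f"sllm_store load_model: {time.perf_counter() - t0:.2f}s")
```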
---
## Key Features
### Ultra-Fast Model Loading
- **6-10x faster** than the SafeTensors checkpoint loader
- Supports both NVIDIA and AMD GPUs
- Works with vLLM, Transformers, and custom models
**Docs:** [Fast Loading Guide](https://serverlessllm.github.io/docs/store/quickstart) | [ROCm Guide](https://serverlessllm.github.io/docs/store/rocm_quickstart)
---
### GPU Multiplexing
- **Run 10+ models on 1 GPU** with fast switching
- Storage-aware scheduling minimizes loading time
- Auto-scale instances per model (scale to zero when idle)
- Live migration for zero-downtime resource optimization
**Docs:** [Deployment Guide](https://serverlessllm.github.io/docs/getting_started)
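From a client's point of view, multiplexing is transparent: every deployed model sits behind the same OpenAI-compatible endpoint, and each request simply names the model it wants. Here is a hedged sketch, assuming both models below were deployed beforehand with `sllm deploy` as in the Quick Start (the second model name is purely illustrative).

```python
import requests

ENDPOINT = "http://127.0.0.1:8343/v1/chat/completions"

# Both models are assumed to have been deployed with `sllm deploy`;
# the scheduler decides how GPU resources are shared between them.
for model_name in ["Qwen/Qwen3-0.6B", "meta-llama/Llama-3.1-8B-Instruct"]:
    resp = requests.post(
        ENDPOINT,
        json={
            "model": model_name,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(model_name, "->", resp.json()["choices"][0]["message"]["content"])
```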
---
### Unified Inference + LoRA Fine-Tuning
- Integrates LLM serving with serverless LoRA fine-tuning
- Deploys fine-tuned adapters for inference on-demand
- Serves a base model + 100s of LoRA adapters efficiently
**Docs:** [Fine-Tuning Guide](https://serverlessllm.github.io/docs/features/peft_lora_fine_tuning)
---
### Embedding Models for RAG
- Deploy embedding models alongside LLMs
- Provides an OpenAI-compatible `/v1/embeddings` endpoint
**Example:** [RAG Example](https://github.com/ServerlessLLM/ServerlessLLM/tree/main/examples/embedding)
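The embeddings endpoint follows the same OpenAI conventions as chat completions. A minimal sketch, assuming an embedding model has already been deployed to the cluster (the model name below is illustrative; see the RAG example above for actual deployment steps):

```python
import requests

# Illustrative only: the embedding model must first be deployed to the cluster.
resp = requests.post(
    "http://127.0.0.1:8343/v1/embeddings",
    json={
        "model": "BAAI/bge-small-en-v1.5",  # assumed example model, not prescribed by ServerlessLLM
        "input": ["ServerlessLLM loads checkpoints fast."],
    },
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding), "dimensions")
```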
---
### Production-Ready
- **OpenAI-compatible API** (drop-in replacement)
- Docker and Kubernetes deployment
- Multi-node clusters with distributed scheduling
**Docs:** [Deployment Guide](https://serverlessllm.github.io/docs/developer/supporting_a_new_hardware) | [API Reference](https://serverlessllm.github.io/docs/api/intro)
---
### Supported Hardware
- **NVIDIA GPUs**: Compute capability 7.0+ (V100, A100, H100, RTX 3060+)
- **AMD GPUs**: ROCm 6.2+ (MI100, MI200 series) - Experimental
**More Examples:** [./examples/](./examples/)
---
## Community
- **Discord**: [Join our community](https://discord.gg/AEF8Gduvm8) - Get help, share ideas
- **GitHub Issues**: [Report bugs](https://github.com/ServerlessLLM/ServerlessLLM/issues)
- **WeChat**: [QR Code](./docs/images/wechat.png) - Chinese-language support
- **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md)
Maintained by 10+ contributors worldwide. Community contributions are welcome!
---
## Citation
If you use ServerlessLLM in your research, please cite our [OSDI'24 paper](https://www.usenix.org/conference/osdi24/presentation/fu):
```bibtex
@inproceedings{fu2024serverlessllm,
title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
booktitle={OSDI'24},
year={2024}
}
```
---
## License
Apache 2.0 - See [LICENSE](./LICENSE)
---
Star this repo if ServerlessLLM helps you!