# ServerlessLLM
**Repository Path**: underdogs/ServerlessLLM
## Basic Information
- **Project Name**: ServerlessLLM
- **Description**: No description available
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-31
- **Last Updated**: 2026-01-31
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**ServerlessLLM**: Load models 10x faster. Serve 10 models with 1 GPU.

[Docs](https://serverlessllm.github.io/docs) • [Quick Start](#quick-start-90-seconds) • [OSDI'24 Paper](https://www.usenix.org/conference/osdi24/presentation/fu)
---
## Performance
**ServerlessLLM loads models 6-10x faster than SafeTensors**, enabling true serverless deployment where multiple models efficiently share GPU resources.
| Model | Scenario | SafeTensors | ServerlessLLM | Speedup |
|---|---|---|---|---|
| Qwen/Qwen3-32B | Random | 20.6s | 3.2s | 6.40x |
| Qwen/Qwen3-32B | Cached | 12.5s | 1.3s | 9.95x |
| DeepSeek-R1-Distill-Qwen-32B | Random | 19.1s | 3.2s | 5.93x |
| DeepSeek-R1-Distill-Qwen-32B | Cached | 10.2s | 1.2s | 8.58x |
| Llama-3.1-8B-Instruct | Random | 4.4s | 0.7s | 6.54x |
*Results obtained on NVIDIA H100 GPUs with NVMe SSD. "Random" simulates serverless multi-model serving; "Cached" shows repeated loading of the same model.*
## What is ServerlessLLM?
ServerlessLLM is a fast, low-cost system for deploying multiple AI models on shared GPUs, with three core innovations:
1. **Ultra-Fast Checkpoint Loading**: Custom storage format with O_DIRECT I/O loads models 6-10x faster than state-of-the-art checkpoint loaders
2. **GPU Multiplexing**: Multiple models share GPUs with fast switching and intelligent scheduling
3. **Unified Inference + Fine-Tuning**: Seamlessly integrates LLM serving with LoRA fine-tuning on shared resources
**Result:** Serve 10 models on 1 GPU, fine-tune on-demand, and serve a base model + 100s of LoRA adapters.
---
## Quick Start (90 Seconds)
### Start ServerlessLLM Cluster
> **Don't have Docker?** Jump to [Use the Fast Loader in Your Code](#use-the-fast-loader-in-your-code) for a Docker-free example.
```bash
# Download the docker-compose.yml file
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml
# Set model storage location
export MODEL_FOLDER=/path/to/models
# Launch cluster (head node + worker with GPU)
docker compose up -d
# Wait for the cluster to be ready
docker logs -f sllm_head
```
### Deploy a Model
```bash
docker exec sllm_head /opt/conda/envs/head/bin/sllm deploy --model Qwen/Qwen3-0.6B --backend transformers
```
### Query the Model
```bash
curl http://127.0.0.1:8343/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "What is ServerlessLLM?"}],
"temperature": 0.7
}'
```
**That's it!** Your model is now serving requests with an OpenAI-compatible API.
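Because the API is OpenAI-compatible, you can also query it from Python. Below is a minimal sketch using the official `openai` client package, assuming the cluster above is running on `127.0.0.1:8343`; the API key is a placeholder, since the curl example above sends no credentials.

```python
# Hedged sketch: query the ServerlessLLM endpoint with the `openai` Python client.
# Assumes `pip install openai` and that Qwen/Qwen3-0.6B was deployed as shown above.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8343/v1",
    api_key="EMPTY",  # placeholder; the curl example above sends no credentials
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is ServerlessLLM?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```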
---
## Use the Fast Loader in Your Code
Use ServerlessLLM Store standalone to speed up torch-based model loading.
### Install
```bash
pip install serverless-llm-store
```
### Convert a Model
```bash
sllm-store save --model Qwen/Qwen3-0.6B --backend transformers
```
### Start the Store Server
```bash
# Start the store server first
sllm-store start --storage-path ./models --mem-pool-size 4GB
```
### Load it 6-10x Faster in Your Python Code
```python
from sllm_store.transformers import load_model
from transformers import AutoTokenizer

# Load the model (6-10x faster than from_pretrained!)
model = load_model(
    "Qwen/Qwen3-0.6B",
    device_map="auto",
    torch_dtype="float16"
)

# Use it like a normal PyTorch/Transformers model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("What is ServerlessLLM?", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
**How it works:**
- Custom binary format optimized for sequential reads
- O_DIRECT I/O bypassing OS page cache
- Pinned memory pool for DMA-accelerated GPU transfers
- Parallel multi-threaded loading
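To check the speedup on your own hardware, you can time the two loading paths side by side. This is a minimal sketch, assuming the store server above is running, the checkpoint was converted with `sllm-store save`, and `transformers` plus `accelerate` are installed; absolute numbers depend on your SSD and GPU.

```python
import time

import torch
from transformers import AutoModelForCausalLM

from sllm_store.transformers import load_model

MODEL = "Qwen/Qwen3-0.6B"

# Baseline: the standard Hugging Face loading path.
t0 = time.perf_counter()
baseline = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
print(f"from_pretrained: {time.perf_counter() - t0:.2f}s")

# Free GPU memory before the second load.
del baseline
torch.cuda.empty_cache()

# ServerlessLLM Store path: needs the converted checkpoint and a running store server.
t0 = time.perf_counter()
model = load_model(MODEL, device_map="auto", torch_dtype="float16")
print(f"sllm_store load_model: {time.perf_counter() - t0:.2f}s")
```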
---
## Key Features
### Ultra-Fast Model Loading
- **6-10x faster** than the SafeTensors checkpoint loader
- Supports both NVIDIA and AMD GPUs
- Works with vLLM, Transformers, and custom models
**Docs:** [Fast Loading Guide](https://serverlessllm.github.io/docs/store/quickstart) | [ROCm Guide](https://serverlessllm.github.io/docs/store/rocm_quickstart)
---
### GPU Multiplexing
- **Run 10+ models on 1 GPU** with fast switching
- Storage-aware scheduling minimizes loading time
- Auto-scale instances per model (scale to zero when idle)
- Live migration for zero-downtime resource optimization
**Docs:** [Deployment Guide](https://serverlessllm.github.io/docs/getting_started)
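From a client's point of view, multiplexing is transparent: every deployed model sits behind the same OpenAI-compatible endpoint, and each request simply names the model it wants. Here is a hedged sketch, assuming both models below were deployed beforehand with `sllm deploy` as in the Quick Start (the second model name is purely illustrative).

```python
import requests

ENDPOINT = "http://127.0.0.1:8343/v1/chat/completions"

# Both models are assumed to have been deployed with `sllm deploy`;
# the scheduler decides how GPU resources are shared between them.
for model_name in ["Qwen/Qwen3-0.6B", "meta-llama/Llama-3.1-8B-Instruct"]:
    resp = requests.post(
        ENDPOINT,
        json={
            "model": model_name,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(model_name, "->", resp.json()["choices"][0]["message"]["content"])
```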
---
### Unified Inference + LoRA Fine-Tuning
- Integrates LLM serving with serverless LoRA fine-tuning
- Deploys fine-tuned adapters for inference on-demand
- Serves a base model + 100s of LoRA adapters efficiently
**Docs:** [Fine-Tuning Guide](https://serverlessllm.github.io/docs/features/peft_lora_fine_tuning)
---
### Embedding Models for RAG
- Deploy embedding models alongside LLMs
- Provides an OpenAI-compatible `/v1/embeddings` endpoint
**Example:** [RAG Example](https://github.com/ServerlessLLM/ServerlessLLM/tree/main/examples/embedding)
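The embeddings endpoint follows the same OpenAI conventions as chat completions. A minimal sketch, assuming an embedding model has already been deployed to the cluster (the model name below is illustrative; see the RAG example above for actual deployment steps):

```python
import requests

# Illustrative only: the embedding model must first be deployed to the cluster.
resp = requests.post(
    "http://127.0.0.1:8343/v1/embeddings",
    json={
        "model": "BAAI/bge-small-en-v1.5",  # assumed example model, not prescribed by ServerlessLLM
        "input": ["ServerlessLLM loads checkpoints fast."],
    },
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding), "dimensions")
```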
---
### Production-Ready
- **OpenAI-compatible API** (drop-in replacement)
- Docker and Kubernetes deployment
- Multi-node clusters with distributed scheduling
**Docs:** [Deployment Guide](https://serverlessllm.github.io/docs/developer/supporting_a_new_hardware) | [API Reference](https://serverlessllm.github.io/docs/api/intro)
---
### Supported Hardware
- **NVIDIA GPUs**: Compute capability 7.0+ (V100, A100, H100, RTX 3060+)
- **AMD GPUs**: ROCm 6.2+ (MI100, MI200 series) - Experimental
**More Examples:** [./examples/](./examples/)
---
## Community
- **Discord**: [Join our community](https://discord.gg/AEF8Gduvm8) - Get help, share ideas
- **GitHub Issues**: [Report bugs](https://github.com/ServerlessLLM/ServerlessLLM/issues)
- **WeChat**: [QR Code](./docs/images/wechat.png) - Chinese-language support
- **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md)
Maintained by 10+ contributors worldwide. Community contributions are welcome!
---
## Citation
If you use ServerlessLLM in your research, please cite our [OSDI'24 paper](https://www.usenix.org/conference/osdi24/presentation/fu):
```bibtex
@inproceedings{fu2024serverlessllm,
title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
booktitle={OSDI'24},
year={2024}
}
```
---
## License
Apache 2.0 - See [LICENSE](./LICENSE)
---
Star this repo if ServerlessLLM helps you!