# Mini-SGLang
A **lightweight yet high-performance** inference framework for Large Language Models.
---
Mini-SGLang is a compact implementation of [SGLang](https://github.com/sgl-project/sglang), designed to demystify modern LLM serving systems. At roughly **5,000 lines of Python**, it serves as both a capable inference engine and a transparent reference for researchers and developers.
## ✨ Key Features
- **High Performance**: Achieves state-of-the-art throughput and latency with advanced optimizations.
- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- **Advanced Optimizations**:
- **Radix Cache**: Reuses KV cache for shared prefixes across requests.
- **Chunked Prefill**: Reduces peak memory usage for long-context serving.
- **Overlap Scheduling**: Hides CPU scheduling overhead with GPU computation.
- **Tensor Parallelism**: Scales inference across multiple GPUs.
- **Optimized Kernels**: Integrates **FlashAttention** and **FlashInfer** for maximum efficiency.
- ...
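To give a flavor of the first optimization: a radix cache stores previously seen token sequences in a prefix tree, so a new request only needs to prefill the suffix beyond its longest cached prefix. The sketch below illustrates the matching idea only; the real cache additionally tracks KV-cache pages, reference counts, and eviction.

```python
# Illustrative sketch of radix-style prefix reuse: a trie over token IDs.
# This is NOT Mini-SGLang's actual implementation.

class RadixNode:
    def __init__(self):
        self.children: dict[int, "RadixNode"] = {}

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record a token sequence so later requests can reuse its prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

cache = RadixCache()
cache.insert([1, 2, 3, 4])               # tokens of a completed request
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3 (only token 9 needs prefill)
```

A request that shares a system prompt with earlier requests thus skips recomputing attention for the shared portion, which is where most of the radix cache's throughput gains come from.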
## 🚀 Quick Start
> **⚠️ Platform Support**: Mini-SGLang currently supports **Linux only** (x86_64 and aarch64). Windows and macOS are not supported due to dependencies on Linux-specific CUDA kernels (`sgl-kernel`, `flashinfer`). We recommend using [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) on Windows or Docker for cross-platform compatibility.
### 1. Environment Setup
We recommend using `uv` for a fast and reliable installation (note that `uv` does not conflict with `conda`).
```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```
**Prerequisites**: Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the **NVIDIA CUDA Toolkit** is installed and that its version is compatible with your driver; `nvidia-smi` reports the maximum CUDA version your driver supports.
### 2. Installation
Install Mini-SGLang directly from the source:
```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .
```
#### 💡 Installing on Windows (WSL2)
Since Mini-SGLang requires Linux-specific dependencies, Windows users should use WSL2:
1. **Install WSL2** (if not already installed):
```powershell
# In PowerShell (as Administrator)
wsl --install
```
2. **Install CUDA on WSL2**:
- Follow [NVIDIA's WSL2 CUDA guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)
- Ensure your Windows GPU drivers support WSL2
3. **Install Mini-SGLang in WSL2**:
```bash
# Inside WSL2 terminal
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .
```
4. **Access from Windows**: The server will be accessible at `http://localhost:8000` from Windows browsers and applications.
#### 🐳 Running with Docker
**Prerequisites**:
- [Docker](https://docs.docker.com/get-docker/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
1. **Build the Docker image**:
```bash
docker build -t minisgl .
```
2. **Run the server**:
```bash
docker run --gpus all -p 1919:1919 \
minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
3. **Run in interactive shell mode**:
```bash
docker run -it --gpus all \
minisgl --model Qwen/Qwen3-0.6B --shell
```
4. **Using Docker Volumes for persistent caches** (recommended for faster subsequent startups):
```bash
docker run --gpus all -p 1919:1919 \
-v huggingface_cache:/app/.cache/huggingface \
-v tvm_cache:/app/.cache/tvm-ffi \
-v flashinfer_cache:/app/.cache/flashinfer \
minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
```
### 3. Online Serving
Launch an OpenAI-compatible API server with a single command.
```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"
# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with Tensor Parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```
Once the server is running, you can send requests using standard tools like `curl` or any OpenAI-compatible client.
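For example, a chat completion can be requested with nothing but the standard library. The port below (1919, as in the Docker examples) and the model name are assumptions; match them to your launch command.

```python
import json
import urllib.request

# Assumed endpoint: adjust host/port to match your `python -m minisgl` invocation.
URL = "http://localhost:1919/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
except OSError as e:
    print(f"Server not reachable: {e}")
```

Any OpenAI-compatible SDK works the same way: point its `base_url` at the server and use the model name you launched with.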
### 4. Interactive Shell
Chat with your model directly in the terminal by adding the `--shell` flag.
```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

Inside the shell, type `/reset` to clear the chat history.
## Benchmark
### Offline inference
See [bench.py](./benchmark/offline/bench.py) for more details. Set `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` for an ablation study on overlap scheduling.
Test Configuration:
- Hardware: 1xH200 GPU.
- Model: Qwen3-0.6B, Qwen3-14B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100 and 1024 tokens
- Output Length: Randomly sampled between 100 and 1024 tokens
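The request mix above can be generated in a few lines. This is a sketch of the sampling scheme, not the actual `bench.py` logic.

```python
import random

random.seed(0)  # fix the seed for a reproducible request mix

NUM_REQUESTS = 256
requests = [
    {
        "input_len": random.randint(100, 1024),   # prompt tokens
        "output_len": random.randint(100, 1024),  # generated tokens
    }
    for _ in range(NUM_REQUESTS)
]

print(len(requests))  # -> 256
```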

### Online inference
See [benchmark_qwen.py](./benchmark/online/bench_qwen.py) for more details.
Test Configuration:
- Hardware: 4xH200 GPUs, connected via NVLink.
- Model: Qwen3-32B
- Dataset: [Qwen trace](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon/blob/main/qwen_traceA_blksz_16.jsonl), replaying the first 1000 requests.
Launch command:
```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive
# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
--disable-radix --port 1919 --decode-attention flashinfer
```
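Replaying the first 1000 records of the JSONL trace boils down to streaming the file line by line. The sketch below demonstrates this on a tiny synthetic file; the fields inside each record are whatever the trace defines.

```python
import json
import os
import tempfile

def load_trace(path: str, limit: int = 1000) -> list[dict]:
    """Read at most `limit` JSON records from a JSONL trace file."""
    records: list[dict] = []
    with open(path) as f:
        for line in f:
            if len(records) >= limit:
                break
            if line.strip():
                records.append(json.loads(line))
    return records

# Demo on a synthetic trace (the real file is qwen_traceA_blksz_16.jsonl):
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for i in range(5):
        f.write(json.dumps({"id": i}) + "\n")
    path = f.name
print(len(load_trace(path, limit=3)))  # -> 3
os.unlink(path)
```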
> **Note**: If you encounter network issues when downloading models from HuggingFace, try using `--model-source modelscope` to download from ModelScope instead:
> ```bash
> python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --model-source modelscope
> ```

## 📚 Learn More
- **[Detailed Features](./docs/features.md)**: Explore all available features and command-line arguments.
- **[System Architecture](./docs/structures.md)**: Dive deep into the design and data flow of Mini-SGLang.