# Mini-SGLang

A **lightweight yet high-performance** inference framework for Large Language Models.

---

Mini-SGLang is a compact implementation of [SGLang](https://github.com/sgl-project/sglang), designed to demystify the complexities of modern LLM serving systems. With a codebase of only **~5,000 lines of Python**, it serves as both a capable inference engine and a transparent reference for researchers and developers.

## ✨ Key Features

- **High Performance**: Achieves state-of-the-art throughput and latency with advanced optimizations.
- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- **Advanced Optimizations**:
  - **Radix Cache**: Reuses KV cache for shared prefixes across requests.
  - **Chunked Prefill**: Reduces peak memory usage for long-context serving.
  - **Overlap Scheduling**: Hides CPU scheduling overhead with GPU computation.
  - **Tensor Parallelism**: Scales inference across multiple GPUs.
  - **Optimized Kernels**: Integrates **FlashAttention** and **FlashInfer** for maximum efficiency.
  - ...

## 🚀 Quick Start

> **⚠️ Platform Support**: Mini-SGLang currently supports **Linux only** (x86_64 and aarch64). Windows and macOS are not supported due to dependencies on Linux-specific CUDA kernels (`sgl-kernel`, `flashinfer`). We recommend using [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) on Windows or Docker for cross-platform compatibility.

### 1. Environment Setup

We recommend using `uv` for a fast and reliable installation (note that `uv` does not conflict with `conda`).

```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```

**Prerequisites**: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure you have the **NVIDIA CUDA Toolkit** installed and that its version matches your driver's version. You can check your driver's CUDA capability with `nvidia-smi`.

### 2. Installation

Install Mini-SGLang directly from source:

```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
uv pip install -e .
```
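After installing, you can quickly confirm that the CUDA environment is sane before launching the server. Below is a minimal sanity check, assuming PyTorch was pulled in as a dependency; the CUDA version it reports should be compatible with the driver version shown by `nvidia-smi`:

```python
# cuda_check.py -- quick environment sanity check (illustrative helper, not part of the repo).
import torch

# The CUDA version PyTorch was built against; compare it with the
# driver-supported CUDA version reported by `nvidia-smi`.
print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```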
#### 💡 Installing on Windows (WSL2)

Since Mini-SGLang requires Linux-specific dependencies, Windows users should use WSL2:

1. **Install WSL2** (if not already installed):

   ```powershell
   # In PowerShell (as Administrator)
   wsl --install
   ```

2. **Install CUDA on WSL2**:
   - Follow [NVIDIA's WSL2 CUDA guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)
   - Ensure your Windows GPU drivers support WSL2

3. **Install Mini-SGLang in WSL2**:

   ```bash
   # Inside WSL2 terminal
   git clone https://github.com/sgl-project/mini-sglang.git
   cd mini-sglang && uv venv --python=3.12 && source .venv/bin/activate
   uv pip install -e .
   ```

4. **Access from Windows**: The server will be accessible at `http://localhost:8000` from Windows browsers and applications.
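To verify that a server running inside WSL2 is actually reachable from the Windows side, you can query the model-listing route from a Windows Python interpreter. This is a minimal sketch; it assumes the server is on port 8000 as above and, like most OpenAI-compatible servers, exposes the standard `/v1/models` endpoint:

```python
# check_server.py -- run on the Windows side after starting the server in WSL2.
import json
import urllib.request

# Assumed route: OpenAI-compatible servers conventionally list served models here.
URL = "http://localhost:8000/v1/models"

with urllib.request.urlopen(URL, timeout=10) as resp:
    print(json.dumps(json.load(resp), indent=2))
```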
#### 🐳 Running with Docker

**Prerequisites**:

- [Docker](https://docs.docker.com/get-docker/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

1. **Build the Docker image**:

   ```bash
   docker build -t minisgl .
   ```

2. **Run the server**:

   ```bash
   docker run --gpus all -p 1919:1919 \
     minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
   ```

3. **Run in interactive shell mode**:

   ```bash
   docker run -it --gpus all \
     minisgl --model Qwen/Qwen3-0.6B --shell
   ```

4. **Use Docker volumes for persistent caches** (recommended for faster subsequent startups):

   ```bash
   docker run --gpus all -p 1919:1919 \
     -v huggingface_cache:/app/.cache/huggingface \
     -v tvm_cache:/app/.cache/tvm-ffi \
     -v flashinfer_cache:/app/.cache/flashinfer \
     minisgl --model Qwen/Qwen3-0.6B --host 0.0.0.0
   ```
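Once a server is up (for example, via the Docker command above, which listens on port 1919), any OpenAI-compatible client can talk to it. Here is a minimal sketch using the official `openai` Python package (`pip install openai`); it assumes the API is mounted under the standard `/v1` prefix, the port and model name mirror the Docker example, and the API key is an arbitrary placeholder that local servers typically ignore:

```python
# chat_client.py -- send a chat request to a running Mini-SGLang server.
from openai import OpenAI

# Port 1919 matches the Docker example above; adjust to your launch command.
client = OpenAI(base_url="http://localhost:1919/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # must match the --model you served
    messages=[{"role": "user", "content": "Explain KV-cache reuse in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The same snippet works against the servers launched in the next section; just adjust `base_url` to the port you chose.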
### 3. Online Serving

Launch an OpenAI-compatible API server with a single command:

```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"

# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with tensor parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```

Once the server is running, you can send requests using standard tools like `curl` or any OpenAI-compatible client (see the Python client sketch above).

### 4. Interactive Shell

Chat with your model directly in the terminal by adding the `--shell` flag:

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

![shell-example](https://lmsys.org/images/blog/minisgl/shell.png)

You can also use `/reset` to clear the chat history.

## Benchmark

### Offline Inference

See [bench.py](./benchmark/offline/bench.py) for more details. Set `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` for an ablation study on overlap scheduling.

Test configuration:

- Hardware: 1x H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total requests: 256 sequences
- Input length: randomly sampled between 100-1024 tokens
- Output length: randomly sampled between 100-1024 tokens

![offline](https://lmsys.org/images/blog/minisgl/offline.png)

### Online Inference

See [bench_qwen.py](./benchmark/online/bench_qwen.py) for more details.

Test configuration:

- Hardware: 4x H200 GPUs, connected by NVLink
- Model: Qwen3-32B
- Dataset: [Qwen trace](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon/blob/main/qwen_traceA_blksz_16.jsonl), replaying the first 1000 requests

Launch commands:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
  --disable-radix --port 1919 --decode-attention flashinfer
```

> **Note**: If you encounter network issues when downloading models from HuggingFace, try using `--model-source modelscope` to download from ModelScope instead:
> ```bash
> python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --model-source modelscope
> ```

![online](https://lmsys.org/images/blog/minisgl/online.png)

## 📚 Learn More

- **[Detailed Features](./docs/features.md)**: Explore all available features and command-line arguments.
- **[System Architecture](./docs/structures.md)**: Dive deep into the design and data flow of Mini-SGLang.