# FastLLM

A minimal, single-file LLM server launcher in ~100 lines of Python. It exposes an OpenAI-compatible API so you can quickly test and develop LLM-based applications locally, which makes it especially handy for prototyping and learning.

## Features

- 🚀 **Simple**: Launch an OpenAI-compatible LLM API server with a single command
- 📦 **Flexible**: Works with both local models and models from HuggingFace
- ⚡ **Fast**: Includes optimization options for faster loading and inference
- 🔄 **Interruptible**: Clean Ctrl+C handling for graceful shutdown
- 🌐 **HF Mirror**: Built-in support for a HuggingFace mirror (faster downloads in some regions)

## Requirements

- Python 3.8+
- vLLM installed (`pip install vllm`)
- GPU with CUDA support (recommended)

## Usage

### Basic Usage

```bash
# Start an LLM server with a local model
python llm_server.py ./models/my-local-model

# Download and run a model from HuggingFace
python llm_server.py Qwen/Qwen2-7B-Instruct

# Download without starting the server
python llm_server.py Qwen/Qwen2-7B-Instruct --download-only
```

### Optimization Options

```bash
# Run with half precision for faster loading on low-end GPUs
python llm_server.py Qwen/Qwen2-7B-Instruct --dtype half

# Use quantization to reduce memory usage
python llm_server.py Qwen/Qwen2-7B-Instruct --quantization awq

# Use the safetensors format for faster loading
python llm_server.py Qwen/Qwen2-7B-Instruct --load-format safetensors

# Use multiple GPUs
python llm_server.py Qwen/Qwen2-7B-Instruct --gpu-count 2
```

### Server Configuration

```bash
# Change host and port
python llm_server.py Qwen/Qwen2-7B-Instruct --host 0.0.0.0 --port 8080
```

## API Usage

Once the server is running, you can use it as an OpenAI-compatible API endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "temperature": 0.7
  }'
```

Or with Python:

```python
from openai import OpenAI

# The local server does not check the key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke"}]
)
print(response.choices[0].message.content)
```

## How It Works

This tool is a thin wrapper around vLLM's `serve` command that adds a few conveniences:

1. Automatic model downloading from HuggingFace
2. Simple configuration of optimization parameters
3. Clean interruption handling
4. Helpful status messages

All in less than 100 lines of core code! A rough sketch of this structure appears in the appendix below.

## 📝 License

MIT License

Copyright (c) 2024 samzong

See the [LICENSE](LICENSE) file for details.
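
## Appendix: Launcher Sketch

For readers curious about the structure described in *How It Works*, here is a minimal sketch of a launcher in the same spirit. This is an illustration, not the repository's actual `llm_server.py`: the flag names mirror the usage examples above, `snapshot_download` comes from the real `huggingface_hub` package, and using the `HF_ENDPOINT` environment variable is one assumed way the built-in mirror support could be wired up.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a minimal vLLM launcher (not the actual llm_server.py)."""
import argparse
import os
import subprocess
import sys


def main() -> None:
    parser = argparse.ArgumentParser(description="Minimal vLLM server launcher")
    parser.add_argument("model", help="Local path or HuggingFace repo id")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--dtype", default="auto")
    parser.add_argument("--gpu-count", type=int, default=1)
    parser.add_argument("--download-only", action="store_true")
    args = parser.parse_args()

    model = args.model
    # If the model is not a local directory, fetch it from HuggingFace first.
    # Setting HF_ENDPOINT (e.g. to a mirror) redirects huggingface_hub downloads.
    if not os.path.isdir(model):
        from huggingface_hub import snapshot_download

        endpoint = os.environ.get("HF_ENDPOINT", "https://huggingface.co")
        print(f"Downloading {model} from {endpoint} ...")
        model = snapshot_download(repo_id=model)

    if args.download_only:
        print(f"Model available at {model}")
        return

    # Delegate serving to vLLM's OpenAI-compatible server.
    cmd = [
        "vllm", "serve", model,
        "--host", args.host,
        "--port", str(args.port),
        "--dtype", args.dtype,
        "--tensor-parallel-size", str(args.gpu_count),
    ]
    try:
        subprocess.run(cmd, check=True)
    except KeyboardInterrupt:
        # Clean Ctrl+C handling: exit quietly instead of printing a traceback.
        print("\nServer stopped.")
        sys.exit(0)


if __name__ == "__main__":
    main()
```

Delegating to `vllm serve` as a subprocess is what keeps a wrapper like this small: vLLM owns the model loading and the OpenAI-compatible HTTP layer, while the launcher only handles downloads, flag translation, and shutdown.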