# vllm-playground

**Repository Path**: cdpro/vllm-playground

## Basic Information

- **Project Name**: vllm-playground
- **Description**: No description available
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-06
- **Last Updated**: 2026-01-06

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# vLLM Playground

A modern web interface for managing and interacting with vLLM servers (www.github.com/vllm-project/vllm). Supports both GPU and CPU modes, with special optimizations for macOS Apple Silicon and enterprise deployment on OpenShift/Kubernetes.

### ✨ New UI with Tool Calling Support

![vLLM Playground Interface](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-newUI.png)

### ✨ New UI with Structured Outputs Support

![vLLM Playground with Structured Outputs](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-structured-outputs.png)

### ✨ New UI Enhancements

- **🎨 Modern Dark Theme**: Sleek, professional interface with improved visual hierarchy and contrast
- **💬 Streamlined Chat Interface**: Clean, distraction-free chat UI with inline expandable panels
- **🔧 Icon Toolbar**: Compact icon bar for quick access to advanced features (ChatGPT-style)
  - ⚙️ **Chat Settings**: Temperature and max tokens configuration
  - 💬 **System Prompt**: Customizable with 8 preset templates (Helpful, Coder, Writer, Teacher, etc.)
- πŸ—οΈ **Structured Outputs**: Constrain model responses to Choice, Regex, JSON Schema, or Grammar - πŸ”§ **Tool Calling/Function Calling**: Define custom tools with parameters for function calling - πŸ”— **MCP Servers**: Model Context Protocol integration *(Coming Soon)* - βž• **RAG**: Retrieval-Augmented Generation support *(Coming Soon)* ## πŸ“¦ Quick Install via PyPI ```bash # Basic installation pip install vllm-playground # With GuideLLM benchmarking support pip install vllm-playground[benchmark] # First time? Pre-download the container image (~10GB for GPU) vllm-playground pull # Start the playground vllm-playground ``` Open http://localhost:7860 in your browser - that's it! πŸš€ > πŸ’‘ **Tip**: The `vllm-playground pull` command pre-downloads the large container image with progress display, so you don't have to wait during server startup! ![Pre-pull Container Image](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-pull.png) *Pre-download the ~10GB GPU container image with progress display - no more waiting during server startup!* ## πŸ“¦ New: PyPI Package **One-command installation!** vLLM Playground is now available on PyPI for easy distribution and installation. ```bash pip install vllm-playground vllm-playground ``` **Key Benefits:** - βœ… **Simple Installation**: Just `pip install` - no cloning required - βœ… **CLI Support**: `vllm-playground` command with options - βœ… **Pre-Pull Images**: `vllm-playground pull` downloads container images with progress display - βœ… **Auto GPU/CPU**: Automatically uses sudo for GPU containers, rootless for CPU - βœ… **Optional Extras**: Install with `[benchmark]` for GuideLLM support - βœ… **Easy Updates**: `pip install --upgrade vllm-playground` --- ## 🐳 New: Containerized vLLM Service **No more manual vLLM installation!** The Web UI now automatically manages vLLM in isolated containers, providing a seamless experience from local development to enterprise deployment. 
**📹 Watch Demo: Automatic Container Startup**

![Start vLLM Demo](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/start-vllm.gif)
*See how easy it is: Just click "Start Server" and the container orchestrator automatically starts the vLLM container - no manual installation or configuration needed!*

**📹 Watch Demo: Automatic Container Shutdown**

![Stop vLLM Demo](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/stop-vllm.gif)
*Clean shutdown: Click "Stop Server" and the container orchestrator gracefully stops the vLLM container with automatic cleanup!*

**Key Benefits:**
- ✅ **Zero Setup**: No vLLM installation required - containers handle everything
- ✅ **Isolated Environment**: vLLM runs in its own container, preventing conflicts
- ✅ **Smart Management**: Automatic container lifecycle (start, stop, logs, health checks)
- ✅ **Fast Restarts**: Configuration caching for quick server restarts
- ✅ **Hybrid Architecture**: Same UI works locally (Podman) and in cloud (Kubernetes)

**Architecture:**
- **Local Development**: Podman-based container orchestration
- **Enterprise Deployment**: OpenShift/Kubernetes with dynamic pod creation
- **Container Manager**: Automatic lifecycle management with smart reuse

## 📊 New: GuideLLM Benchmarking

Integrated GuideLLM for comprehensive performance benchmarking and analysis. Run load tests and get detailed metrics on throughput, latency, and token generation performance!

![GuideLLM Benchmark Results](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/guidellm.png)

## 📚 New: vLLM Community Recipes

**One-click model configurations from the official [vLLM Recipes Repository](https://github.com/vllm-project/recipes)!** Browse community-maintained configurations for popular models like DeepSeek, Qwen, Llama, Mistral, and more.
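The searchable catalog can be pictured as a simple filter over JSON entries. A minimal sketch, assuming a simplified entry shape — the real `recipes_catalog.json` schema may differ:

```python
import json

# Assumed, simplified catalog entries for illustration only.
CATALOG = [
    {"model": "Qwen/Qwen2.5-7B-Instruct", "category": "Qwen",
     "tags": ["multi-gpu", "reasoning"]},
    {"model": "meta-llama/Llama-3.2-1B-Instruct", "category": "Llama",
     "tags": ["cpu-friendly"]},
]

def filter_recipes(catalog, *, tag=None, category=None):
    """Return catalog entries matching an optional tag and/or category."""
    return [
        r for r in catalog
        if (tag is None or tag in r["tags"])
        and (category is None or r["category"] == category)
    ]

print(json.dumps(filter_recipes(CATALOG, tag="cpu-friendly"), indent=2))
```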
![vLLM Recipes Browser](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-recipes-1.png)
*Browse 17+ model categories with optimized configurations - just click "Load Config" to auto-fill all settings!*

![vLLM Recipes Details](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-recipes-2.png)
*Each recipe includes hardware requirements, vLLM parameters, and direct links to documentation.*

**Key Features:**
- ✅ **One-Click Configuration**: Load optimized vLLM settings instantly
- ✅ **Community-Maintained**: Syncs with official vLLM recipes repository
- ✅ **Searchable Catalog**: Filter by model name, category, or tags (multi-gpu, vision, reasoning, etc.)
- ✅ **Hardware Guidance**: See recommended GPU configurations for each model
- ✅ **Custom Recipes**: Add, edit, or delete your own recipes
- ✅ **GitHub Sync**: Update catalog from GitHub with optional token for higher rate limits

**Supported Model Families:** DeepSeek, Qwen, Llama, Mistral, InternVL, GLM, NVIDIA Nemotron, Moonshot AI (Kimi), MiniMax, Jina AI, Tencent Hunyuan, Ernie, OpenAI, PaddlePaddle, Seed, inclusionAI, and CPU-friendly models.

## 🔧 Model Compression

**Looking for model compression and quantization?** Check out the separate **[LLMCompressor Playground](https://github.com/micytao/llmcompressor-playground)** project for:

- Model quantization (INT8, INT4, FP8)
- GPTQ, AWQ, and SmoothQuant algorithms
- Built-in compression presets
- Integration with vLLM

This keeps the vLLM Playground focused on serving and benchmarking, while providing a dedicated tool for model optimization.
## πŸ“ Project Structure ``` vllm-playground/ β”œβ”€β”€ app.py # Main FastAPI backend application β”œβ”€β”€ run.py # Backend server launcher β”œβ”€β”€ container_manager.py # πŸ†• Podman-based container orchestration (local) β”œβ”€β”€ index.html # Main HTML interface β”œβ”€β”€ pyproject.toml # πŸ†• PyPI package configuration β”œβ”€β”€ requirements.txt # Python dependencies β”œβ”€β”€ env.example # Example environment variables β”œβ”€β”€ LICENSE # MIT License β”œβ”€β”€ README.md # This file β”‚ β”œβ”€β”€ vllm_playground/ # πŸ†• PyPI package source β”‚ β”œβ”€β”€ __init__.py # Package version and exports β”‚ β”œβ”€β”€ app.py # FastAPI application β”‚ β”œβ”€β”€ cli.py # CLI entry point β”‚ β”œβ”€β”€ container_manager.py # Container orchestration β”‚ └── ... # Static assets and templates β”‚ β”œβ”€β”€ containers/ # Container definitions 🐳 β”‚ β”œβ”€β”€ Containerfile.vllm-playground # πŸ†• Web UI container (orchestrator) β”‚ β”œβ”€β”€ Containerfile.mac # πŸ†• vLLM service container (macOS/CPU) β”‚ └── README.md # Container variants documentation β”‚ β”œβ”€β”€ openshift/ # πŸ†• OpenShift/Kubernetes deployment ☸️ β”‚ β”œβ”€β”€ kubernetes_container_manager.py # K8s API-based orchestration β”‚ β”œβ”€β”€ Containerfile # Web UI container for OpenShift β”‚ β”œβ”€β”€ requirements-k8s.txt # Python dependencies (with K8s client) β”‚ β”œβ”€β”€ deploy.sh # Automated deployment (CPU/GPU) β”‚ β”œβ”€β”€ undeploy.sh # Automated undeployment β”‚ β”œβ”€β”€ build.sh # Container build script β”‚ β”œβ”€β”€ manifests/ # Kubernetes manifests β”‚ β”‚ β”œβ”€β”€ 00-secrets-template.yaml β”‚ β”‚ β”œβ”€β”€ 01-namespace.yaml β”‚ β”‚ β”œβ”€β”€ 02-rbac.yaml β”‚ β”‚ β”œβ”€β”€ 03-configmap.yaml β”‚ β”‚ β”œβ”€β”€ 04-webui-deployment.yaml β”‚ β”‚ └── 05-pvc-optional.yaml β”‚ β”œβ”€β”€ README.md # Architecture overview β”‚ └── QUICK_START.md # Quick deployment guide β”‚ β”œβ”€β”€ deployments/ # Legacy deployment scripts β”‚ β”œβ”€β”€ kubernetes-deployment.yaml β”‚ β”œβ”€β”€ openshift-deployment.yaml β”‚ └── 
deploy-to-openshift.sh β”‚ β”œβ”€β”€ static/ # Frontend assets β”‚ β”œβ”€β”€ css/ β”‚ β”‚ └── style.css # Main stylesheet β”‚ └── js/ β”‚ └── app.js # Frontend JavaScript β”‚ β”œβ”€β”€ scripts/ # Utility scripts β”‚ β”œβ”€β”€ run_cpu.sh # Start vLLM in CPU mode (macOS compatible) β”‚ β”œβ”€β”€ start.sh # General start script β”‚ β”œβ”€β”€ install.sh # Installation script β”‚ β”œβ”€β”€ verify_setup.py # Setup verification β”‚ β”œβ”€β”€ kill_playground.py # Kill running playground instances β”‚ └── restart_playground.sh # Restart playground β”‚ β”œβ”€β”€ config/ # Configuration files β”‚ β”œβ”€β”€ vllm_cpu.env # CPU mode environment variables β”‚ └── example_configs.json # Example configurations β”‚ β”œβ”€β”€ cli_demo/ # πŸ†• Command-line demo workflow β”‚ β”œβ”€β”€ scripts/ # Demo shell scripts β”‚ └── docs/ # Demo documentation β”‚ β”œβ”€β”€ recipes/ # πŸ†• vLLM Community Recipes πŸ“š β”‚ β”œβ”€β”€ recipes_catalog.json # Model configurations catalog β”‚ └── sync_recipes.py # GitHub sync script β”‚ β”œβ”€β”€ assets/ # Images and assets β”‚ β”œβ”€β”€ vllm-playground.png # WebUI screenshot β”‚ β”œβ”€β”€ guidellm.png # GuideLLM benchmark results screenshot β”‚ β”œβ”€β”€ vllm-recipes-1.png # πŸ†• Recipes browser screenshot β”‚ β”œβ”€β”€ vllm-recipes-2.png # πŸ†• Recipes details screenshot β”‚ β”œβ”€β”€ vllm.png # vLLM logo β”‚ └── vllm_only.png # vLLM logo (alternate) β”‚ └── docs/ # Documentation β”œβ”€β”€ QUICKSTART.md # Quick start guide β”œβ”€β”€ MACOS_CPU_GUIDE.md # macOS CPU setup guide β”œβ”€β”€ CPU_MODELS_QUICKSTART.md # CPU-optimized models guide β”œβ”€β”€ GATED_MODELS_GUIDE.md # Guide for accessing Llama, Gemma, etc. 
    ├── TROUBLESHOOTING.md        # Common issues and solutions
    ├── FEATURES.md               # Feature documentation
    ├── PERFORMANCE_METRICS.md    # Performance metrics
    └── QUICK_REFERENCE.md        # Command reference
```

## 🚀 Quick Start

### 📦 Option 1: PyPI Installation (Easiest)

Install and run with a single command:

```bash
# Install from PyPI
pip install vllm-playground

# Or with benchmarking support
pip install vllm-playground[benchmark]

# Start the playground
vllm-playground
```

Open http://localhost:7860 and click "Start Server" - vLLM container starts automatically!

**CLI Options:**

```bash
vllm-playground --help            # Show all options
vllm-playground pull              # Pre-download GPU image (~10GB) with progress
vllm-playground pull --cpu        # Pre-download CPU image
vllm-playground pull --all        # Pre-download all images
vllm-playground --port 8080       # Use custom port
vllm-playground --host localhost  # Bind to localhost only
vllm-playground stop              # Stop running instance
vllm-playground status            # Check if running
```

---

### 🐳 Option 2: Container Orchestration (From Source)

For development or customization:

```bash
# 1. Clone the repository
git clone https://github.com/micytao/vllm-playground.git
cd vllm-playground

# 2. Install Podman (if not already installed)
# macOS: brew install podman
# Linux: dnf install podman or apt install podman

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Start the Web UI
python run.py

# 5. Open http://localhost:7860

# 6. Click "Start Server" - vLLM container starts automatically!
```

**✨ Benefits:**
- ✅ No vLLM installation required
- ✅ Automatic container lifecycle management
- ✅ Isolated vLLM environment
- ✅ Same UI works locally and on OpenShift/Kubernetes

**How it works:**
- Web UI runs on your host
- vLLM runs in an isolated container
- Container manager (`container_manager.py`) orchestrates everything

**Note:** The Web UI will automatically pull and start the vLLM container when you click "Start Server"

---

### ☸️ Option 3: OpenShift/Kubernetes Deployment

Deploy the entire stack to OpenShift or Kubernetes with dynamic pod management:

```bash
# 1. Build and push Web UI container
cd openshift/
podman build -f Containerfile -t your-registry/vllm-playground:latest .
podman push your-registry/vllm-playground:latest

# 2. Deploy to cluster (GPU or CPU mode)
./deploy.sh --gpu   # For GPU clusters
./deploy.sh --cpu   # For CPU-only clusters

# 3. Get the URL
oc get route vllm-playground -n vllm-playground
```

**✨ Benefits:**
- ✅ Enterprise-grade deployment
- ✅ Dynamic vLLM pod creation via Kubernetes API
- ✅ Same UI and workflow as local setup
- ✅ Auto-scaling and resource management

**📖 See [openshift/README.md](openshift/README.md)** and **[openshift/QUICK_START.md](openshift/QUICK_START.md)** for detailed instructions.

---

### 💻 Option 4: Local Installation (Traditional)

For local development without containers:

#### 1. Install vLLM

```bash
# For macOS/CPU mode
pip install vllm
```

#### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

#### 3. Start the WebUI

```bash
python run.py
```

Then open http://localhost:7860 in your browser.

#### 4. Start vLLM Server

**Option A: Using the WebUI**
- Select CPU or GPU mode
- Click "Start Server"

**Option B: Using the script (macOS/CPU)**

```bash
./scripts/run_cpu.sh
```

## ☸️ OpenShift/Kubernetes Deployment

Deploy vLLM Playground to enterprise Kubernetes/OpenShift clusters with dynamic pod management:

**Features:**
- ✅ Dynamic vLLM pod creation via Kubernetes API
- ✅ GPU and CPU mode support with Red Hat images
- ✅ RBAC-based security model
- ✅ Automated deployment scripts
- ✅ Same UI and workflow as local setup

**Quick Deploy:**

```bash
cd openshift/
./deploy.sh --gpu   # For GPU clusters
./deploy.sh --cpu   # For CPU-only clusters
```

**📖 Full Documentation:** See [openshift/README.md](openshift/README.md) and [openshift/QUICK_START.md](openshift/QUICK_START.md)

---

## 💻 macOS Apple Silicon Support

For macOS users, vLLM runs in CPU mode using containerization:

**Container Mode (Recommended):**

```bash
# Just start the Web UI - it handles containers automatically
python run.py
# Click "Start Server" in the UI
```

**Direct Mode:**

```bash
# Edit CPU configuration
nano config/vllm_cpu.env

# Run vLLM directly
./scripts/run_cpu.sh
```

**📖 See [docs/MACOS_CPU_GUIDE.md](docs/MACOS_CPU_GUIDE.md)** for detailed setup.

## ✨ Features

- **💬 Modern Chat Interface**: Streamlined, ChatGPT-style chat experience 🆕
  - Clean, distraction-free interface with inline expandable panels
  - Icon toolbar for quick access to advanced features
  - System prompt templates (8 presets: Helpful, Coder, Writer, Teacher, etc.)
  - Real-time response metrics and token counting
- **🏗️ Structured Outputs**: Constrain model responses to specific formats 🆕
  - **Choice**: Force output to one of specific values (sentiment, yes/no, etc.)
  - **Regex**: Match output to regex patterns (email, phone, date formats)
  - **JSON Schema**: Generate valid JSON matching your schema
  - **Grammar (EBNF)**: Define complex output structures

  ![Structured Outputs](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-structured-outputs.png)
  *Structured Outputs with Choice mode for sentiment analysis - responses constrained to "positive", "negative", or "neutral"*

- **🔧 Tool Calling / Function Calling**: Define custom tools for the model 🆕
  - Server-side configuration: Enable in Server Configuration panel before starting
  - Auto-detected parsers: Llama 3.x, Mistral, Hermes, Qwen, Granite, InternLM
  - Create tools with name, description, and parameters
  - Preset tools (Weather, Calculator, Search)
  - Parallel tool calls support
  - Per-request tool_choice control (none/auto)
- **🔗 MCP Server Integration**: Model Context Protocol support *(Coming Soon)* 🆕
- **➕ RAG Support**: Retrieval-Augmented Generation *(Coming Soon)* 🆕
- **🐳 Container Orchestration**: Automatic vLLM container lifecycle management
  - Local development: Podman-based orchestration
  - Enterprise deployment: Kubernetes API-based orchestration
  - Seamless switching between local and cloud environments
  - Smart container reuse (fast restarts with same config)
  - Unified CLI args: All container images now use the same interface as official vLLM 🆕
- **☸️ OpenShift/Kubernetes Deployment**: Production-ready cloud deployment 🆕
  - Dynamic pod creation via Kubernetes API
  - CPU and GPU mode support
  - RBAC-based security
  - Automated deployment scripts
- **🎯 Intelligent Hardware Detection**: Automatic GPU availability detection 🆕
  - Kubernetes-native: Queries cluster nodes for `nvidia.com/gpu` resources
  - Automatic UI adaptation: GPU mode enabled/disabled based on availability
  - No nvidia-smi required: Uses Kubernetes API for detection
  - Fallback support: nvidia-smi detection for local environments
- **Performance Benchmarking**: GuideLLM integration for comprehensive load testing with detailed metrics
  - Request statistics (success rate, duration, avg times)
  - Token throughput analysis (mean/median tokens per second)
  - Latency percentiles (P50, P75, P90, P95, P99)
  - Configurable load patterns and request rates
- **📚 vLLM Community Recipes**: One-click model configurations 🆕
  - Browse 17+ model categories from official vLLM recipes
  - One-click configuration loading for optimized settings
  - Searchable by model name, category, or tags
  - Add, edit, or sync custom recipes
  - Hardware requirements and documentation links
- **Server Management**: Start/stop vLLM servers from the UI
- **Chat Interface**: Interactive chat with streaming responses
- **Smart Chat Templates**: Automatic model-specific template detection
- **Performance Metrics**: Real-time token counts and generation speed
- **Model Support**: Pre-configured popular models + custom model support
- **Gated Model Access**: Built-in HuggingFace token support for Llama, Gemma, etc.
- **CPU & GPU Modes**: Automatic detection and configuration
- **macOS Optimized**: Special support for Apple Silicon
- **Resizable Panels**: Customizable layout
- **Command Preview**: See exact commands before execution

## 📖 Documentation

### Getting Started

- **[Quick Start Guide](docs/QUICKSTART.md)** - Get up and running in minutes
- **[Command-Line Demo Guide](cli_demo/docs/CLI_DEMO_GUIDE.md)** - Full workflow demo with vLLM & GuideLLM
- [macOS CPU Setup](docs/MACOS_CPU_GUIDE.md) - Apple Silicon optimization guide
- [CPU Models Quickstart](docs/CPU_MODELS_QUICKSTART.md) - Best models for CPU

### Container & Deployment

- **[OpenShift/Kubernetes Deployment](openshift/README.md)** ☸️ - Enterprise deployment guide 🆕
- **[OpenShift Quick Start](openshift/QUICK_START.md)** - 5-minute deployment 🆕
- **[Container Variants](containers/README.md)** 🐳 - Local container setup
- [Legacy Deployment Scripts](deployments/README.md) - Kubernetes manifests

### Model Configuration

- **[Gated Models Guide (Llama, Gemma)](docs/GATED_MODELS_GUIDE.md)** ⭐ - Access restricted models

### Reference

- [Feature Overview](docs/FEATURES.md) - Complete feature list
- [Performance Metrics](docs/PERFORMANCE_METRICS.md) - Benchmarking and metrics
- [Command Reference](docs/QUICK_REFERENCE.md) - Command cheat sheet
- [CLI Quick Reference](cli_demo/docs/CLI_QUICK_REFERENCE.md) - Command-line demo quick reference
- [Troubleshooting](docs/TROUBLESHOOTING.md) - Common issues and solutions

## 🔧 Configuration

### CPU Mode (macOS)

Edit `config/vllm_cpu.env`:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=auto
```

### Tool Calling Configuration 🆕

Tool calling enables models to use functions/tools you define. This is a **server-side feature** that must be enabled before starting the server.

**How to Enable:**

1. Check "Enable Tool Calling" in the **Server Configuration** panel
2. Select a Tool Call Parser (or leave on "Auto-detect")
3. Start the server
4. Define tools in the **Tool Calling** panel (🔧 icon in toolbar)

**Supported Models & Parsers:**

| Model Family | Parser | Example Models |
|--------------|--------|----------------|
| Llama 3.x | `llama3_json` | Llama-3.2-1B-Instruct, Llama-3.1-8B-Instruct |
| Mistral | `mistral` | Mistral-7B-Instruct-v0.3, Mixtral-8x7B |
| Hermes | `hermes` | Hermes-2-Pro, Hermes-3 |
| Qwen | `hermes` | Qwen2.5-7B-Instruct, Qwen2-VL |
| Granite | `granite-20b-fc` | granite-20b-functioncalling |
| InternLM | `internlm` | InternLM2.5-7B-Chat |

**Per-Request Options:**
- **Tool Choice**: `none` (disable) or `auto` (let model decide)
- **Tools**: Define in the Tool Calling panel

**Note:** Tool calling adds `--enable-auto-tool-choice --tool-call-parser <parser>` to the vLLM startup command.

### Supported Models

**CPU-Optimized Models (Recommended for macOS):**
- **TinyLlama/TinyLlama-1.1B-Chat-v1.0** (default) - Fast, no token required
- **meta-llama/Llama-3.2-1B** - Latest Llama, requires HF token (gated)
- **google/gemma-2-2b** - High quality, requires HF token (gated)
- facebook/opt-125m - Tiny test model

**Larger Models (Slow on CPU, better on GPU):**
- meta-llama/Llama-2-7b-chat-hf (requires HF token)
- mistralai/Mistral-7B-Instruct-v0.2
- Custom models via text input

**📌 Note**: Gated models (Llama, Gemma) require a HuggingFace token. See [Gated Models Guide](docs/GATED_MODELS_GUIDE.md) for setup.
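Once the server is running with tool calling enabled, tools are supplied per request through the OpenAI-compatible chat-completions API. A minimal sketch of such a request payload — the model name is an assumption, and the `get_weather` tool is an illustrative example in the spirit of the Weather preset:

```python
import json

def build_tool_request(model: str, user_message: str) -> dict:
    """Build a /v1/chat/completions payload with one example tool defined."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative tool, not a built-in
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # or "none" to disable tools for this request
    }

payload = build_tool_request("meta-llama/Llama-3.2-1B-Instruct",
                             "What's the weather in Paris?")
print(json.dumps(payload, indent=2))
# POST this JSON to the running server, e.g. http://localhost:8000/v1/chat/completions
```

If the model decides to call the tool, the response's `message.tool_calls` carries the function name and JSON arguments for your code to execute.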
## πŸ› οΈ Development ### Architecture The project uses a **hybrid architecture** that works seamlessly in both local and cloud environments: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Web UI (FastAPI) β”‚ β”‚ app.py + index.html + static/ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”œβ”€β†’ container_manager.py (Local) β”‚ └─→ Podman CLI β”‚ └─→ vLLM Container β”‚ └─→ kubernetes_container_manager.py (Cloud) └─→ Kubernetes API └─→ vLLM Pods ``` **Key Components:** - **Backend**: FastAPI (`app.py`) - **Container Manager (Local)**: Podman orchestration (`container_manager.py`) - **Container Manager (K8s)**: Kubernetes API orchestration (`openshift/kubernetes_container_manager.py`) - **Frontend**: Vanilla JavaScript (`static/js/app.js`) - **Styling**: Custom CSS (`static/css/style.css`) - **Scripts**: Bash scripts in `scripts/` - **Config**: Environment files in `config/` ### Running in Development ```bash # Start backend with auto-reload uvicorn app:app --reload --port 7860 # Or use the run script python run.py ``` ### Container Development ```bash # Build vLLM service container (macOS/CPU ARM64) podman build -f containers/Containerfile.mac -t vllm-mac:v0.11.0 . # Build vLLM service container (Linux x86_64 CPU) podman build -f containers/Containerfile.cpu -t vllm-cpu:v0.11.0 . # Build Web UI orchestrator container podman build -f containers/Containerfile.vllm-playground -t vllm-playground:latest . # Build OpenShift Web UI container podman build -f openshift/Containerfile -t vllm-playground-webui:latest . 
```

**Container Architecture:** All custom container images now use the same CLI argument pattern as the official vLLM image:

```bash
# All images accept vLLM CLI args directly
podman run vllm-mac:v0.11.0 \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json
```

## 📝 License

MIT License - See [LICENSE](LICENSE) file for details

## 🤝 Contributing

Contributions welcome! Please feel free to submit issues and pull requests.

## 🔗 Links

- [vLLM Official Documentation](https://docs.vllm.ai/)
- [vLLM CPU Mode Guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html)
- [vLLM GitHub](https://github.com/vllm-project/vllm)
- **[LLMCompressor Playground](https://github.com/micytao/llmcompressor-playground)** - Separate project for model compression and quantization
- [GuideLLM](https://github.com/neuralmagic/guidellm) - Performance benchmarking tool

## 🏗️ Architecture Overview

### Local Development (Container Orchestration)

```
┌──────────────────┐
│   User Browser   │
└────────┬─────────┘
         │ http://localhost:7860
         ↓
┌──────────────────┐
│  Web UI (Host)   │  ← FastAPI app
│     app.py       │
└────────┬─────────┘
         │ Podman CLI
         ↓
┌──────────────────┐
│ container_manager│  ← Podman orchestration
│      .py         │
└────────┬─────────┘
         │ podman run/stop
         ↓
┌──────────────────┐
│  vLLM Container  │  ← Isolated vLLM service
│   (Port 8000)    │
└──────────────────┘
```

### OpenShift/Kubernetes Deployment

```
┌──────────────────┐
│   User Browser   │
└────────┬─────────┘
         │ https://route-url
         ↓
┌──────────────────┐
│ OpenShift Route  │
└────────┬─────────┘
         ↓
┌──────────────────┐
│   Web UI Pod     │  ← FastAPI app in container
│  (Deployment)    │  ← Auto-detects GPU availability
└────────┬─────────┘
         │ Kubernetes API
         │ (reads nodes, creates/deletes pods)
         ↓
┌──────────────────┐
│  kubernetes_     │  ← K8s API orchestration
│  container_      │  ← Checks nvidia.com/gpu resources
│  manager.py      │
└────────┬─────────┘
         │ create/delete pods
         ↓
┌──────────────────┐
│    vLLM Pod      │  ← Dynamically created
│   (Dynamic)      │  ← GPU: Official vLLM image
│                  │  ← CPU: Self-built optimized image
└──────────────────┘
```

**Container Images:**
- **GPU Mode**: Official vLLM image (`vllm/vllm-openai:v0.11.0`)
- **CPU Mode (Linux x86)**: Self-built optimized image (`quay.io/rh_ee_micyang/vllm-cpu:v0.11.0`)
- **CPU Mode (macOS ARM64)**: Self-built optimized image (`quay.io/rh_ee_micyang/vllm-mac:v0.11.0`) 🆕

**Key Features:**
- Same UI code works in both environments
- Container manager is swapped at build time (Podman → Kubernetes)
- Identical user experience locally and in the cloud
- Smart container/pod lifecycle management
- **Automatic GPU detection**: UI adapts based on cluster hardware
  - Kubernetes-native: Queries nodes for `nvidia.com/gpu` resources
  - Automatic mode selection: GPU mode disabled if no GPUs available
  - RBAC-secured: Requires node read permissions (automatically configured)
- No registry authentication needed (all images are publicly accessible)

---

## 🆘 Troubleshooting

### Container-Related Issues

#### Container Won't Start

```bash
# Check if Podman is installed
podman --version

# Check Podman connectivity
podman ps

# View container logs
podman logs vllm-service
```

#### "Address Already in Use" Error

If you lose connection to the Web UI and get `ERROR: address already in use`:

```bash
# Quick Fix: Auto-detect and kill old process
python run.py

# Alternative: Manual restart
./scripts/restart_playground.sh

# Or kill manually
python scripts/kill_playground.py
```

#### vLLM Container Issues

```bash
# Check if container is running
podman ps -a | grep vllm-service

# View vLLM logs
podman logs -f vllm-service

# Stop and remove container
podman stop vllm-service && podman rm vllm-service

# Pull latest vLLM images
podman pull quay.io/rh_ee_micyang/vllm-mac:v0.11.0   # macOS ARM64
podman pull quay.io/rh_ee_micyang/vllm-cpu:v0.11.0   # Linux x86_64
podman pull vllm/vllm-openai:v0.11.0                 # GPU (official)
```

#### Tool Calling Not Working

Tool calling requires **server-side configuration**. If tools aren't being called:

1. **Verify server was started with tool calling enabled:**
   - Check "Enable Tool Calling" in Server Configuration BEFORE starting
   - Look for this in startup logs: `Tool calling enabled with parser: llama3_json`
2. **Verify CLI args are passed:**
   ```
   vLLM arguments: --model ... --enable-auto-tool-choice --tool-call-parser llama3_json
   ```
3. **If using container mode**, ensure the container was started fresh after enabling tool calling (stop and restart if needed)

### OpenShift/Kubernetes Issues

#### GPU Mode Not Available

The Web UI automatically detects GPU availability by querying Kubernetes nodes for `nvidia.com/gpu` resources. If GPU mode is disabled in the UI:

**Check GPU availability in your cluster:**

```bash
# List nodes with GPU capacity
oc get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu

# Or check all node details
oc describe nodes | grep nvidia.com/gpu
```

**If GPUs exist but are not detected:**

1. Verify RBAC permissions:
   ```bash
   # Check if service account has node read permissions
   oc auth can-i list nodes --as=system:serviceaccount:vllm-playground:vllm-playground-sa
   # Should return "yes"
   ```
2. Reapply RBAC if needed:
   ```bash
   oc apply -f openshift/manifests/02-rbac.yaml
   ```
3. Check Web UI logs for detection errors:
   ```bash
   oc logs -f deployment/vllm-playground-cpu -n vllm-playground | grep -i gpu
   ```

**Expected behavior:**
- **GPU available**: Both CPU and GPU modes enabled in UI
- **No GPU**: GPU mode automatically disabled, forced to CPU-only mode
- **Detection method logged**: Check logs for "GPU detected via Kubernetes API" or "No GPUs found"

#### Pod Not Starting

```bash
# Check pod status
oc get pods -n vllm-playground

# View pod logs
oc logs -f deployment/vllm-playground-gpu -n vllm-playground

# Describe pod for events
oc describe pod -n vllm-playground
```

#### Out of Memory (OOM) Issues

**⚠️ IMPORTANT: Resource Requirements for GuideLLM Benchmarks**

The Web UI pod requires sufficient memory to avoid OOM kills when running GuideLLM benchmarks. GuideLLM generates many concurrent requests for load testing, which can quickly consume available memory.
**Memory usage scales with:**
- Number of concurrent users/requests
- Request rate (requests per second)
- Model size and response length
- Benchmark duration

**Recommended Memory Limits:**
- **GPU Mode (default)**: 16Gi minimum
  - For intensive GuideLLM benchmarks: **32Gi+**
  - For high-concurrency tests (50+ users): **64Gi+**
- **CPU Mode**: 64Gi minimum
  - For intensive GuideLLM benchmarks: **128Gi+**

**To increase resources:** Edit `openshift/manifests/04-webui-deployment.yaml`:

```yaml
resources:
  limits:
    memory: "32Gi"  # Increase based on benchmark intensity
    cpu: "8"
```

Then reapply:

```bash
oc apply -f openshift/manifests/04-webui-deployment.yaml
```

**Symptoms of OOM:**
- Pod restarts during benchmarks
- Benchmark failures with connection errors
- `OOMKilled` status in pod events: `oc describe pod <pod-name>`

#### Image Pull Errors

**Note:** The deployment now uses publicly accessible container images:
- **GPU**: `vllm/vllm-openai:v0.11.0` (official vLLM image)
- **CPU**: `quay.io/rh_ee_micyang/vllm-cpu:v0.11.0` (self-built, publicly accessible)

No registry authentication or pull secrets are required. If you encounter image pull errors:

```bash
# Verify image accessibility
podman pull vllm/vllm-openai:v0.11.0                 # For GPU
podman pull quay.io/rh_ee_micyang/vllm-cpu:v0.11.0   # For CPU

# Check pod events for details
oc describe pod -n vllm-playground
```

**📖 See [openshift/QUICK_START.md](openshift/QUICK_START.md)** for detailed OpenShift troubleshooting.

### Local Installation Issues

#### macOS Segmentation Fault

Use CPU mode with the proper environment variables, or use container mode (recommended). See [docs/MACOS_CPU_GUIDE.md](docs/MACOS_CPU_GUIDE.md).

#### Server Won't Start

1. Check if vLLM is installed: `python -c "import vllm; print(vllm.__version__)"`
2. Check port availability: `lsof -i :8000`
3. Review server logs in the WebUI

#### Chat Not Streaming

Check the browser console (F12) for errors and ensure the server is running.

---

Made with ❤️ for the vLLM community