# vllm-playground

**Repository Path**: cdpro/vllm-playground

## Basic Information

- **Project Name**: vllm-playground
- **Description**: No description available
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-06
- **Last Updated**: 2026-01-06

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# vLLM Playground

A modern web interface for managing and interacting with vLLM servers (www.github.com/vllm-project/vllm). Supports both GPU and CPU modes, with special optimizations for macOS Apple Silicon and enterprise deployment on OpenShift/Kubernetes.

### ✨ New UI with Tool Calling Support

![vLLM Playground Interface](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-newUI.png)

### ✨ New UI with Structured Outputs Support

![vLLM Playground with Structured Outputs](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-structured-outputs.png)

### ✨ New UI Enhancements

- **🎨 Modern Dark Theme**: Sleek, professional interface with improved visual hierarchy and contrast
- **💬 Streamlined Chat Interface**: Clean, distraction-free chat UI with inline expandable panels
- **🔧 Icon Toolbar**: Compact icon bar for quick access to advanced features (ChatGPT-style)
  - ⚙️ **Chat Settings**: Temperature and max tokens configuration
  - 💬 **System Prompt**: Customizable with 8 preset templates (Helpful, Coder, Writer, Teacher, etc.)
- πŸ—οΈ **Structured Outputs**: Constrain model responses to Choice, Regex, JSON Schema, or Grammar - πŸ”§ **Tool Calling/Function Calling**: Define custom tools with parameters for function calling - πŸ”— **MCP Servers**: Model Context Protocol integration *(Coming Soon)* - βž• **RAG**: Retrieval-Augmented Generation support *(Coming Soon)* ## πŸ“¦ Quick Install via PyPI ```bash # Basic installation pip install vllm-playground # With GuideLLM benchmarking support pip install vllm-playground[benchmark] # First time? Pre-download the container image (~10GB for GPU) vllm-playground pull # Start the playground vllm-playground ``` Open http://localhost:7860 in your browser - that's it! πŸš€ > πŸ’‘ **Tip**: The `vllm-playground pull` command pre-downloads the large container image with progress display, so you don't have to wait during server startup! ![Pre-pull Container Image](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-pull.png) *Pre-download the ~10GB GPU container image with progress display - no more waiting during server startup!* ## πŸ“¦ New: PyPI Package **One-command installation!** vLLM Playground is now available on PyPI for easy distribution and installation. ```bash pip install vllm-playground vllm-playground ``` **Key Benefits:** - βœ… **Simple Installation**: Just `pip install` - no cloning required - βœ… **CLI Support**: `vllm-playground` command with options - βœ… **Pre-Pull Images**: `vllm-playground pull` downloads container images with progress display - βœ… **Auto GPU/CPU**: Automatically uses sudo for GPU containers, rootless for CPU - βœ… **Optional Extras**: Install with `[benchmark]` for GuideLLM support - βœ… **Easy Updates**: `pip install --upgrade vllm-playground` --- ## 🐳 New: Containerized vLLM Service **No more manual vLLM installation!** The Web UI now automatically manages vLLM in isolated containers, providing a seamless experience from local development to enterprise deployment. 
**📹 Watch Demo: Automatic Container Startup**

![Start vLLM Demo](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/start-vllm.gif)
*See how easy it is: Just click "Start Server" and the container orchestrator automatically starts the vLLM container - no manual installation or configuration needed!*

**📹 Watch Demo: Automatic Container Shutdown**

![Stop vLLM Demo](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/stop-vllm.gif)
*Clean shutdown: Click "Stop Server" and the container orchestrator gracefully stops the vLLM container with automatic cleanup!*

**Key Benefits:**
- ✅ **Zero Setup**: No vLLM installation required - containers handle everything
- ✅ **Isolated Environment**: vLLM runs in its own container, preventing conflicts
- ✅ **Smart Management**: Automatic container lifecycle (start, stop, logs, health checks)
- ✅ **Fast Restarts**: Configuration caching for quick server restarts
- ✅ **Hybrid Architecture**: Same UI works locally (Podman) and in cloud (Kubernetes)

**Architecture:**
- **Local Development**: Podman-based container orchestration
- **Enterprise Deployment**: OpenShift/Kubernetes with dynamic pod creation
- **Container Manager**: Automatic lifecycle management with smart reuse

## 📊 New: GuideLLM Benchmarking

Integrated GuideLLM for comprehensive performance benchmarking and analysis. Run load tests and get detailed metrics on throughput, latency, and token generation performance!

![GuideLLM Benchmark Results](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/guidellm.png)

## 📚 New: vLLM Community Recipes

**One-click model configurations from the official [vLLM Recipes Repository](https://github.com/vllm-project/recipes)!** Browse community-maintained configurations for popular models like DeepSeek, Qwen, Llama, Mistral, and more.
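The searchable catalog can be pictured as a simple filter over JSON entries. A minimal sketch, assuming a simplified entry shape — the real `recipes_catalog.json` schema may differ:

```python
import json

# Assumed, simplified catalog entries for illustration only.
CATALOG = [
    {"model": "Qwen/Qwen2.5-7B-Instruct", "category": "Qwen",
     "tags": ["multi-gpu", "reasoning"]},
    {"model": "meta-llama/Llama-3.2-1B-Instruct", "category": "Llama",
     "tags": ["cpu-friendly"]},
]

def filter_recipes(catalog, *, tag=None, category=None):
    """Return catalog entries matching an optional tag and/or category."""
    return [
        r for r in catalog
        if (tag is None or tag in r["tags"])
        and (category is None or r["category"] == category)
    ]

print(json.dumps(filter_recipes(CATALOG, tag="cpu-friendly"), indent=2))
```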
![vLLM Recipes Browser](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-recipes-1.png)
*Browse 17+ model categories with optimized configurations - just click "Load Config" to auto-fill all settings!*

![vLLM Recipes Details](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-recipes-2.png)
*Each recipe includes hardware requirements, vLLM parameters, and direct links to documentation.*

**Key Features:**
- ✅ **One-Click Configuration**: Load optimized vLLM settings instantly
- ✅ **Community-Maintained**: Syncs with official vLLM recipes repository
- ✅ **Searchable Catalog**: Filter by model name, category, or tags (multi-gpu, vision, reasoning, etc.)
- ✅ **Hardware Guidance**: See recommended GPU configurations for each model
- ✅ **Custom Recipes**: Add, edit, or delete your own recipes
- ✅ **GitHub Sync**: Update catalog from GitHub with optional token for higher rate limits

**Supported Model Families:** DeepSeek, Qwen, Llama, Mistral, InternVL, GLM, NVIDIA Nemotron, Moonshot AI (Kimi), MiniMax, Jina AI, Tencent Hunyuan, Ernie, OpenAI, PaddlePaddle, Seed, inclusionAI, and CPU-friendly models.

## 🔧 Model Compression

**Looking for model compression and quantization?** Check out the separate **[LLMCompressor Playground](https://github.com/micytao/llmcompressor-playground)** project for:

- Model quantization (INT8, INT4, FP8)
- GPTQ, AWQ, and SmoothQuant algorithms
- Built-in compression presets
- Integration with vLLM

This keeps the vLLM Playground focused on serving and benchmarking, while providing a dedicated tool for model optimization.
## πŸ“ Project Structure ``` vllm-playground/ β”œβ”€β”€ app.py # Main FastAPI backend application β”œβ”€β”€ run.py # Backend server launcher β”œβ”€β”€ container_manager.py # πŸ†• Podman-based container orchestration (local) β”œβ”€β”€ index.html # Main HTML interface β”œβ”€β”€ pyproject.toml # πŸ†• PyPI package configuration β”œβ”€β”€ requirements.txt # Python dependencies β”œβ”€β”€ env.example # Example environment variables β”œβ”€β”€ LICENSE # MIT License β”œβ”€β”€ README.md # This file β”‚ β”œβ”€β”€ vllm_playground/ # πŸ†• PyPI package source β”‚ β”œβ”€β”€ __init__.py # Package version and exports β”‚ β”œβ”€β”€ app.py # FastAPI application β”‚ β”œβ”€β”€ cli.py # CLI entry point β”‚ β”œβ”€β”€ container_manager.py # Container orchestration β”‚ └── ... # Static assets and templates β”‚ β”œβ”€β”€ containers/ # Container definitions 🐳 β”‚ β”œβ”€β”€ Containerfile.vllm-playground # πŸ†• Web UI container (orchestrator) β”‚ β”œβ”€β”€ Containerfile.mac # πŸ†• vLLM service container (macOS/CPU) β”‚ └── README.md # Container variants documentation β”‚ β”œβ”€β”€ openshift/ # πŸ†• OpenShift/Kubernetes deployment ☸️ β”‚ β”œβ”€β”€ kubernetes_container_manager.py # K8s API-based orchestration β”‚ β”œβ”€β”€ Containerfile # Web UI container for OpenShift β”‚ β”œβ”€β”€ requirements-k8s.txt # Python dependencies (with K8s client) β”‚ β”œβ”€β”€ deploy.sh # Automated deployment (CPU/GPU) β”‚ β”œβ”€β”€ undeploy.sh # Automated undeployment β”‚ β”œβ”€β”€ build.sh # Container build script β”‚ β”œβ”€β”€ manifests/ # Kubernetes manifests β”‚ β”‚ β”œβ”€β”€ 00-secrets-template.yaml β”‚ β”‚ β”œβ”€β”€ 01-namespace.yaml β”‚ β”‚ β”œβ”€β”€ 02-rbac.yaml β”‚ β”‚ β”œβ”€β”€ 03-configmap.yaml β”‚ β”‚ β”œβ”€β”€ 04-webui-deployment.yaml β”‚ β”‚ └── 05-pvc-optional.yaml β”‚ β”œβ”€β”€ README.md # Architecture overview β”‚ └── QUICK_START.md # Quick deployment guide β”‚ β”œβ”€β”€ deployments/ # Legacy deployment scripts β”‚ β”œβ”€β”€ kubernetes-deployment.yaml β”‚ β”œβ”€β”€ openshift-deployment.yaml β”‚ └── 
deploy-to-openshift.sh β”‚ β”œβ”€β”€ static/ # Frontend assets β”‚ β”œβ”€β”€ css/ β”‚ β”‚ └── style.css # Main stylesheet β”‚ └── js/ β”‚ └── app.js # Frontend JavaScript β”‚ β”œβ”€β”€ scripts/ # Utility scripts β”‚ β”œβ”€β”€ run_cpu.sh # Start vLLM in CPU mode (macOS compatible) β”‚ β”œβ”€β”€ start.sh # General start script β”‚ β”œβ”€β”€ install.sh # Installation script β”‚ β”œβ”€β”€ verify_setup.py # Setup verification β”‚ β”œβ”€β”€ kill_playground.py # Kill running playground instances β”‚ └── restart_playground.sh # Restart playground β”‚ β”œβ”€β”€ config/ # Configuration files β”‚ β”œβ”€β”€ vllm_cpu.env # CPU mode environment variables β”‚ └── example_configs.json # Example configurations β”‚ β”œβ”€β”€ cli_demo/ # πŸ†• Command-line demo workflow β”‚ β”œβ”€β”€ scripts/ # Demo shell scripts β”‚ └── docs/ # Demo documentation β”‚ β”œβ”€β”€ recipes/ # πŸ†• vLLM Community Recipes πŸ“š β”‚ β”œβ”€β”€ recipes_catalog.json # Model configurations catalog β”‚ └── sync_recipes.py # GitHub sync script β”‚ β”œβ”€β”€ assets/ # Images and assets β”‚ β”œβ”€β”€ vllm-playground.png # WebUI screenshot β”‚ β”œβ”€β”€ guidellm.png # GuideLLM benchmark results screenshot β”‚ β”œβ”€β”€ vllm-recipes-1.png # πŸ†• Recipes browser screenshot β”‚ β”œβ”€β”€ vllm-recipes-2.png # πŸ†• Recipes details screenshot β”‚ β”œβ”€β”€ vllm.png # vLLM logo β”‚ └── vllm_only.png # vLLM logo (alternate) β”‚ └── docs/ # Documentation β”œβ”€β”€ QUICKSTART.md # Quick start guide β”œβ”€β”€ MACOS_CPU_GUIDE.md # macOS CPU setup guide β”œβ”€β”€ CPU_MODELS_QUICKSTART.md # CPU-optimized models guide β”œβ”€β”€ GATED_MODELS_GUIDE.md # Guide for accessing Llama, Gemma, etc. 
    ├── TROUBLESHOOTING.md        # Common issues and solutions
    ├── FEATURES.md               # Feature documentation
    ├── PERFORMANCE_METRICS.md    # Performance metrics
    └── QUICK_REFERENCE.md        # Command reference
```

## 🚀 Quick Start

### 📦 Option 1: PyPI Installation (Easiest)

Install and run with a single command:

```bash
# Install from PyPI
pip install vllm-playground

# Or with benchmarking support
pip install vllm-playground[benchmark]

# Start the playground
vllm-playground
```

Open http://localhost:7860 and click "Start Server" - vLLM container starts automatically!

**CLI Options:**

```bash
vllm-playground --help            # Show all options
vllm-playground pull              # Pre-download GPU image (~10GB) with progress
vllm-playground pull --cpu        # Pre-download CPU image
vllm-playground pull --all        # Pre-download all images
vllm-playground --port 8080       # Use custom port
vllm-playground --host localhost  # Bind to localhost only
vllm-playground stop              # Stop running instance
vllm-playground status            # Check if running
```

---

### 🐳 Option 2: Container Orchestration (From Source)

For development or customization:

```bash
# 1. Clone the repository
git clone https://github.com/micytao/vllm-playground.git
cd vllm-playground

# 2. Install Podman (if not already installed)
# macOS: brew install podman
# Linux: dnf install podman or apt install podman

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Start the Web UI
python run.py

# 5. Open http://localhost:7860

# 6. Click "Start Server" - vLLM container starts automatically!
```

**✨ Benefits:**
- ✅ No vLLM installation required
- ✅ Automatic container lifecycle management
- ✅ Isolated vLLM environment
- ✅ Same UI works locally and on OpenShift/Kubernetes

**How it works:**
- Web UI runs on your host
- vLLM runs in an isolated container
- Container manager (`container_manager.py`) orchestrates everything

**Note:** The Web UI will automatically pull and start the vLLM container when you click "Start Server"

---

### ☸️ Option 3: OpenShift/Kubernetes Deployment

Deploy the entire stack to OpenShift or Kubernetes with dynamic pod management:

```bash
# 1. Build and push Web UI container
cd openshift/
podman build -f Containerfile -t your-registry/vllm-playground:latest .
podman push your-registry/vllm-playground:latest

# 2. Deploy to cluster (GPU or CPU mode)
./deploy.sh --gpu   # For GPU clusters
./deploy.sh --cpu   # For CPU-only clusters

# 3. Get the URL
oc get route vllm-playground -n vllm-playground
```

**✨ Benefits:**
- ✅ Enterprise-grade deployment
- ✅ Dynamic vLLM pod creation via Kubernetes API
- ✅ Same UI and workflow as local setup
- ✅ Auto-scaling and resource management

**📖 See [openshift/README.md](openshift/README.md)** and **[openshift/QUICK_START.md](openshift/QUICK_START.md)** for detailed instructions.

---

### 💻 Option 4: Local Installation (Traditional)

For local development without containers:

#### 1. Install vLLM

```bash
# For macOS/CPU mode
pip install vllm
```

#### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

#### 3. Start the WebUI

```bash
python run.py
```

Then open http://localhost:7860 in your browser.

#### 4. Start vLLM Server

**Option A: Using the WebUI**
- Select CPU or GPU mode
- Click "Start Server"

**Option B: Using the script (macOS/CPU)**

```bash
./scripts/run_cpu.sh
```

## ☸️ OpenShift/Kubernetes Deployment

Deploy vLLM Playground to enterprise Kubernetes/OpenShift clusters with dynamic pod management:

**Features:**
- ✅ Dynamic vLLM pod creation via Kubernetes API
- ✅ GPU and CPU mode support with Red Hat images
- ✅ RBAC-based security model
- ✅ Automated deployment scripts
- ✅ Same UI and workflow as local setup

**Quick Deploy:**

```bash
cd openshift/
./deploy.sh --gpu   # For GPU clusters
./deploy.sh --cpu   # For CPU-only clusters
```

**📖 Full Documentation:** See [openshift/README.md](openshift/README.md) and [openshift/QUICK_START.md](openshift/QUICK_START.md)

---

## 💻 macOS Apple Silicon Support

For macOS users, vLLM runs in CPU mode using containerization:

**Container Mode (Recommended):**

```bash
# Just start the Web UI - it handles containers automatically
python run.py
# Click "Start Server" in the UI
```

**Direct Mode:**

```bash
# Edit CPU configuration
nano config/vllm_cpu.env

# Run vLLM directly
./scripts/run_cpu.sh
```

**📖 See [docs/MACOS_CPU_GUIDE.md](docs/MACOS_CPU_GUIDE.md)** for detailed setup.

## ✨ Features

- **💬 Modern Chat Interface**: Streamlined, ChatGPT-style chat experience 🆕
  - Clean, distraction-free interface with inline expandable panels
  - Icon toolbar for quick access to advanced features
  - System prompt templates (8 presets: Helpful, Coder, Writer, Teacher, etc.)
  - Real-time response metrics and token counting
- **🏗️ Structured Outputs**: Constrain model responses to specific formats 🆕
  - **Choice**: Force output to one of specific values (sentiment, yes/no, etc.)
  - **Regex**: Match output to regex patterns (email, phone, date formats)
  - **JSON Schema**: Generate valid JSON matching your schema
  - **Grammar (EBNF)**: Define complex output structures

  ![Structured Outputs](https://raw.githubusercontent.com/micytao/vllm-playground/main/assets/vllm-playground-structured-outputs.png)
  *Structured Outputs with Choice mode for sentiment analysis - responses constrained to "positive", "negative", or "neutral"*

- **🔧 Tool Calling / Function Calling**: Define custom tools for the model 🆕
  - Server-side configuration: Enable in Server Configuration panel before starting
  - Auto-detected parsers: Llama 3.x, Mistral, Hermes, Qwen, Granite, InternLM
  - Create tools with name, description, and parameters
  - Preset tools (Weather, Calculator, Search)
  - Parallel tool calls support
  - Per-request tool_choice control (none/auto)
- **🔗 MCP Server Integration**: Model Context Protocol support *(Coming Soon)* 🆕
- **➕ RAG Support**: Retrieval-Augmented Generation *(Coming Soon)* 🆕
- **🐳 Container Orchestration**: Automatic vLLM container lifecycle management
  - Local development: Podman-based orchestration
  - Enterprise deployment: Kubernetes API-based orchestration
  - Seamless switching between local and cloud environments
  - Smart container reuse (fast restarts with same config)
  - Unified CLI args: All container images now use the same interface as official vLLM 🆕
- **☸️ OpenShift/Kubernetes Deployment**: Production-ready cloud deployment 🆕
  - Dynamic pod creation via Kubernetes API
  - CPU and GPU mode support
  - RBAC-based security
  - Automated deployment scripts
- **🎯 Intelligent Hardware Detection**: Automatic GPU availability detection 🆕
  - Kubernetes-native: Queries cluster nodes for `nvidia.com/gpu` resources
  - Automatic UI adaptation: GPU mode enabled/disabled based on availability
  - No nvidia-smi required: Uses Kubernetes API for detection
  - Fallback support: nvidia-smi detection for local environments
- **Performance Benchmarking**: GuideLLM integration for comprehensive load testing with detailed metrics
  - Request statistics (success rate, duration, avg times)
  - Token throughput analysis (mean/median tokens per second)
  - Latency percentiles (P50, P75, P90, P95, P99)
  - Configurable load patterns and request rates
- **📚 vLLM Community Recipes**: One-click model configurations 🆕
  - Browse 17+ model categories from official vLLM recipes
  - One-click configuration loading for optimized settings
  - Searchable by model name, category, or tags
  - Add, edit, or sync custom recipes
  - Hardware requirements and documentation links
- **Server Management**: Start/stop vLLM servers from the UI
- **Chat Interface**: Interactive chat with streaming responses
- **Smart Chat Templates**: Automatic model-specific template detection
- **Performance Metrics**: Real-time token counts and generation speed
- **Model Support**: Pre-configured popular models + custom model support
- **Gated Model Access**: Built-in HuggingFace token support for Llama, Gemma, etc.
- **CPU & GPU Modes**: Automatic detection and configuration
- **macOS Optimized**: Special support for Apple Silicon
- **Resizable Panels**: Customizable layout
- **Command Preview**: See exact commands before execution

## 📖 Documentation

### Getting Started

- **[Quick Start Guide](docs/QUICKSTART.md)** - Get up and running in minutes
- **[Command-Line Demo Guide](cli_demo/docs/CLI_DEMO_GUIDE.md)** - Full workflow demo with vLLM & GuideLLM
- [macOS CPU Setup](docs/MACOS_CPU_GUIDE.md) - Apple Silicon optimization guide
- [CPU Models Quickstart](docs/CPU_MODELS_QUICKSTART.md) - Best models for CPU

### Container & Deployment

- **[OpenShift/Kubernetes Deployment](openshift/README.md)** ☸️ - Enterprise deployment guide 🆕
- **[OpenShift Quick Start](openshift/QUICK_START.md)** - 5-minute deployment 🆕
- **[Container Variants](containers/README.md)** 🐳 - Local container setup
- [Legacy Deployment Scripts](deployments/README.md) - Kubernetes manifests

### Model Configuration

- **[Gated Models Guide (Llama, Gemma)](docs/GATED_MODELS_GUIDE.md)** ⭐ - Access restricted models

### Reference

- [Feature Overview](docs/FEATURES.md) - Complete feature list
- [Performance Metrics](docs/PERFORMANCE_METRICS.md) - Benchmarking and metrics
- [Command Reference](docs/QUICK_REFERENCE.md) - Command cheat sheet
- [CLI Quick Reference](cli_demo/docs/CLI_QUICK_REFERENCE.md) - Command-line demo quick reference
- [Troubleshooting](docs/TROUBLESHOOTING.md) - Common issues and solutions

## 🔧 Configuration

### CPU Mode (macOS)

Edit `config/vllm_cpu.env`:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=auto
```

### Tool Calling Configuration 🆕

Tool calling enables models to use functions/tools you define. This is a **server-side feature** that must be enabled before starting the server.

**How to Enable:**

1. Check "Enable Tool Calling" in the **Server Configuration** panel
2. Select a Tool Call Parser (or leave on "Auto-detect")
3. Start the server
4. Define tools in the **Tool Calling** panel (🔧 icon in toolbar)

**Supported Models & Parsers:**

| Model Family | Parser | Example Models |
|--------------|--------|----------------|
| Llama 3.x | `llama3_json` | Llama-3.2-1B-Instruct, Llama-3.1-8B-Instruct |
| Mistral | `mistral` | Mistral-7B-Instruct-v0.3, Mixtral-8x7B |
| Hermes | `hermes` | Hermes-2-Pro, Hermes-3 |
| Qwen | `hermes` | Qwen2.5-7B-Instruct, Qwen2-VL |
| Granite | `granite-20b-fc` | granite-20b-functioncalling |
| InternLM | `internlm` | InternLM2.5-7B-Chat |

**Per-Request Options:**
- **Tool Choice**: `none` (disable) or `auto` (let model decide)
- **Tools**: Define in the Tool Calling panel

**Note:** Tool calling adds `--enable-auto-tool-choice --tool-call-parser <parser>` to the vLLM startup command.

### Supported Models

**CPU-Optimized Models (Recommended for macOS):**
- **TinyLlama/TinyLlama-1.1B-Chat-v1.0** (default) - Fast, no token required
- **meta-llama/Llama-3.2-1B** - Latest Llama, requires HF token (gated)
- **google/gemma-2-2b** - High quality, requires HF token (gated)
- facebook/opt-125m - Tiny test model

**Larger Models (Slow on CPU, better on GPU):**
- meta-llama/Llama-2-7b-chat-hf (requires HF token)
- mistralai/Mistral-7B-Instruct-v0.2
- Custom models via text input

**📌 Note**: Gated models (Llama, Gemma) require a HuggingFace token. See [Gated Models Guide](docs/GATED_MODELS_GUIDE.md) for setup.
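Once the server is running with tool calling enabled, tools are supplied per request through the OpenAI-compatible chat-completions API. A minimal sketch of such a request payload — the model name is an assumption, and the `get_weather` tool is an illustrative example in the spirit of the Weather preset:

```python
import json

def build_tool_request(model: str, user_message: str) -> dict:
    """Build a /v1/chat/completions payload with one example tool defined."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative tool, not a built-in
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # or "none" to disable tools for this request
    }

payload = build_tool_request("meta-llama/Llama-3.2-1B-Instruct",
                             "What's the weather in Paris?")
print(json.dumps(payload, indent=2))
# POST this JSON to the running server, e.g. http://localhost:8000/v1/chat/completions
```

If the model decides to call the tool, the response's `message.tool_calls` carries the function name and JSON arguments for your code to execute.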
## πŸ› οΈ Development ### Architecture The project uses a **hybrid architecture** that works seamlessly in both local and cloud environments: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Web UI (FastAPI) β”‚ β”‚ app.py + index.html + static/ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”œβ”€β†’ container_manager.py (Local) β”‚ └─→ Podman CLI β”‚ └─→ vLLM Container β”‚ └─→ kubernetes_container_manager.py (Cloud) └─→ Kubernetes API └─→ vLLM Pods ``` **Key Components:** - **Backend**: FastAPI (`app.py`) - **Container Manager (Local)**: Podman orchestration (`container_manager.py`) - **Container Manager (K8s)**: Kubernetes API orchestration (`openshift/kubernetes_container_manager.py`) - **Frontend**: Vanilla JavaScript (`static/js/app.js`) - **Styling**: Custom CSS (`static/css/style.css`) - **Scripts**: Bash scripts in `scripts/` - **Config**: Environment files in `config/` ### Running in Development ```bash # Start backend with auto-reload uvicorn app:app --reload --port 7860 # Or use the run script python run.py ``` ### Container Development ```bash # Build vLLM service container (macOS/CPU ARM64) podman build -f containers/Containerfile.mac -t vllm-mac:v0.11.0 . # Build vLLM service container (Linux x86_64 CPU) podman build -f containers/Containerfile.cpu -t vllm-cpu:v0.11.0 . # Build Web UI orchestrator container podman build -f containers/Containerfile.vllm-playground -t vllm-playground:latest . # Build OpenShift Web UI container podman build -f openshift/Containerfile -t vllm-playground-webui:latest . 
```

**Container Architecture:** All custom container images now use the same CLI argument pattern as the official vLLM image:

```bash
# All images accept vLLM CLI args directly
podman run vllm-mac:v0.11.0 \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json
```

## 📝 License

MIT License - See [LICENSE](LICENSE) file for details

## 🤝 Contributing

Contributions welcome! Please feel free to submit issues and pull requests.

## 🔗 Links

- [vLLM Official Documentation](https://docs.vllm.ai/)
- [vLLM CPU Mode Guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html)
- [vLLM GitHub](https://github.com/vllm-project/vllm)
- **[LLMCompressor Playground](https://github.com/micytao/llmcompressor-playground)** - Separate project for model compression and quantization
- [GuideLLM](https://github.com/neuralmagic/guidellm) - Performance benchmarking tool

## 🏗️ Architecture Overview

### Local Development (Container Orchestration)

```
┌──────────────────┐
│   User Browser   │
└────────┬─────────┘
         │ http://localhost:7860
         ↓
┌──────────────────┐
│  Web UI (Host)   │  ← FastAPI app
│     app.py       │
└────────┬─────────┘
         │ Podman CLI
         ↓
┌──────────────────┐
│ container_manager│  ← Podman orchestration
│      .py         │
└────────┬─────────┘
         │ podman run/stop
         ↓
┌──────────────────┐
│  vLLM Container  │  ← Isolated vLLM service
│   (Port 8000)    │
└──────────────────┘
```

### OpenShift/Kubernetes Deployment

```
┌──────────────────┐
│   User Browser   │
└────────┬─────────┘
         │ https://route-url
         ↓
┌──────────────────┐
│ OpenShift Route  │
└────────┬─────────┘
         ↓
┌──────────────────┐
│   Web UI Pod     │  ← FastAPI app in container
│  (Deployment)    │  ← Auto-detects GPU availability
└────────┬─────────┘
         │ Kubernetes API
         │ (reads nodes, creates/deletes pods)
         ↓
┌──────────────────┐
│  kubernetes_     │  ← K8s API orchestration
│  container_      │  ← Checks nvidia.com/gpu resources
│  manager.py      │
└────────┬─────────┘
         │ create/delete pods
         ↓
┌──────────────────┐
│    vLLM Pod      │  ← Dynamically created
│   (Dynamic)      │  ← GPU: Official vLLM image
│                  │  ← CPU: Self-built optimized image
└──────────────────┘
```

**Container Images:**
- **GPU Mode**: Official vLLM image (`vllm/vllm-openai:v0.11.0`)
- **CPU Mode (Linux x86)**: Self-built optimized image (`quay.io/rh_ee_micyang/vllm-cpu:v0.11.0`)
- **CPU Mode (macOS ARM64)**: Self-built optimized image (`quay.io/rh_ee_micyang/vllm-mac:v0.11.0`) 🆕

**Key Features:**
- Same UI code works in both environments
- Container manager is swapped at build time (Podman → Kubernetes)
- Identical user experience locally and in the cloud
- Smart container/pod lifecycle management
- **Automatic GPU detection**: UI adapts based on cluster hardware
  - Kubernetes-native: Queries nodes for `nvidia.com/gpu` resources
  - Automatic mode selection: GPU mode disabled if no GPUs available
  - RBAC-secured: Requires node read permissions (automatically configured)
- No registry authentication needed (all images are publicly accessible)

---

## 🆘 Troubleshooting

### Container-Related Issues

#### Container Won't Start

```bash
# Check if Podman is installed
podman --version

# Check Podman connectivity
podman ps

# View container logs
podman logs vllm-service
```

#### "Address Already in Use" Error

If you lose connection to the Web UI and get `ERROR: address already in use`:

```bash
# Quick Fix: Auto-detect and kill old process
python run.py

# Alternative: Manual restart
./scripts/restart_playground.sh

# Or kill manually
python scripts/kill_playground.py
```

#### vLLM Container Issues

```bash
# Check if container is running
podman ps -a | grep vllm-service

# View vLLM logs
podman logs -f vllm-service

# Stop and remove container
podman stop vllm-service && podman rm vllm-service

# Pull latest vLLM images
podman pull quay.io/rh_ee_micyang/vllm-mac:v0.11.0   # macOS ARM64
podman pull quay.io/rh_ee_micyang/vllm-cpu:v0.11.0   # Linux x86_64
podman pull vllm/vllm-openai:v0.11.0                 # GPU (official)
```

#### Tool Calling Not Working

Tool calling requires **server-side configuration**. If tools aren't being called:

1. **Verify server was started with tool calling enabled:**
   - Check "Enable Tool Calling" in Server Configuration BEFORE starting
   - Look for this in startup logs: `Tool calling enabled with parser: llama3_json`
2. **Verify CLI args are passed:**
   ```
   vLLM arguments: --model ... --enable-auto-tool-choice --tool-call-parser llama3_json
   ```
3. **If using container mode**, ensure the container was started fresh after enabling tool calling (stop and restart if needed)

### OpenShift/Kubernetes Issues

#### GPU Mode Not Available

The Web UI automatically detects GPU availability by querying Kubernetes nodes for `nvidia.com/gpu` resources. If GPU mode is disabled in the UI:

**Check GPU availability in your cluster:**

```bash
# List nodes with GPU capacity
oc get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu

# Or check all node details
oc describe nodes | grep nvidia.com/gpu
```

**If GPUs exist but are not detected:**

1. Verify RBAC permissions:
   ```bash
   # Check if service account has node read permissions
   oc auth can-i list nodes --as=system:serviceaccount:vllm-playground:vllm-playground-sa
   # Should return "yes"
   ```
2. Reapply RBAC if needed:
   ```bash
   oc apply -f openshift/manifests/02-rbac.yaml
   ```
3. Check Web UI logs for detection errors:
   ```bash
   oc logs -f deployment/vllm-playground-cpu -n vllm-playground | grep -i gpu
   ```

**Expected behavior:**
- **GPU available**: Both CPU and GPU modes enabled in UI
- **No GPU**: GPU mode automatically disabled, forced to CPU-only mode
- **Detection method logged**: Check logs for "GPU detected via Kubernetes API" or "No GPUs found"

#### Pod Not Starting

```bash
# Check pod status
oc get pods -n vllm-playground

# View pod logs
oc logs -f deployment/vllm-playground-gpu -n vllm-playground

# Describe pod for events
oc describe pod -n vllm-playground
```

#### Out of Memory (OOM) Issues

**⚠️ IMPORTANT: Resource Requirements for GuideLLM Benchmarks**

The Web UI pod requires sufficient memory to avoid OOM kills when running GuideLLM benchmarks. GuideLLM generates many concurrent requests for load testing, which can quickly consume available memory.
**Memory usage scales with:**
- Number of concurrent users/requests
- Request rate (requests per second)
- Model size and response length
- Benchmark duration

**Recommended Memory Limits:**
- **GPU Mode (default)**: 16Gi minimum
  - For intensive GuideLLM benchmarks: **32Gi+**
  - For high-concurrency tests (50+ users): **64Gi+**
- **CPU Mode**: 64Gi minimum
  - For intensive GuideLLM benchmarks: **128Gi+**

**To increase resources:** Edit `openshift/manifests/04-webui-deployment.yaml`:

```yaml
resources:
  limits:
    memory: "32Gi"  # Increase based on benchmark intensity
    cpu: "8"
```

Then reapply:

```bash
oc apply -f openshift/manifests/04-webui-deployment.yaml
```

**Symptoms of OOM:**
- Pod restarts during benchmarks
- Benchmark failures with connection errors
- `OOMKilled` status in pod events: `oc describe pod <pod-name>`

#### Image Pull Errors

**Note:** The deployment now uses publicly accessible container images:
- **GPU**: `vllm/vllm-openai:v0.11.0` (official vLLM image)
- **CPU**: `quay.io/rh_ee_micyang/vllm-cpu:v0.11.0` (self-built, publicly accessible)

No registry authentication or pull secrets are required. If you encounter image pull errors:

```bash
# Verify image accessibility
podman pull vllm/vllm-openai:v0.11.0                 # For GPU
podman pull quay.io/rh_ee_micyang/vllm-cpu:v0.11.0   # For CPU

# Check pod events for details
oc describe pod -n vllm-playground
```

**📖 See [openshift/QUICK_START.md](openshift/QUICK_START.md)** for detailed OpenShift troubleshooting.

### Local Installation Issues

#### macOS Segmentation Fault

Use CPU mode with the proper environment variables, or use container mode (recommended). See [docs/MACOS_CPU_GUIDE.md](docs/MACOS_CPU_GUIDE.md).

#### Server Won't Start

1. Check if vLLM is installed: `python -c "import vllm; print(vllm.__version__)"`
2. Check port availability: `lsof -i :8000`
3. Review server logs in the WebUI

#### Chat Not Streaming

Check the browser console (F12) for errors and ensure the server is running.

---

Made with ❤️ for the vLLM community