[Go Report Card](https://goreportcard.com/report/github.com/llm-d/llm-d-inference-sim)
[License](/LICENSE)
[Join us on Slack](https://llm-d.slack.com/archives/C097SUE2HSL)
# vLLM Simulator
`llm-d-inference-sim` is a lightweight, configurable, and real-time simulator designed to mimic the behavior of vLLM without the need for GPUs or running actual heavy models. It operates as a fully OpenAI-compliant server, allowing developers to test clients, schedulers, and infrastructure using realistic request-response cycles, token streaming, and latency patterns.
## Why is this required?
Running full LLM inference requires significant GPU resources and introduces non-deterministic latency, making it difficult to isolate infrastructure bugs or iterate quickly on control-plane logic. This simulator decouples development from heavy inference, offering a controlled environment to:
- **Accelerate Infrastructure Development**: Test routing, scheduling, and KV cache locality logic without waiting for slow, expensive GPU operations.
- **Ensure Deterministic Testing**: Simulate precise token timing and latency to isolate performance regressions and bugs in a way that is impossible with non-deterministic real models.
- **Validate Observability**: Mirror vLLM’s Prometheus metrics to ensure monitoring and alerting systems are functioning correctly before deploying to production.
- **Test Advanced Features**: Safely develop complex logic such as LoRA adapter lifecycles (loading, unloading, and switching) and Disaggregated Prefill integrations.
## How it Works
The simulator is designed to act as a drop-in replacement for vLLM, sitting between your client/infrastructure and the void where the GPU usually resides. It processes requests through a configurable simulation engine that governs what is returned and when it is returned.
For detailed configuration definitions, see the [Configuration Guide](docs/configuration.md).
### Modes of Operation
The simulator decides the content of the response based on two primary modes:
- **Echo Mode** (`--mode echo`):
  Acts as a loopback. The response content mirrors the input (e.g., the last user message in a chat request), as shown in the sketch after this list.
  Useful for network throughput testing where content validity is irrelevant.
- **Random Mode** (`--mode random`):
  The default mode. Generates synthetic responses based on requested parameters (like `max_tokens`).
  Utilizes probabilistic histograms to determine response length.
  Content is sourced from either a set of pre-defined sentences or a custom dataset (see below).
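
For example, a minimal sketch of exercising echo mode with the locally built binary (see the build instructions below; the model name here is arbitrary):
```bash
# Start the simulator in echo mode (random mode is the default).
./bin/llm-d-inference-sim --model my_model --mode echo --port 8000 &

# The response content mirrors the last user message in the request.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "messages": [{"role": "user", "content": "ping"}]}'
```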
### Dual Protocol Support
Natively supports both HTTP (OpenAI-compatible) and gRPC (vLLM-compatible) interfaces on the same port, allowing for versatile integration testing across different client architectures.
For detailed API definitions see the [APIs Guide](docs/api.md).
### Response Generation & Datasets
In Random Mode, the simulator can generate content in two ways:
- **Predefined Text**: By default, it constructs responses by concatenating random sentences from a built-in list until the target token length is met.
- **Real Datasets**: If a dataset is provided (via `--dataset-path` or `--dataset-url`), the simulator attempts to match the hash of the incoming prompt to a conversation history in the database.
  - If a match is found, it returns the stored response.
  - If no match is found, it falls back to a random response from the dataset or predefined text.
  - Supports downloading SQLite datasets directly from HuggingFace.
For details on the response generation algorithms, see the [Response Generation Guide](docs/response_generation.md).
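As a sketch, a dataset can be supplied either locally or via a URL (the file name and URL below are illustrative placeholders):
```bash
# Look up responses in a local SQLite dataset.
./bin/llm-d-inference-sim --model my_model --port 8000 \
  --dataset-path ./conversations.sqlite3

# Or have the simulator download a SQLite dataset (e.g., from HuggingFace).
./bin/llm-d-inference-sim --model my_model --port 8000 \
  --dataset-url https://huggingface.co/datasets/<org>/<dataset>/resolve/main/conversations.sqlite3
```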
### Latency Simulation
Unlike simple mock servers that just "sleep" for a fixed time, this simulator models the physics of LLM inference:
- **Time to first token**: Simulates the prefill phase latency, including configurable standard deviation (jitter) for realism.
- **Inter-token latency**: Simulates the decode phase, adding delays between every subsequent token generation.
- **Load Simulation**: The simulator automatically increases latency as the number of concurrent requests grows.
- **Disaggregated Prefill (PD)**: Can simulate KV-cache transfer latency instead of standard TTFT when mimicking Prefill/Decode disaggregation architectures.
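
As a sketch, latencies are configured via command-line flags; the flag names and millisecond units shown here are assumptions to be verified against the [Configuration Guide](docs/configuration.md):
```bash
# Roughly 500 ms of prefill latency and 50 ms between generated tokens.
./bin/llm-d-inference-sim --model my_model --port 8000 \
  --time-to-first-token 500 \
  --inter-token-latency 50
```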
### Tokenization
The simulator offers flexible tokenization to balance accuracy against performance, automatically selecting between two tokenization modes based on the provided `--model` name:
* **HuggingFace Mode:** Used for real models (e.g., `meta-llama/Llama-3.1-8B-Instruct`). Downloads actual tokenizers for exact accuracy.
* **Simulated Mode:** Used for dummy/non-existent model names. Uses a fast regex tokenizer for maximum performance with zero startup overhead.
For details on caching, environment variables (`HF_TOKEN`), and performance tuning, see the [Tokenization Guide](docs/tokenization.md).
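For example (a sketch; the token value is a placeholder, and `HF_TOKEN` is only required for gated models):
```bash
# Real model name: the actual HuggingFace tokenizer is downloaded.
export HF_TOKEN=<your-huggingface-token>
./bin/llm-d-inference-sim --model meta-llama/Llama-3.1-8B-Instruct --port 8000

# Dummy model name: the fast simulated (regex) tokenizer is used, with no download.
./bin/llm-d-inference-sim --model my_model --port 8000
```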
### LoRA Management
Simulates the lifecycle (loading/unloading) of LoRA adapters without occupying actual memory, and reports LoRA-related Prometheus metrics.
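A sketch of the adapter lifecycle, assuming the simulator mirrors vLLM's dynamic LoRA endpoints (`/v1/load_lora_adapter` and `/v1/unload_lora_adapter`; verify the exact paths in the [APIs Guide](docs/api.md)):
```bash
# Load a LoRA adapter at runtime (adapter name and path are illustrative).
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "tweet-summary-2", "lora_path": "/adapters/tweet-summary-2"}'

# Unload it again.
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "tweet-summary-2"}'
```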
### KV Cache Simulation
Tracks simulated memory usage and publishes ZMQ events for cache block allocation and eviction.
### Failure Injection
Can randomly inject specific errors (e.g., rate_limit, model_not_found) to test client resilience.
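A sketch of enabling this; the flag name and value semantics are assumptions, with the authoritative definitions in the [Configuration Guide](docs/configuration.md):
```bash
# Inject failures into roughly 10% of requests (flag name and semantics are assumed).
./bin/llm-d-inference-sim --model my_model --port 8000 --failure-injection-rate 10
```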
### Deployment Options
The simulator is designed to run either as a standalone binary or within a Kubernetes Pod (e.g., for testing with Kind).
### Observability
The simulator supports a subset of standard vLLM Prometheus metrics.
For details, see the [Metrics Guide](docs/metrics.md).
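For a quick check, the Prometheus endpoint can be scraped directly (the standard `/metrics` path; metric names carry vLLM's `vllm:` prefix):
```bash
# List the vLLM-compatible metrics exposed by the simulator.
curl -s http://localhost:8000/metrics | grep '^vllm:'
```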
## Working with docker image
### Building
To build a Docker image of the vLLM Simulator, run:
```bash
make image-build
```
Please note that the default image tag is `ghcr.io/llm-d/llm-d-inference-sim:dev`.
The following environment variables can be used to change the image tag:
| Variable | Description | Default Value |
| --- | --- | --- |
| IMAGE_REGISTRY | Image registry name | ghcr.io/llm-d |
| IMAGE_TAG_BASE | Image base name | \$(IMAGE_REGISTRY)/llm-d-inference-sim |
| SIM_TAG | Image tag | dev |
| IMG | The full image specification | \$(IMAGE_TAG_BASE):\$(SIM_TAG) |
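
For example, to build the image for a different registry and tag (the values below are illustrative):
```bash
make image-build IMAGE_REGISTRY=quay.io/myorg SIM_TAG=v0.1.0
# Resulting image: quay.io/myorg/llm-d-inference-sim:v0.1.0
```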
### Running
To run the vLLM Simulator image under Docker, run:
```bash
docker run --rm --publish 8000:8000 -v $(pwd)/hf_cache:/hf_cache ghcr.io/llm-d/llm-d-inference-sim:dev --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora-modules '{"name":"tweet-summary-0"}' '{"name":"tweet-summary-1"}'
```
**Note:** To run the vLLM Simulator with the latest release version, replace `dev` in the above docker command with the current release, which can be found on [GitHub](https://github.com/llm-d/llm-d-inference-sim/pkgs/container/llm-d-inference-sim).
**Note:** The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.
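Once the container is running, a quick sanity check against the standard OpenAI-compatible models endpoint should list the base model and the configured LoRA adapters:
```bash
curl http://localhost:8000/v1/models
```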
## Standalone testing
### Building
To build the vLLM simulator to run locally as an executable, run:
```bash
make build
```
### Running
To run the vLLM simulator in a standalone test environment:
1. Set the PYTHONPATH environment variable (needed for the tokenization code) by running:
   ```bash
   . env-setup.sh
   ```
2. Start the simulator:
   ```bash
   ./bin/llm-d-inference-sim --model my_model --port 8000
   ```
## Kubernetes testing
To run the vLLM simulator in a Kubernetes cluster, run:
```bash
kubectl apply -f manifests/deployment.yaml
```
When testing locally with kind, build the Docker image with `make image-build`, then load it into the cluster:
```shell
kind load --name kind docker-image ghcr.io/llm-d/llm-d-inference-sim:dev
```
Update the `deployment.yaml` file to use the dev tag.
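One way to do this, as a sketch (it assumes the image reference in `manifests/deployment.yaml` points at `ghcr.io/llm-d/llm-d-inference-sim`):
```bash
# Point the manifest at the locally loaded dev tag, then (re)apply it.
sed -i 's|ghcr.io/llm-d/llm-d-inference-sim:.*|ghcr.io/llm-d/llm-d-inference-sim:dev|' manifests/deployment.yaml
kubectl apply -f manifests/deployment.yaml
```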
To verify the deployment is available, run:
```bash
kubectl get deployment vllm-llama3-8b-instruct
kubectl get service vllm-llama3-8b-instruct-svc
```
Use `kubectl port-forward` to expose the service on your local machine:
```bash
kubectl port-forward svc/vllm-llama3-8b-instruct-svc 8000:8000
```
Test the API with curl:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```
## Prefill/Decode (P/D) Separation Example
An example configuration for a P/D (Prefill/Decode) disaggregation deployment can be found in [manifests/disaggregation](manifests/disaggregation).