# kvcached

**Repository Path**: underdogs/kvcached

## Basic Information

- **Project Name**: kvcached
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-29
- **Last Updated**: 2026-01-29

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README
### Multi-LLM serving

kvcached allows multiple LLMs to share a GPU's memory elastically, enabling concurrent deployment without the rigid static memory partitioning used today. This improves GPU utilization and reduces serving costs.
### Serverless LLM

By allocating KV cache only when needed, kvcached supports serverless deployments where models can spin up and down on demand.
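The on-demand allocation idea above can be sketched roughly as follows. This is a hypothetical illustration, not kvcached's actual API: the `LazyKVCachePool` class and its methods are invented for this example. The point is that a model only holds KV-cache blocks while it is actively serving, so memory released by one model is immediately available to another.

```python
# Hypothetical sketch (not the kvcached API): a pool that hands out
# fixed-size KV-cache blocks lazily, so a model consumes memory only
# while it is actively serving requests.

class LazyKVCachePool:
    def __init__(self, total_blocks: int):
        self.total_blocks = total_blocks
        self.allocated = {}                  # model name -> set of block ids
        self.free = set(range(total_blocks))

    def acquire(self, model: str, n_blocks: int) -> set:
        """Allocate blocks on demand; raise if the pool is exhausted."""
        if n_blocks > len(self.free):
            raise MemoryError("KV cache pool exhausted")
        blocks = {self.free.pop() for _ in range(n_blocks)}
        self.allocated.setdefault(model, set()).update(blocks)
        return blocks

    def release(self, model: str) -> None:
        """Return all of a model's blocks, e.g. when it spins down."""
        self.free |= self.allocated.pop(model, set())


pool = LazyKVCachePool(total_blocks=8)
pool.acquire("model-a", 5)   # model-a spins up and takes memory on demand
pool.release("model-a")      # model-a spins down; its blocks return to the pool
pool.acquire("model-b", 8)   # model-b can now elastically use the whole pool
```

Contrast this with static partitioning, where each model would pre-reserve a fixed share of the pool at startup and hold it even while idle.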
### Compound AI systems

kvcached makes compound AI systems practical on limited hardware by elastically allocating memory across specialized models in a pipeline (e.g., retrieval, reasoning, and summarization).
### GPU workload colocation

kvcached allows LLM inference to coexist with other GPU workloads, such as training jobs, fine-tuning, or vision models.