# vllm_xformers_prefetch

**Repository Path**: alibaba/vllm_xformers_prefetch

## Basic Information

- **Project Name**: vllm_xformers_prefetch
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-12
- **Last Updated**: 2026-01-01

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Prefetch-enhanced vLLM XFORMERS Attention Kernel

This repository contains an improved attention kernel for vLLM's XFORMERS backend with prefetching optimizations, as presented in our paper accepted at AAAI 2026.

## Overview

This work enhances the vLLM XFORMERS backend attention kernel by incorporating prefetching techniques to improve memory access patterns and reduce latency during attention computation. The key innovation is the addition of asynchronous memory prefetching instructions for key and value cache blocks within the XFORMERS attention implementation, which helps hide memory access latency and improve overall throughput. The experiments in our paper were conducted on an NVIDIA H20 GPU.

## Key Improvements

- Implemented `cp.async.bulk.prefetch.L2.global` instructions to prefetch the next key and value cache blocks (an illustrative sketch is included at the end of this README)

## Installation

1. Replace the `csrc/attention/attention_kernels.cuh` file in your vLLM checkout with the `attention_kernels.cuh` file from this repository
2. Run the following command in the vLLM root directory to recompile vLLM:

```bash
pip install --no-build-isolation -e .
```

Note: For detailed instructions on installing vLLM, please refer to the [official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#full-build-with-compilation). The replacement file was last tested with vLLM v0.10.2.

## Usage

To use the prefetch-enhanced XFORMERS attention kernel, set the following environment variable before starting vLLM:

```bash
export VLLM_ATTENTION_BACKEND=XFORMERS
```

## Paper Citation

Details of this work are available in our paper: [*Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching*](https://arxiv.org/abs/2504.06319)

## Acknowledgements

This implementation is based on the vLLM project and NVIDIA's FasterTransformer implementation. We thank the vLLM team for their excellent work on optimizing large language model inference, especially their XFORMERS backend implementation.

## License

This project is licensed under the Apache License 2.0.
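
## Appendix: Illustrative Prefetch Sketch

The snippet below is a minimal, hypothetical CUDA sketch of how an asynchronous L2 prefetch hint for the next key/value cache block can be issued via inline PTX. It is not the code shipped in `attention_kernels.cuh`; the helper `prefetch_l2_async`, the kernel `attention_block_loop`, and all parameter names are illustrative assumptions.

```cuda
#include <cstdint>

// Hypothetical helper (not from attention_kernels.cuh): hint the hardware to
// asynchronously fetch `bytes` of global memory starting at `ptr` into L2.
// `cp.async.bulk.prefetch.L2.global` requires Hopper (sm_90) or newer, so the
// helper compiles to a no-op on older architectures. `bytes` should be a
// multiple of 16.
__device__ __forceinline__ void prefetch_l2_async(const void* ptr, uint32_t bytes) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
  asm volatile("cp.async.bulk.prefetch.L2.global [%0], %1;"
               :
               : "l"(__cvta_generic_to_global(ptr)), "r"(bytes)
               : "memory");
#endif
}

// Hypothetical block loop: while attention for block `b` is being computed,
// one thread issues prefetch hints for block `b + 1`, so the L2 fetch of the
// next key/value cache block overlaps with the current block's computation.
__global__ void attention_block_loop(const char* k_cache, const char* v_cache,
                                     const int* block_table, int num_blocks,
                                     int block_bytes) {
  for (int b = 0; b < num_blocks; ++b) {
    if (threadIdx.x == 0 && b + 1 < num_blocks) {
      const int64_t next = static_cast<int64_t>(block_table[b + 1]) * block_bytes;
      prefetch_l2_async(k_cache + next, block_bytes);
      prefetch_l2_async(v_cache + next, block_bytes);
    }
    // ... compute attention scores and outputs for block `b` here ...
  }
}
```

Because the prefetch is only a hint, correctness does not depend on it; it simply gives the memory system a head start on the next block so that its access latency is hidden behind computation on the current block.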