# flash-attention

**Repository Path**: ceci3/flash-attention

## Basic Information

- **Project Name**: flash-attention
- **Description**: xxxxxxxxxxxxxx
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-11-08
- **Last Updated**: 2025-10-31

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# FlashAttention for vLLM

This is a fork of https://github.com/Dao-AILab/flash-attention customized for vLLM. It carries the following customizations:

- Build: CMake, torch library (this package is bundled into vLLM).
- Size: reduced templating and removal of training (backward-pass) kernels.
- Features: small page size support for the paged KV cache (FA2), DCP support (FA3); see the sketch after this list for how the paged-KV decode path is typically invoked.
- Performance: decode-specific optimizations for the sizes we care about, as well as mixed-batch performance optimizations.

(Upstream is understandably hesitant to specialize for inference, since it also has to support training; we, on the other hand, compile out the backward-pass kernels and do not test that our optimizations leave them intact.)
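As an illustration of the paged-KV decode path that the small-page-size and decode work targets, the sketch below calls upstream flash-attention's `flash_attn_with_kvcache` against a block-table-indexed KV cache. The tensor shapes, the `page_size = 16` choice, and the plain `flash_attn` import are illustrative assumptions; inside vLLM this fork is consumed through vLLM's own bindings rather than imported directly, and accepting a page size this small on FA2 is exactly what the fork's small-page-size support is for.

```python
# Minimal decode-step sketch against a paged KV cache (assumed shapes and sizes).
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, nheads_k, headdim = 4, 32, 8, 128   # GQA: 32 query heads, 8 KV heads
page_size = 16                                      # small page of the kind vLLM's block allocator uses
num_pages = 256
max_pages_per_seq = 8

# Paged KV cache: (num_pages, page_size, nheads_k, headdim)
k_cache = torch.randn(num_pages, page_size, nheads_k, headdim,
                      dtype=torch.float16, device="cuda")
v_cache = torch.randn_like(k_cache)

# One new query token per sequence (a decode step): (batch, 1, nheads, headdim)
q = torch.randn(batch, 1, nheads, headdim, dtype=torch.float16, device="cuda")

# Which cache pages each sequence owns, and how many tokens are already cached.
block_table = torch.randint(0, num_pages, (batch, max_pages_per_seq),
                            dtype=torch.int32, device="cuda")
cache_seqlens = torch.randint(1, page_size * max_pages_per_seq, (batch,),
                              dtype=torch.int32, device="cuda")

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
print(out.shape)  # (batch, 1, nheads, headdim)
```

Small pages keep KV-cache fragmentation low for vLLM's block allocator, which is why FA2 small-page support matters for this fork.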