# q8_kernels
**Repository Path**: ihavoc/q8_kernels
## Basic Information
- **Project Name**: q8_kernels
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-01-11
- **Last Updated**: 2025-01-11
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Q8 Kernels
---
Q8Kernels is an efficient implementation of 8-bit kernels (FP8 and INT8).
## Features
- 8-bit GEMM (with fused GELU and bias): 2x faster than cuBLAS FP8 and 3.5x faster than torch.mm
- FP8 Flash Attention 2 with Fast Hadamard Transform (also supports cross-attention masks): 2x faster than FlashAttention-2
- Mixed-precision Fast Hadamard Transform
- RMSNorm
- Mixed-precision FMA
- RoPE layer
- Quantizers

All operations are implemented in CUDA (see the usage sketch below).
The current version supports the Ada architecture (Ampere optimizations are coming soon!).
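As a rough illustration of how the fused 8-bit GEMM could be called from Python, here is a minimal sketch. The module name `q8_kernels`, the `quantize` helper, and the `q8_mm` call are assumptions for illustration only, not the library's confirmed API; check the package's Python bindings for the real names. The runnable part only reproduces the reference 16-bit behaviour the fused kernel is meant to accelerate.

```python
# Hypothetical usage sketch -- module and function names below are assumptions,
# not the confirmed q8_kernels API.
import torch

# import q8_kernels  # assumed import name

def q8_linear_sketch(x: torch.Tensor, w: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Intended flow: quantize inputs to 8 bit, run the fused GEMM
    (+ bias + GELU) kernel, and return a higher-precision output."""
    # x_q, x_scale = q8_kernels.quantize(x)   # assumed quantizer API
    # w_q, w_scale = q8_kernels.quantize(w)
    # return q8_kernels.q8_mm(x_q, w_q, x_scale, w_scale, bias=bias, fuse_gelu=True)
    # Reference behaviour the fused 8-bit kernel replaces:
    return torch.nn.functional.gelu(x @ w.t() + bias)

x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.zeros(4096, device="cuda", dtype=torch.float16)
y = q8_linear_sketch(x, w, b)
print(y.shape)  # torch.Size([16, 4096])
```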
## Installation
q8_kernels requires CUDA >= 12.4 and PyTorch >= 2.4.
q8_kernels has been tested on Windows; building on Linux should also work.
Install ninja: `pip install ninja`
Make sure that ninja is installed and works correctly (e.g. `ninja --version`); without ninja the build is very slow.
```bash
git clone https://github.com/KONAKONA666/q8_kernels
cd q8_kernels
git submodule init
git submodule update
python setup.py install
pip install . # for utility
```
It takes ~10-15 minutes to compile and install all modules.
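After installation, a quick sanity check is to confirm that the extension imports and that the GPU is an Ada-class device (compute capability 8.9), which the current version targets. The import name `q8_kernels` is assumed from the package name and may differ.

```python
import torch

# Assumed import name; adjust if the installed package exposes a different module.
import q8_kernels  # noqa: F401

assert torch.cuda.is_available(), "A CUDA device is required"
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
# The current release targets the Ada architecture (sm_89).
if (major, minor) != (8, 9):
    print("Warning: q8_kernels currently targets Ada (8.9); other architectures may not work.")
```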
## Supported models
Speed-ups are measured relative to 16-bit Transformers inference with FlashAttention-2.
|Model name | Speed up |
| -------- | -------- |
| [LTXVideo](https://github.com/KONAKONA666/LTX-Video)| up to 2.5x |
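A simple way to reproduce this kind of comparison is to time the 16-bit baseline with CUDA events and then substitute the 8-bit kernel for the marked line. The sketch below only times the `torch` baseline; the q8_kernels call is left as a commented placeholder because its exact API is not documented here.

```python
import torch

def time_op(fn, iters: int = 100) -> float:
    """Return average milliseconds per call, measured with CUDA events."""
    # Warm-up so lazy initialization does not skew the measurement.
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

baseline_ms = time_op(lambda: x @ w)
print(f"torch.mm fp16 baseline: {baseline_ms:.3f} ms")
# To compare, replace the lambda with the corresponding q8_kernels GEMM call
# (name assumed, see the project's Python bindings) and compute the ratio.
```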
## Acknowledgement
Thanks to:
- [Flash Attention](https://github.com/Dao-AILab/flash-attention/tree/main)
- [@66RING](https://github.com/66RING/tiny-flash-attention)
- [fast-hadamard-transform](https://github.com/Dao-AILab/fast-hadamard-transform)
- [CUTLASS](https://github.com/NVIDIA/cutlass)
- [@weishengying](https://github.com/weishengying): check his CUTE exercises and flash attention implementations
## Authors
KONAKONA666
## License
MIT