# CUDA_gemm

**Repository Path**: yuzyong/CUDA_gemm

## Basic Information

- **Project Name**: CUDA_gemm
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-08-23
- **Last Updated**: 2024-08-23

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## Introduction

A simple high-performance CUDA implementation of GEMM, block-sparse GEMM, and non-uniform quantized GEMM:

```
C = alpha * A * B + beta * C
```

## Algorithms

**Located in `src/cuda/`**

* MatrixMulCUDA
  * one element of C is assigned to one thread
  * coalesced global-memory access to B
* MatrixMulCUDA1
  * texture loads
* MatrixMulCUDA2
  * one 4 * 4 tile of C is assigned to one thread
* MatrixMulCUDA3
  * vectorized loads of A and B
* MatrixMulCUDA4
  * vectorized stores of C
* MatrixMulCUDA5
  * block-sparse version
* MatrixMulCUDA6
  * coalesced vectorized loads of A and B
* MatrixMulCUDA7
  * warp shuffle to enable coalesced stores of C
* MatrixMulCUDAQuantize8bit
  * 8-bit non-uniform quantized matmul

## Experiments

**Located in `benchmark/`**

* benchmark_dense
  * compares my GEMM with cuBLAS
* benchmark_sparse
  * compares my block-sparse GEMM with cuSPARSE
* benchmark_quantization_8bit
  * compares my GEMM with cuBLAS
* benchmark_quantization
  * compares my GEMM with my non-uniform 8-bit quantized GEMM

## TODO

* (MatrixMulCUDA7) write back to the C matrix using warp shuffle to enable coalesced global-memory stores
* (MatrixMulCUDA8) double buffering

## Run

```
mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh
```

## Notes

* When sparsity is about 1%, cuSPARSE can outperform cuBLAS.
* Allocate registers sensibly: let parameters be determined at compile time wherever possible, to save computation and register usage.
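The naive MatrixMulCUDA variant described above (one thread per element of C, with coalesced access to B) can be sketched roughly as follows. This is an illustrative sketch, not the repository's actual code; the kernel and parameter names are hypothetical, and matrices are assumed row-major.

```cuda
// Hypothetical sketch of the naive variant: each thread computes one
// element of C. Adjacent threads in a warp (threadIdx.x) read adjacent
// columns of B, so the loads of B are coalesced.
__global__ void matrixMulNaive(const float *A, const float *B, float *C,
                               int M, int N, int K,
                               float alpha, float beta) {
    int row = blockIdx.y * blockDim.y + threadIdx.y; // row index of C
    int col = blockIdx.x * blockDim.x + threadIdx.x; // column index of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            // B[k * N + col]: consecutive threads touch consecutive
            // addresses, giving a coalesced global-memory transaction.
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```

The later variants in the list improve on this baseline by assigning a tile of C per thread, vectorizing the loads and stores, and using warp shuffles to keep the write-back coalesced.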