# fast.cu **Repository Path**: hotheart1982/fast.cu ## Basic Information - **Project Name**: fast.cu - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-01-27 - **Last Updated**: 2026-01-27 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Fastest GPU kernels, written from scratch. ## Matrix Multiplication Matrix multiplication of square bf16 matrices, accumulated in fp32. ``` N=4096 Kernel: 763 TFLOPs cuBLAS: 716 TFLOPs N=8192 Kernel: 808 TFLOPs cuBLAS: 795 TFLOPs ``` Explanation in https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog ##### To run: ``` make matmul && out/matmul ``` Example kernels are in [`examples/matmul/`](https://github.com/pranjalssh/fast.cu/tree/main/examples/matmul) and orchestration is in [`matmul.cu`](https://github.com/pranjalssh/fast.cu/blob/main/matmul.cu) ## Sum reduction We compute sum of 2^30 elements. ##### To run: ``` make sum && out/sum ``` ``` Kernel: 3240.11 GB/s cub Library: 3193 GB/s ``` Example kernels are in [`sum.cu`](https://github.com/pranjalssh/fast.cu/tree/main/sum.cu)