# CUDATutorial

**Repository Path**: sunzhongqi2023/CUDATutorial

## Basic Information

- **Project Name**: CUDATutorial
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-07-21
- **Last Updated**: 2024-07-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CUDATutorial
A CUDA tutorial to make people learn CUDA program from 0

## test enviroment
Turing T4 GPU
## compile command

1. compile by hand

`nvcc xxx.cu -o xxx`

if that does not work, pls try:

`nvcc xxx.cu --gpu-architecture=compute_yy -o xxx`

xxx is file name, yy is GPU compute capability, ep.A100's compute capability is 86.

2. one-click compile and run

please ensure:

1.cmake version >= 3.8

2.you have CUDA TOOLKIT installed in system root directory, downloaded link is https://developer.nvidia.com/cuda-downloads.

```
 mkdir build 
 cd build 
 cmake .. && make -j8 
 cd bin 
 ./xxx
```
## remark
* related performance data is attached at the top of code file.
* the performance data is diverse and diverse on different GPU platforms and NVCC compiler, so some counter-intuitive result is normal, we should only explore and debug the result.
* welcome all comments and pull requests.

## update notes
### v2.0
* add cuda stream
* add quantize
### v2.1
* add fp32/fp16 gemv(vec * mat,mat is col major)
### v2.2
* add fp32/fp16 gemv(vec * mat,mat is row major)
* add some code explaination(WIP)
### v2.6
* add fp32 dropout