# cuda_fast

**Repository Path**: frankytom/cuda_fast

## Basic Information

- **Project Name**: cuda_fast
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-05-17
- **Last Updated**: 2022-03-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

This project explores efficient CUDA implementation techniques. It discusses how to use CUDA efficiently from the following aspects:

## Part of the code comes from "udacity-IntroToParallelProgramming"
GitHub path: https://github.com/nickspell/udacity-IntroToParallelProgramming

->hello_world
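This is the most basic example. It demonstrates how data is copied from the CPU to the GPU,
processed on the GPU, and how the result is then copied back from the GPU to the CPU.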
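
The round trip described above can be sketched as follows. This is a minimal illustration, not the repository's actual hello_world code; the kernel `square` and the helper `square_on_gpu` are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread squares one element.
__global__ void square(float *d_out, const float *d_in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = d_in[i] * d_in[i];
}

// Copies h_in to the GPU, runs the kernel, and copies the result into h_out.
// Returns true if every CUDA call succeeded.
bool square_on_gpu(const float *h_in, float *h_out, int n) {
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess &&
              cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        square<<<blocks, threads>>>(d_out, d_in, n);
        // cudaMemcpy on the default stream waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    const int n = 64;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i);
    if (!square_on_gpu(h_in, h_out, n)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("h_out[3] = %f\n", h_out[3]);
    return 0;
}
```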

->parallel_cpu_gpu
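While the GPU is executing a kernel, the CPU can use OpenMP to run the for loops in host code
across multiple threads, so that the GPU and the CPU work in parallel. Note that the CPU for loop
must be placed before any blocking call, such as cudaThreadSynchronize (deprecated in current CUDA
releases in favor of cudaDeviceSynchronize) or cudaMemcpy.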
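
A minimal sketch of this ordering is shown below. It assumes the kernel launch is asynchronous (which is the default behaviour) and uses hypothetical names (`scale`, `overlap_cpu_gpu`); it is not taken from the repository. If the file is built without OpenMP support the pragma is ignored and the loop simply runs on one thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by f.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

// Launches the kernel, then sums h_work on the CPU while the GPU is busy.
// Returns the CPU-side sum.
long long overlap_cpu_gpu(float *d_data, int n, const std::vector<int> &h_work) {
    // Kernel launches return control to the host immediately.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // CPU work placed BEFORE the blocking call, so it overlaps with the kernel.
    long long sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < (int)h_work.size(); ++i) sum += h_work[i];

    // Blocking call: wait for the GPU only after the CPU work is done.
    cudaDeviceSynchronize();
    return sum;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_data(n, 1.0f);
    std::vector<int> h_work(1000, 1);
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    long long sum = overlap_cpu_gpu(d_data, n, h_work);

    cudaMemcpy(h_data.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("cpu sum = %lld, gpu[0] = %f\n", sum, h_data[0]);
    cudaFree(d_data);
    return 0;
}
```

Assuming GCC as the host compiler, OpenMP is typically enabled with `nvcc -Xcompiler -fopenmp`; other host compilers use different flags.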

->pinned_memory
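Page-locked (pinned) memory. Page-locking keeps the allocated host memory resident in physical
memory, which makes copies between the host and the GPU faster. Because pinned pages cannot be
swapped out and reduce the physical memory available to the rest of the system, this approach is
recommended for temporary buffers only.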
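
The sketch below compares one host-to-device copy from pageable memory (`malloc`) with one from pinned memory (`cudaMallocHost`). The helper `time_h2d` and the 64 MiB buffer size are assumptions made for this example; the measured times depend on the machine, so no expected numbers are given.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Times one host-to-device copy in milliseconds; returns -1 on error.
float time_h2d(void *d_dst, const void *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms : -1.0f;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary size for the demo
    void *d_buf = nullptr;
    void *h_pinned = nullptr;
    void *h_pageable = malloc(bytes);  // ordinary, pageable host memory

    if (!h_pageable ||
        cudaMalloc(&d_buf, bytes) != cudaSuccess ||
        cudaMallocHost(&h_pinned, bytes) != cudaSuccess) {  // page-locked host memory
        printf("allocation failed\n");
        return 1;
    }
    memset(h_pageable, 1, bytes);
    memset(h_pinned, 1, bytes);

    printf("pageable: %.3f ms\n", time_h2d(d_buf, h_pageable, bytes));
    printf("pinned:   %.3f ms\n", time_h2d(d_buf, h_pinned, bytes));

    // Release the pinned buffer as soon as it is no longer needed.
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    free(h_pageable);
    return 0;
}
```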

->ILP_TLP
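Instruction-level parallelism (ILP) and thread-level parallelism (TLP). The Visual Profiler (nvvp)
is used to find, for each GPU model, the most suitable number of parallel threads and the most
suitable number of independent instructions per thread.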
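
One common way to trade TLP for ILP is to let each thread process several independent elements. The sketch below assumes this approach; the template parameter `ILP`, the kernel `saxpy_ilp`, and the helper `run_saxpy` are hypothetical names, and the best value of `ILP` for a given GPU has to be found by profiling.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles ILP elements. The loads in the first loop do not depend
// on each other, so the hardware can keep several of them in flight at once.
template <int ILP>
__global__ void saxpy_ilp(float *y, const float *x, float a, int n) {
    int base = blockIdx.x * blockDim.x * ILP + threadIdx.x;
    float xv[ILP], yv[ILP];
#pragma unroll
    for (int k = 0; k < ILP; ++k) {
        int i = base + k * blockDim.x;
        xv[k] = (i < n) ? x[i] : 0.0f;
        yv[k] = (i < n) ? y[i] : 0.0f;
    }
#pragma unroll
    for (int k = 0; k < ILP; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) y[i] = a * xv[k] + yv[k];
    }
}

// Computes y = a * x + y on the GPU; assumes x.size() == y.size() > 0.
template <int ILP>
bool run_saxpy(std::vector<float> &y, const std::vector<float> &x, float a) {
    int n = (int)y.size();
    size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_y, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);
        int threads = 256;
        // Fewer blocks are needed as ILP grows: less TLP, more ILP.
        int blocks = (n + threads * ILP - 1) / (threads * ILP);
        saxpy_ilp<ILP><<<blocks, threads>>>(d_y, d_x, a, n);
        ok = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}

int main() {
    std::vector<float> x(1000, 1.0f), y1(1000, 2.0f), y4(1000, 2.0f);
    bool ok = run_saxpy<1>(y1, x, 3.0f) && run_saxpy<4>(y4, x, 3.0f);
    printf("ok = %d, ILP=1: %f, ILP=4: %f\n", (int)ok, y1[999], y4[999]);
    return ok ? 0 : 1;
}
```

Both instantiations produce the same result; only the launch configuration and the amount of independent work per thread differ, which is what a profiler run would compare.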

->warp_bifurcation
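Warp divergence is one of the most visible causes of performance loss in CUDA programming. All
threads in the same warp must follow the same execution path; when they take different branches,
the branches are executed one after another, so the faster threads have to wait for the slowest.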
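
The two kernels below illustrate the difference. In `branch_per_thread` neighbouring threads of one warp take different branches, so the warp executes both paths; in `branch_per_warp` the condition is the same for all threads of a warp. All names here are hypothetical and the branch bodies are arbitrary loops chosen only to make each path non-trivial.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Two arbitrary code paths; path_a yields a positive value, path_b a negative one.
__device__ float path_a(float x) {
    for (int k = 0; k < 64; ++k) x = x * 1.0001f + 0.5f;
    return x;
}
__device__ float path_b(float x) {
    for (int k = 0; k < 64; ++k) x = x * 0.9999f - 0.5f;
    return x;
}

// Divergent: even and odd threads of the same warp take different branches.
__global__ void branch_per_thread(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = path_a(0.0f);
    else                      out[i] = path_b(0.0f);
}

// Uniform: the condition depends on the warp index, so a warp never splits.
__global__ void branch_per_warp(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = path_a(0.0f);
    else                                   out[i] = path_b(0.0f);
}

// Runs one of the kernels and counts how many elements took path_a.
// Returns -1 on a CUDA error.
int count_positive(bool uniform, int n) {
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;
    int threads = 256, blocks = (n + threads - 1) / threads;
    if (uniform) branch_per_warp<<<blocks, threads>>>(d_out, n);
    else         branch_per_thread<<<blocks, threads>>>(d_out, n);
    std::vector<float> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    int count = 0;
    for (float v : h) if (v > 0.0f) ++count;
    return count;
}

int main() {
    const int n = 1 << 20;
    printf("per-thread branch: %d elements on path_a\n", count_positive(false, n));
    printf("per-warp branch:   %d elements on path_a\n", count_positive(true, n));
    return 0;
}
```

Both kernels do the same amount of useful work (half of the elements go through each path); timing them in a profiler is what would show the cost of divergence on a given GPU.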

->task_parallel
Task parallelism: multiple kernels are executed concurrently to reduce serial execution time and the time spent on data transfers.
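
A common way to express this in CUDA is with streams, which is what the sketch below assumes; the repository may organise its tasks differently. The names `add_one` and `process_in_streams` and the choice of four streams are illustrative. The host buffer should be pinned so that `cudaMemcpyAsync` can actually overlap with kernel execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Splits h_data into chunks and gives each chunk its own stream, so the copy
// of one chunk can overlap with the kernel of another.
bool process_in_streams(float *h_data, int n) {
    const int kStreams = 4;
    int chunk = (n + kStreams - 1) / kStreams;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return false;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        int offset = s * chunk;
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        size_t bytes = count * sizeof(float);
        // Operations within one stream run in order; different streams may overlap.
        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        add_one<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, count);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return ok;
}

int main() {
    const int n = 1 << 20;
    float *h_data = nullptr;
    if (cudaMallocHost(&h_data, n * sizeof(float)) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) h_data[i] = 0.0f;
    bool ok = process_in_streams(h_data, n);
    printf("ok = %d, h_data[0] = %f\n", (int)ok, h_data[0]);
    cudaFreeHost(h_data);
    return ok ? 0 : 1;
}
```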

->select_storage
Choose the appropriate memory type to maximize data reuse and bandwidth utilization.
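
As one illustration of this idea, the sketch below stages data in shared memory so that each value is read from global memory once and then reused on-chip during a block-level sum. The kernel `block_sum`, the helper `gpu_sum`, and the block size of 256 are assumptions for this example, not the repository's code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // block size; must be a power of two for this reduction

// Each block loads THREADS elements into shared memory and reduces them there.
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float tile[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // single read from global memory
    __syncthreads();
    // Tree reduction: all further reads and writes hit shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Sums h_in on the GPU; per-block partial sums are added on the host.
// Returns -1 on a CUDA error; assumes h_in is not empty.
float gpu_sum(const std::vector<float> &h_in) {
    int n = (int)h_in.size();
    int blocks = (n + THREADS - 1) / THREADS;
    float *d_in = nullptr, *d_partial = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_partial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<blocks, THREADS>>>(d_in, d_partial, n);
    std::vector<float> h_partial(blocks);
    cudaError_t err = cudaMemcpy(h_partial.data(), d_partial,
                                 blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    if (err != cudaSuccess) return -1.0f;
    float total = 0.0f;
    for (float p : h_partial) total += p;
    return total;
}

int main() {
    std::vector<float> data(1000, 1.0f);
    printf("sum = %f\n", gpu_sum(data));
    return 0;
}
```

Other memory types (constant memory for read-only parameters, registers for per-thread values) follow the same principle: keep frequently reused data as close to the threads as possible.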