# inference-hw1

**Repository Path**: tom0078/inference-hw1

## Basic Information

- **Project Name**: inference-hw1
- **Description**: No description available
- **Primary Language**: Other
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2026-01-15
- **Last Updated**: 2026-01-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# SCNN-CUDA: A CUDA Implementation of Spiking Convolutional Neural Network Inference

> **UCAS "GPU Architecture and Programming", Fall 2025, Course Project 1**

An inference program for a Spiking Convolutional Neural Network (SCNN), implemented in CUDA. It uses the LeNet architecture and classifies images from the Fashion MNIST dataset.

## Project Overview

| Item | Description |
|------|------|
| Network architecture | LeNet-5 + IF spiking neurons |
| Dataset | Fashion MNIST (10 classes, 28×28 grayscale images) |
| Time steps | 8 (`T_TIMESTEPS`) |
| Evaluation platform | NVIDIA V100 GPU |

## Network Structure

```
Input(1×28×28) → Conv1(6×24×24) → Pool1(6×12×12) → Conv2(16×8×8) → Pool2(16×4×4) → FC1(120) → FC2(84) → FC3(10)
```

Every convolutional and fully connected layer, except the output layer, is followed by an IF (Integrate-and-Fire) spiking neuron.

## CUDA Optimization Strategy

### Core Optimization Techniques

| Technique | Implementation |
|---------|---------|
| **Constant memory** | Convolution weights and biases are stored in `__constant__` memory |
| **Shared memory** | All intermediate results are kept in shared memory (~44 KB per block, as laid out below) |
| **Vectorized memory access** | `float4` loads fetch four values at a time to improve bandwidth utilization |
| **Loop unrolling** | `#pragma unroll` on the convolution and pooling loops |
| **Read-only cache loads** | `__ldg()` reads weights through the read-only (texture) data cache |
| **Batching** | `BATCH_SIZE=2048`, which reduces kernel launch overhead |
| **Asynchronous transfers** | Streams + `cudaMemcpyAsync` overlap copies with computation |
| **Pinned memory** | `cudaHostRegister` speeds up H2D/D2H transfers |

### Shared Memory Layout

```c++
// Total: 11,282 floats = 45,128 bytes (~44 KB); fits within the V100's 96 KB of shared memory
extern __shared__ float shared_mem[];
float* s_input     = shared_mem;              // 784 floats
float* s_conv1_pot = s_input + 784;           // 3456 floats (membrane potential)
float* s_conv1_spk = s_conv1_pot + 3456;      // 3456 floats (spikes)
float* s_pool1     = s_conv1_spk + 3456;      // 864 floats
float* s_conv2_pot = s_pool1 + 864;           // 1024 floats
float* s_conv2_spk = s_conv2_pot + 1024;      // 1024 floats
float* s_pool2     = s_conv2_spk + 1024;      // 256 floats
float* s_fc1_pot   = s_pool2 + 256;           // 120 floats
float* s_fc1_spk   = s_fc1_pot + 120;         // 120 floats
float* s_fc2_pot   = s_fc1_spk + 120;         // 84 floats
float* s_fc2_spk   = s_fc2_pot + 84;          // 84 floats
float* s_out       = s_fc2_spk + 84;          // 10 floats
```

### Kernel Design

The implementation uses a **single fused kernel**, with one block processing one sample:

```c++
__global__ __launch_bounds__(TILE_THREADS, 2)
void scnn_infer_kernel_optimized(...) {
    const int sample = blockIdx.x;  // each block processes one sample

    // Time-step loop
    for (int t = 0; t < T_TIMESTEPS; ++t) {
        // Conv1 + IF → Pool1 → Conv2 + IF → Pool2 → FC1 + IF → FC2 + IF → FC3
        // __syncthreads() synchronizes the block after each layer
    }
}
```

### IF Neuron Model

```c++
// Integrate-and-Fire neuron
float V = membrane_potential + input_current;   // integrate
float spike = (V > THRESHOLD) ? 1.0f : 0.0f;    // threshold test
membrane_potential = V * (1.0f - spike);        // reset to 0 after firing
```

### Vectorized Fully Connected Computation

```c++
// float4 vectorized loads, with weights read through __ldg().
// Requires 16-byte-aligned pointers and size to be a multiple of 4.
const float4* w4 = reinterpret_cast<const float4*>(weight_row);
const float4* x4 = reinterpret_cast<const float4*>(input_data);
for (int k = 0; k < size / 4; ++k) {
    float4 w = __ldg(&w4[k]);
    float4 x = x4[k];
    acc += w.x * x.x + w.y * x.y + w.z * x.z + w.w * x.w;
}
```

## Tunable Parameters

```c++
#define TILE_THREADS 256   // threads per block
#define T_TIMESTEPS  8     // SCNN time steps
#define THRESHOLD    1.0f  // IF neuron threshold
#define BATCH_SIZE   2048  // batch size
```

## Build and Run

### Build

```bash
nvcc -O3 -arch=sm_70 -o scnn inference.cu
```

### Run

```bash
./scnn
```

### Output Format

```
<runtime (seconds)>:<accuracy>
```

Example: `0.038:0.8978`

## File Structure

```
├── README.md      # project documentation
└── inference.cu   # CUDA source code
```

## References

- [SpikingJelly documentation](https://spikingjelly.readthedocs.io/zh-cn/latest/activation_based/conv_fashion_mnist.html)
- CUDA C++ Programming Guide
- NVIDIA V100 Architecture Whitepaper
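
The shared-memory layout and the IF neuron rule described above can be cross-checked on the host without a GPU. The following is a minimal C++ sketch, not code from `inference.cu`: the constant names (`kInput`, `kSharedFloats`, ...) and the helpers `if_step` and `count_spikes` are hypothetical and introduced only for illustration. It derives each buffer size from the layer shapes in the network diagram, confirms that the buffers as listed add up to 11,282 floats (45,128 bytes), and replays the IF update with the README's threshold of 1.0. The 48 KiB check reflects the CUDA rule that a kernel needs an explicit opt-in (`cudaFuncSetAttribute`) to request more dynamic shared memory than that per block.

```cpp
#include <cassert>
#include <cstddef>

// Buffer sizes derived from the layer shapes in the network diagram
// (channels x height x width). All names here are illustrative assumptions.
constexpr std::size_t kInput = 1 * 28 * 28;  // 784
constexpr std::size_t kConv1 = 6 * 24 * 24;  // 3456
constexpr std::size_t kPool1 = 6 * 12 * 12;  // 864
constexpr std::size_t kConv2 = 16 * 8 * 8;   // 1024
constexpr std::size_t kPool2 = 16 * 4 * 4;   // 256
constexpr std::size_t kFc1 = 120;
constexpr std::size_t kFc2 = 84;
constexpr std::size_t kOut = 10;

// Layers followed by an IF neuron hold two buffers: potential and spikes.
constexpr std::size_t kSharedFloats =
    kInput + 2 * kConv1 + kPool1 + 2 * kConv2 + kPool2 +
    2 * kFc1 + 2 * kFc2 + kOut;
constexpr std::size_t kSharedBytes = kSharedFloats * sizeof(float);

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");
static_assert(kSharedFloats == 11282, "sum of the buffers listed in the layout");
// Below 48 KiB, so no opt-in for larger dynamic shared memory is needed.
static_assert(kSharedBytes <= 48 * 1024, "layout exceeds 48 KiB");

constexpr float kThreshold = 1.0f;  // THRESHOLD in the tunable parameters

// One IF update: integrate the input, fire if V is strictly above the
// threshold, and hard-reset the potential to 0 after a spike.
inline float if_step(float& membrane_potential, float input_current) {
    const float v = membrane_potential + input_current;
    const float spike = (v > kThreshold) ? 1.0f : 0.0f;
    membrane_potential = v * (1.0f - spike);
    return spike;
}

// Drives a single neuron with a constant current and counts its spikes.
inline int count_spikes(float input_current, int steps) {
    assert(steps >= 0);
    float v = 0.0f;
    int spikes = 0;
    for (int t = 0; t < steps; ++t) {
        spikes += static_cast<int>(if_step(v, input_current));
    }
    return spikes;
}
```

Calling `count_spikes(0.5f, 8)` makes the strict comparison visible: the potential reaches exactly 1.0 on the second step without firing, so the neuron only spikes on every third step. On the GPU side, the same `kSharedBytes` value is what would be passed as the dynamic shared-memory size in the kernel launch configuration, assuming the kernel uses the layout exactly as listed.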