# cpu-gpu-ndp-work

**Repository Path**: liu_sh/cpu-gpu-ndp-work

## Basic Information

- **Project Name**: cpu-gpu-ndp-work
- **License**: MIT
- **Default Branch**: main

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-08
- **Last Updated**: 2025-08-08

## README

# CPU/GPU Memory & Near-Data Processing Assignments

This repository contains my assignments from CS 6501: CPU/GPU Memory & Near-Data Processing at UVA (Spring '25), taught by Prof. Kevin Skadron. The course focuses on CPU/GPU memory architecture, cache design, DRAM simulation, GPU programming, and near-data processing (PIM). Each assignment applies industry-standard tools to analyze, simulate, and optimize real-world memory and processing behavior.

## 📁 Assignments Overview

### [HW1: Roofline Model Analysis](./roofline-model-analysis)
- **Assignment PDF:** [HW1 Assignment](./roofline-model-analysis/Roofline_Assignment.pdf)
- **Report:** [HW1 Report](./roofline-model-analysis/HuyNguyen_Roofline_Report.pdf)
- Analysis of memory and compute bottlenecks across multiple matrix/vector kernels using Intel Advisor's Roofline model. The assignment involved:
  - Profiling 10 distinct matrix/vector implementations with varying optimization levels
  - Generating Roofline plots to visualize performance bottlenecks
  - Measuring INTOPS/sec and arithmetic intensity across different implementations
  - Identifying the ridge point where code transitions from memory-bound to compute-bound

**Tools:** Intel Advisor, C++, Roofline visualization

---

### [HW2: Cache Design with CACTI](./cache-design-cacti)
- **Assignment PDF:** [HW2 Assignment](./cache-design-cacti/CACTI_Assignment.pdf)
- **Report:** [HW2 Report](./cache-design-cacti/HuyNguyen_CACTI_Report.pdf)
- Systematic exploration of cache design tradeoffs using the CACTI cache simulator. Key aspects:
  - Parameter sweeps across cache sizes (16KB to 8MB) and associativity (1-way to 16-way)
  - Analysis of access time, area, energy consumption, and data efficiency
  - Examination of technology node impact (65nm vs. 32nm) on cache performance
  - Determination of optimal configurations for both L1 and LLC caches

**Tools:** CACTI 7.0, Bash scripting, data visualization

---

### [HW3: DRAM Simulation with DRAMsim3](./DRAM-simulation)
- **Assignment PDF:** [HW3 Assignment](./DRAM-simulation/DRAM_Assignment.pdf)
- **Report:** [HW3 Report](./DRAM-simulation/HuyNguyen_DRAM_Report.pdf)
- Comprehensive simulation of various DRAM technologies under different memory access patterns:
  - Comparison of DDR4, LPDDR4, GDDR6, and HBM2 under random, streaming, and mixed patterns
  - Analysis of bandwidth scaling, energy consumption, and latency characteristics
  - Detailed examination of command-level activity distribution (ACT, PRE, RD/WR)
  - DRAM selection recommendations for power-constrained vs. performance-driven scenarios

**Tools:** DRAMsim3, Python for data processing, JSON-to-CSV conversion

---

### [HW4: GPU Programming with CUDA](./cuda-programming)
- **Assignment PDF:** [HW4 Assignment](./cuda-programming/cuda_assignment.pdf)
- **Report:** [HW4 Report](./cuda-programming/HuyNguyen_CUDA_Report.pdf)
- Implementation and optimization of parallel algorithms using NVIDIA CUDA:
  - Development of matrix addition, matrix multiplication, and parallel reduction kernels
  - Implementation of shared memory optimizations and cooperative thread strategies
  - Performance evaluation using CUDA events and nvprof profiling
  - Comparative analysis between optimized GPU implementations and CPU baselines

**Tools:** NVIDIA CUDA Toolkit, nvcc compiler, nvprof, CUDA events timing

---

### [HW5: PIM Programming with PIMeval-PIMbench](./pim-programming)
- **Assignment PDF:** [HW5 Assignment](./pim-programming/assignment_PIMeval.pdf)
- **Report:** [HW5 Report](./pim-programming/HuyNguyen_PIM_Report.pdf)
- Exploration of near-data processing using UVA's PIMeval-PIMbench simulator:
  - Implementation of RMS Norm and Layer Norm algorithms for the PIM architecture
  - Performance analysis across varying HBM configurations (1-32 computing banks)
  - Energy efficiency analysis of PIM vs. traditional CPU implementations
  - Evaluation of parallelism scalability and resource utilization in the PIM context

**Tools:** PIMeval-PIMbench, C++ for kernel implementation, OpenMP, HBM modeling

---

## 🧰 Technical Environment

- **Intel Advisor:** Roofline modeling and performance characterization
- **CACTI 7.0:** Cache architecture simulation and power/area analysis
- **DRAMsim3:** DRAM timing and energy simulation
- **NVIDIA CUDA Toolkit:** GPU kernel development and profiling
- **PIMeval-PIMbench:** Near-memory processing simulation framework
- **Supporting tools:** Python for data analysis, visualization libraries, shell scripting

## 📌 Repository Structure

Each assignment folder contains:
- Source code and implementations
- Configuration files and execution scripts
- Results and analysis visualizations
- Detailed technical reports

---

## 🔍 License

This repository is licensed under the [MIT License](LICENSE).
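
## 🧮 Roofline Ridge Point Sketch

The HW1 description mentions identifying the ridge point where code moves from memory-bound to compute-bound. The following is a minimal sketch of that idea, not code taken from the assignment or from Intel Advisor. It assumes the standard Roofline formulation, where attainable performance is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. The function names and the machine numbers (`PEAK_INTOPS`, `PEAK_BANDWIDTH`) are hypothetical and chosen only for illustration; they are not results from the report.

```python
# Hypothetical machine parameters (illustrative only, not measured values).
PEAK_INTOPS = 100e9     # peak integer ops per second
PEAK_BANDWIDTH = 25e9   # peak memory bandwidth in bytes per second


def ridge_point(peak_ops: float, peak_bw: float) -> float:
    """Arithmetic intensity (ops/byte) at which the two roofs meet."""
    return peak_ops / peak_bw


def attainable_performance(intensity: float, peak_ops: float, peak_bw: float) -> float:
    """Roofline bound: min(compute roof, intensity * bandwidth roof)."""
    return min(peak_ops, intensity * peak_bw)


def classify(intensity: float, peak_ops: float, peak_bw: float) -> str:
    """Label a kernel by which roof limits it at the given intensity."""
    if intensity < ridge_point(peak_ops, peak_bw):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    # Sweep a few arithmetic intensities across the ridge point.
    for ai in (0.5, 1.0, 4.0, 8.0):
        bound = attainable_performance(ai, PEAK_INTOPS, PEAK_BANDWIDTH)
        label = classify(ai, PEAK_INTOPS, PEAK_BANDWIDTH)
        print(f"AI={ai:>4} ops/byte  bound={bound:.3e} ops/s  {label}")
```

With these assumed numbers the ridge point falls at 4 ops/byte: kernels below that intensity are limited by memory bandwidth, and kernels at or above it are limited by the compute roof. Real values would come from the profiled machine.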