# hoti-2025-gpu-comms-tutorial

**Repository Path**: mirrors_NVIDIA/hoti-2025-gpu-comms-tutorial

## Basic Information

- **Project Name**: hoti-2025-gpu-comms-tutorial
- **Description**: Tutorial Exercises and Code for GPU Communications Tutorial at HOT Interconnects 2025
- **License**: BSD-3-Clause
- **Default Branch**: main

## README

# GPU Communication Libraries for Accelerating HPC and AI Applications

This repository accompanies the interactive HOTI 2025 tutorial on GPU communication libraries, covering the NVIDIA Collective Communication Library (NCCL) and NVSHMEM (including its Python bindings). It contains hands-on labs with ready-to-build examples and reference solutions.

Links:

- Tutorial homepage: [GPU Communication Libraries for Accelerating HPC and AI Applications @ HotI 2025](https://hoti.org/tutorials-nccl-nvshmem.html)
- Video recording: [YouTube](https://www.youtube.com/watch?v=rlA5QreHekk&list=PLBM5Lly_T4yRGBFgforeMTDpjasC_PV7r&index=31)

## Prerequisites

- NVIDIA GPUs with CUDA support (Ampere or newer recommended)
- CUDA Toolkit (12.x recommended)
- An MPI implementation (e.g., Open MPI or MPICH)
- NCCL installed and visible to your toolchain
- NVSHMEM installed (for the C/C++ labs) and the NVSHMEM Python runtime (for the Python labs)
- Python 3.9+ for the NVSHMEM Python labs

## Environment Setup

Set the following environment variables so the build system and runtime can find CUDA, NCCL, and NVSHMEM. The paths below are examples; adjust them to your system.
```bash
export NVSHMEM_HOME=/path/to/nvshmem/build
export NCCL_HOME=/path/to/nccl-src/build
export LD_LIBRARY_PATH=$NCCL_HOME/lib:$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
export PATH=$NVSHMEM_HOME/bin:$PATH
export CPATH=$NCCL_HOME/include:$NVSHMEM_HOME/include:$CPATH
```

Note that `NVSHMEM_HOME` should point at the install prefix itself (not its `lib` subdirectory), since the `lib`, `bin`, and `include` directories are appended to it above.

You may also need `CUDA_HOME` if it is not set by your environment modules:

```bash
export CUDA_HOME=/usr/local/cuda
```

Verify your toolchain:

```bash
nvcc --version
mpicxx --version || mpicc --version
python3 -V
```

## Repository Structure

```
nccl/
  lab1/    # NCCL basics (unsolved + solved)
  lab3/    # Jacobi with NCCL (unsolved + solved)
  lab5/    # NCCL symmetric memory kernels (unsolved + solved)
nvshmem/
  lab2/    # NVSHMEM basics (C++/CUDA, unsolved + solved)
  lab4/    # Jacobi with NVSHMEM (unsolved + solved)
  lab6/    # NVSHMEM Python bindings (put, put_signal)
```

Each lab includes a `Makefile` with standard targets to build and run.

## Building and Running

Unless noted otherwise, the examples assume 2–4 GPUs on a single node. Control the number of MPI processes with `NP` and the visible GPUs with `CUDA_VISIBLE_DEVICES`.
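As a sketch of how these two knobs relate, you can derive `NP` from the device list so that each MPI rank gets exactly one GPU. The variable names `NP` and `CUDA_VISIBLE_DEVICES` come from the labs' Makefiles; the one-liner itself is illustrative and not part of the repository:

```shell
# Illustrative only: count the GPUs listed in CUDA_VISIBLE_DEVICES
# and launch one MPI rank per visible GPU.
export CUDA_VISIBLE_DEVICES=0,1,2,3
NP=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
echo "launching with NP=$NP"   # NP=4 for the device list above
# then, for example:  NP=$NP make run
```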
### NCCL Labs

- `nccl/lab3` (Jacobi):

  ```bash
  cd nccl/lab3
  make jacobi          # build unsolved version
  make jacobi_solved   # build reference solution
  make run             # run unsolved (default NP=4)
  make run_solved      # run solved (default NP=4)
  ```

- `nccl/lab5` (Symmetric kernels):

  ```bash
  cd nccl/lab5
  make nccl_symmetric          # build unsolved
  make nccl_symmetric_solved   # build reference solution
  make run                     # run unsolved (default NP=4)
  make run_solved              # run solved (default NP=4)
  ```

### NVSHMEM Labs (C++/CUDA)

- `nvshmem/lab2` (Basics):

  ```bash
  cd nvshmem/lab2
  make       # build
  make run   # run (default NP=4)
  ```

- `nvshmem/lab4` (Jacobi):

  ```bash
  cd nvshmem/lab4
  make jacobi          # build unsolved
  make jacobi_solved   # build reference solution
  make run             # run unsolved (default NP=1 unless NP is set)
  make run_solved      # run solved
  ```

### NVSHMEM Python Labs

- `nvshmem/lab6`:
  - Install the Python dependencies:

    ```bash
    cd nvshmem/lab6
    pip install -r requirements.txt
    ```

  - Run the Python example with two processes on two GPUs:

    ```bash
    make run   # runs: CUDA_VISIBLE_DEVICES=0,1 $(JSC_SUBMIT_CMD) -n 2 python3 put_signal.py
    ```

Notes:

- Some Makefiles rely on `JSC_SUBMIT_CMD` (a cluster launcher wrapper), because the tutorial was hosted on hardware from Forschungszentrum Jülich (JSC). On a workstation, set `JSC_SUBMIT_CMD` to `mpirun` or `srun` as appropriate, e.g.:

  ```bash
  export JSC_SUBMIT_CMD=mpirun
  ```

- You can override `NP` at invocation time: `NP=8 make run`.

## Troubleshooting

- Ensure `LD_LIBRARY_PATH` contains the CUDA, NCCL, and NVSHMEM `lib` directories.
- If `NVSHMEM_HOME` is required by a Makefile, confirm it is set and points to a valid install.
- Match `NP` to the number of GPUs specified by `CUDA_VISIBLE_DEVICES`.
- For NVSHMEM Python, verify that the `libnvidia-nvshmem-cu12` and `cuda-python` versions are compatible with your CUDA driver/runtime.
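The first two checks above can be scripted. The snippet below is a minimal sketch: it assumes `NCCL_HOME` and `NVSHMEM_HOME` follow the layout from the Environment Setup section, and the shared-library file names (`libnccl.so`, `libnvshmem_host.so`) are assumptions that may differ between NCCL/NVSHMEM versions:

```shell
# Illustrative sanity check: confirm the expected shared libraries exist
# under the install prefixes. If unset, fall back to the placeholder paths
# from the Environment Setup section.
: "${NCCL_HOME:=/path/to/nccl-src/build}"
: "${NVSHMEM_HOME:=/path/to/nvshmem/build}"
missing=0
for lib in "$NCCL_HOME/lib/libnccl.so" "$NVSHMEM_HOME/lib/libnvshmem_host.so"; do
  if [ -e "$lib" ]; then
    echo "found:   $lib"
  else
    echo "MISSING: $lib"
    missing=$((missing + 1))
  fi
done
echo "missing libraries: $missing"
```

Any `MISSING` line means the corresponding prefix (or the library name for your installed version) needs to be corrected before `LD_LIBRARY_PATH` will resolve at run time.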
## Credits and Attribution

This material was presented as an interactive tutorial at Hot Interconnects 2025 (HOTI 2025):

- Tutorial homepage: [hoti.org/tutorials-nccl-nvshmem.html](https://hoti.org/tutorials-nccl-nvshmem.html)
- Recording: [YouTube](https://www.youtube.com/watch?v=rlA5QreHekk&list=PLBM5Lly_T4yRGBFgforeMTDpjasC_PV7r&index=31)

The tutorial was co-hosted by NVIDIA and Forschungszentrum Jülich (JSC), which supported the workshop by supplying hardware access for participants.