# cupti_profiler

**Repository Path**: zksainx/cupti_profiler

## Basic Information

- **Project Name**: cupti_profiler
- **Description**: cupti profiler pybind
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-23
- **Last Updated**: 2026-02-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CUPTI Profiler for PyTorch

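This is a high-performance GPU profiling toolkit built on NVIDIA CUPTI (CUDA Profiling Tools Interface), designed for optimizing deep learning models (PyTorch in particular) and CUDA kernels.

The project consists of three core modules. Each is exposed through Python bindings (pybind11) with a small interface that integrates easily into existing Python/PyTorch workflows.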

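## Core Modules

### 1. AutoRange Profiling (`autorange_profiling`)
**Purpose**: Collect kernel-level hardware performance metrics.
- **Functionality**: Automatically detects CUDA kernel execution ranges and collects the specified CUPTI metrics (e.g. `smsp__cycles_elapsed.avg`, `dram__bytes.sum`).
- **Use cases**: Analyzing a kernel's compute intensity, memory bandwidth utilization, cache hit rate, and similar characteristics.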

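### 2. PC Sampling (`pc_sampling`)
**Purpose**: Instruction-level hotspot analysis.
- **Functionality**: Samples the GPU program counter (PC) at a fixed frequency and records per-instruction sample counts and warp stall reasons.
- **Use cases**: Locating bottlenecks inside a kernel and determining whether stalls come from compute latency, memory dependencies, or other causes.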

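### 3. Activity Trace (`activity_trace`)
**Purpose**: Measure the execution time (duration) of CUDA kernels.
- **Functionality**: Uses the CUPTI Activity API to asynchronously record the execution time of every CUDA kernel, with substring name matching and summary statistics.
- **Use cases**: Quickly measuring the average execution time of a specific kernel, similar to PyTorch's `torch.profiler` but lighter and with lower overhead.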

---

## Requirements

- **OS**: Linux
- **CUDA Toolkit**: 12.x
- **GPU Compute Capability**: 8.0+ (Ampere, Ada, Hopper)
- **Python**: 3.7+
- **Dependencies**: `pybind11`

## Installation

All three modules are packaged for `pip install`. Once installed, they can be imported from any location.

### Install All Modules at Once

```bash
cd cupti_profiler
./init_all.sh
```

### Install a Single Module

```bash
cd activity_trace
./init.sh
```

```bash
cd autorange_profiling
./init.sh
```

```bash
cd pc_sampling
./init.sh
```

### Custom CUDA Path

If CUDA is not installed at `/usr/local/cuda`:

```bash
export CUDA_INSTALL_PATH=/path/to/your/cuda
./init_all.sh
```

### Uninstall

```bash
pip uninstall activity_trace autorange_profiling pc_sampling
```

---

## Usage Guide and API Reference

### 1. AutoRange Profiling

#### API Parameters

**`ProfilerSession(metrics, device_id=0)`**

*   **`metrics`** (list[str], **required**): List of CUPTI metric names to collect.
    *   Example: `['smsp__cycles_elapsed.avg', 'dram__bytes.sum']`
    *   For metric names, see the [NVIDIA Nsight Compute Metrics Guide](https://docs.nvidia.com/nsight-compute/NsightComputeMetricsGuide/index.html).
*   **`device_id`** (int, optional): ID of the CUDA device to profile.
    *   Default: `0`

#### Usage Example

```python
import torch
import autorange_profiling as arp

# Stand-in workload for illustration; replace with your own model and input
model = torch.nn.Linear(1024, 1024).cuda()
input_data = torch.randn(64, 1024, device='cuda')

# 1. Initialize the profiler
# metrics: required, the hardware metrics to collect
profiler = arp.ProfilerSession(
    metrics=[
        'smsp__cycles_elapsed.avg',
        'dram__bytes.sum',
        'l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum'
    ],
    device_id=0                           # optional: select the GPU
)

profiler.initialize()

# 2. Start profiling
profiler.start()

# --- Run your CUDA code / PyTorch model ---
# Warm-up is best completed before start()
output = model(input_data)
torch.cuda.synchronize()
# ---------------------------------------

# 3. Stop profiling and fetch the results
profiler.stop()
results = profiler.get_results()

# 4. Print the results
for kernel_name, metrics in results.items():
    print(f"Kernel: {kernel_name}")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value}")

profiler.finalize()
```
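
The printing loop in the example suggests that `get_results()` returns a mapping from kernel name to a dict of metric values. Assuming that shape, the sketch below ranks kernels by one metric. The helper `rank_kernels` and the sample numbers are hypothetical and are not part of the module.

```python
def rank_kernels(results, metric, top_n=5):
    """Return (kernel_name, value) pairs sorted by `metric`, largest first."""
    rows = [
        (name, metrics[metric])
        for name, metrics in results.items()
        if metric in metrics  # skip kernels that lack this metric
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


# Made-up numbers standing in for profiler.get_results()
sample_results = {
    "gemm_kernel": {"dram__bytes.sum": 4.0e6, "smsp__cycles_elapsed.avg": 12000.0},
    "relu_kernel": {"dram__bytes.sum": 8.0e6, "smsp__cycles_elapsed.avg": 3000.0},
    "bias_kernel": {"smsp__cycles_elapsed.avg": 500.0},
}

for name, value in rank_kernels(sample_results, "dram__bytes.sum"):
    print(f"{name}: {value:.0f} bytes")
```

Kernels that did not report the requested metric are skipped rather than treated as zero, so a missing value cannot be mistaken for a measured one.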

### 2. PC Sampling

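#### API Parameters

**`PCSamplingSession(device_id=0, sampling_period=-1)`**

*   **`device_id`** (int, optional): ID of the CUDA device to sample.
    *   Default: `0`
*   **`sampling_period`** (int, optional): PC sampling period (5-31), or -1 to use the default.
    *   Sampling period = 2^sampling_period clock cycles
    *   Smaller value = higher sampling frequency, higher overhead
    *   Larger value = lower sampling frequency, lower overhead
    *   Default: `-1` (use the CUPTI default)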
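
To make the relationship between `sampling_period` and the sampling interval concrete, here is a small sketch of the 2^sampling_period rule stated above. The helper `sampling_interval_cycles` is hypothetical and only performs the arithmetic; it does not call into the module.

```python
def sampling_interval_cycles(sampling_period):
    """Clock cycles between samples for an explicit sampling_period (5-31)."""
    if not 5 <= sampling_period <= 31:
        raise ValueError("sampling_period must be in 5..31 (or -1 for the CUPTI default)")
    return 2 ** sampling_period


for period in (5, 10, 20):
    cycles = sampling_interval_cycles(period)
    print(f"sampling_period={period}: one sample every {cycles:,} cycles")
```

Each step of 1 in `sampling_period` doubles the interval, so small changes to the parameter have a large effect on both sample density and overhead.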

**`get_hotspots(top_n=10)`**

*   **`top_n`** (int, optional): Return the top N instruction hotspots by sample count.
    *   Default: `10`

#### Usage Example

```python
import torch
import pc_sampling as pcs

# Stand-in workload for illustration; replace with your own model and input
model = torch.nn.Linear(1024, 1024).cuda()
input_data = torch.randn(64, 1024, device='cuda')

# 1. Initialize the sampler
sampler = pcs.PCSamplingSession(device_id=0)

sampler.initialize()

# 2. Start sampling
sampler.start()

# --- Run your CUDA code / PyTorch model ---
output = model(input_data)
torch.cuda.synchronize()
# ---------------------------------------

# 3. Stop sampling
sampler.stop()

# 4. Fetch the analysis results
# Get the top 20 hotspot instructions
hotspots = sampler.get_hotspots(top_n=20)
for hotspot in hotspots:
    print(f"PC: {hotspot['pc_offset']}, Function: {hotspot['function_name']}")
    print(f"  Total Samples: {hotspot['total_samples']}")
    print(f"  Stall Reasons: {hotspot['stall_reasons']}")

# Get overall statistics
stats = sampler.get_statistics()
print(f"Total Samples: {stats['total_samples']}")
```
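
Per-instruction output can be hard to read for large kernels, so it may help to sum stall reasons across all hotspots. The sketch below assumes that each `stall_reasons` entry is a dict mapping a reason name to a sample count; the README does not state this, so check the actual return type first. The reason names and counts are made up.

```python
from collections import Counter


def aggregate_stall_reasons(hotspots):
    """Sum per-reason sample counts across all hotspot entries."""
    totals = Counter()
    for hotspot in hotspots:
        totals.update(hotspot["stall_reasons"])  # adds counts key by key
    return totals


# Made-up entries shaped like the output of sampler.get_hotspots()
sample_hotspots = [
    {"pc_offset": 0x120, "function_name": "my_kernel", "total_samples": 70,
     "stall_reasons": {"long_scoreboard": 50, "math_pipe_throttle": 20}},
    {"pc_offset": 0x1A0, "function_name": "my_kernel", "total_samples": 30,
     "stall_reasons": {"long_scoreboard": 10, "barrier": 20}},
]

totals = aggregate_stall_reasons(sample_hotspots)
grand_total = sum(totals.values())
for reason, count in totals.most_common():
    print(f"{reason}: {count} samples ({100 * count / grand_total:.1f}%)")
```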

### 3. Activity Trace

#### API Parameters

**`ActivityTracer()`**

*   Takes no constructor arguments. **Note**: Only one `ActivityTracer` instance can be active at a time.

**`get_average_duration(kernel_pattern, count=-1)`**

*   **`kernel_pattern`** (str, **required**): Kernel name pattern to match (substring match).
    *   Example: `"gemm"` matches `"ampere_sgemm_128x64_nn"`
*   **`count`** (int, optional): Number of most recent matching kernels to average over.
    *   Default: `-1` (use all matching kernels)
    *   If `count > 0`, only the last `count` matching kernels are used

**Returns**: Average execution time in microseconds (us)
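
The matching and averaging rules above can be modeled in a few lines of plain Python. This is an illustrative reference model of the documented behavior, not the module's C++ implementation. Raising `RuntimeError` when nothing matches is an assumption, and the record values are made up.

```python
def average_duration(records, kernel_pattern, count=-1):
    """Average duration (us) of kernels whose name contains kernel_pattern."""
    matched = [duration for name, duration in records if kernel_pattern in name]
    if not matched:
        raise RuntimeError(f"no kernel matches {kernel_pattern!r}")
    if count > 0:
        matched = matched[-count:]  # keep only the last `count` matches
    return sum(matched) / len(matched)


# (kernel name, duration in us) pairs in launch order; values are made up
records = [
    ("ampere_sgemm_128x64_nn", 120.0),
    ("elementwise_kernel", 5.0),
    ("ampere_sgemm_128x64_nn", 100.0),
    ("ampere_sgemm_128x64_nn", 80.0),
]

print(average_duration(records, "gemm"))           # all three matches
print(average_duration(records, "gemm", count=2))  # last two matches only
```

Because matching is by substring, a short pattern such as `"gemm"` also matches any other kernel whose name happens to contain it, which can skew the average.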

#### Usage Example

```python
import torch
import activity_trace

# 1. Create the tracer
tracer = activity_trace.ActivityTracer()

# 2. Start tracing
tracer.start()

# --- Run your CUDA code / PyTorch model ---
# Warming up before start() is recommended to avoid first-run JIT overhead
a = torch.randn(2048, 2048, device='cuda')
b = torch.randn(2048, 2048, device='cuda')

for i in range(30):
    c = torch.mm(a, b)

torch.cuda.synchronize()
# ---------------------------------------

# 3. Stop tracing
tracer.stop()

# 4. Get the average time of matching kernels
# Option 1: average over all matching kernels
avg_us_all = tracer.get_average_duration("gemm")
print(f"Average duration (all): {avg_us_all:.2f} us")

# Option 2: average over the last 10 matches only
avg_us_10 = tracer.get_average_duration("gemm", count=10)
print(f"Average duration (last 10): {avg_us_10:.2f} us")

# 5. Inspect all captured kernel records
records = tracer.get_kernel_records()
for r in records[:5]:  # print the first 5
    print(f"{r.name}: {r.duration_us():.2f} us")

# 6. Clear the records (optional)
tracer.clear()
```

#### `bench_kineto`-style Usage

```python
import torch
import activity_trace


def bench_like_kineto(fn, kernel_pattern, num_tests=30, warmup=5):
    """Benchmark helper in the style of a Kineto-based `bench_kineto`."""
    tracer = activity_trace.ActivityTracer()

    # Warm up (outside the traced region)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    # Start tracing
    tracer.start()

    # Run the measured iterations
    for _ in range(num_tests):
        fn()
    torch.cuda.synchronize()

    # Stop and fetch the result
    tracer.stop()
    avg_us = tracer.get_average_duration(kernel_pattern, count=num_tests)

    return avg_us

# Usage example
a = torch.randn(2048, 2048, device='cuda')
b = torch.randn(2048, 2048, device='cuda')

def matmul_fn():
    return torch.mm(a, b)

avg_us = bench_like_kineto(matmul_fn, "gemm", num_tests=30, warmup=5)
print(f"Average kernel duration: {avg_us:.2f} us")
```
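
An average duration in microseconds is often easier to judge as achieved throughput. The sketch below converts a matmul duration into TFLOP/s using the standard count of 2·M·N·K floating-point operations. The helper `matmul_tflops` and the 1000 us input are hypothetical and are not measured values.

```python
def matmul_tflops(m, n, k, avg_us):
    """Achieved TFLOP/s for an (m x k) @ (k x n) matmul taking avg_us microseconds."""
    flops = 2 * m * n * k      # one multiply and one add per inner-product term
    seconds = avg_us * 1e-6
    return flops / seconds / 1e12


# The 2048 x 2048 x 2048 matmul from the example, with a made-up duration
print(f"{matmul_tflops(2048, 2048, 2048, 1000.0):.2f} TFLOP/s")
```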

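#### Autotune Scenario: High-Frequency Reuse

For autotune workloads that test a large number of configurations (thousands of configs per shape), reuse a single `ActivityTracer` instance to reduce overhead: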

```python
import activity_trace
import torch

# Create once and reuse many times (avoids repeated create/destroy overhead)
tracer = activity_trace.ActivityTracer()

# Suppose there are 1000+ configurations to test.
# generate_tile_configs, run_kernel_with_config and input_data are
# placeholders for your own code.
configs = generate_tile_configs()  # (tile_size, num_stages, num_warps, ...)
results = {}

for config in configs:
    # start() automatically clears previous records
    tracer.start()

    # Run the kernel with the current configuration
    output = run_kernel_with_config(
        input_data,
        tile_m=config.tile_m,
        tile_n=config.tile_n,
        num_stages=config.num_stages,
        num_warps=config.num_warps
    )
    torch.cuda.synchronize()

    tracer.stop()

    # Get the average time for this config (over the last 30 matches)
    try:
        avg_us = tracer.get_average_duration("my_kernel", count=30)
        results[config] = avg_us
    except RuntimeError:
        # No kernel matched: record the config as failed
        results[config] = float('inf')

# Destroy the tracer once all tests are done
del tracer

# Find the fastest configuration
best_config = min(results, key=results.get)
print(f"Best config: {best_config}, time: {results[best_config]:.2f} us")
```

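**Performance notes**:
- **Reuse the instance**: Avoid an `ActivityTracer()` → `del` cycle for every measurement; CUPTI callbacks are registered only once, at construction.
- **Buffer size**: Tuned to 1 MB, enough for 2500+ kernel records, which suits high-frequency testing.
- **Automatic clearing**: Each `start()` call clears previous records automatically, so no manual `clear()` is needed.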
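
When one tracer is reused across many measurements, a small context manager can guarantee that `stop()` always follows `start()`, even if the measured code raises. The sketch below is one way to do that. `traced` is a hypothetical helper, and `FakeTracer` stands in for `activity_trace.ActivityTracer` so the sketch runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def traced(tracer):
    """Run the with-block between tracer.start() and tracer.stop()."""
    tracer.start()
    try:
        yield tracer
    finally:
        tracer.stop()  # runs even if the block raises


class FakeTracer:
    """Stand-in for activity_trace.ActivityTracer; records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")


tracer = FakeTracer()
for _ in range(2):        # one tracer reused across iterations
    with traced(tracer):
        pass              # launch kernels and synchronize here
print(tracer.calls)
```

With the real tracer, call `torch.cuda.synchronize()` as the last statement inside the `with` block so that all kernels have finished before `stop()` runs.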


## Directory Structure

```
cupti_profiler/
├── init_all.sh                # Installs all modules at once
├── autorange_profiling/       # AutoRange Profiling module
│   ├── setup.py               # Package build script
│   ├── pyproject.toml         # Package metadata
│   ├── MANIFEST.in            # Package file manifest
│   ├── init.sh                # Install script
│   ├── Makefile               # Manual build script
│   ├── src/                   # C++ core implementation
│   └── test_profiler.py       # Test script
├── pc_sampling/               # PC Sampling module
│   ├── setup.py               # Package build script
│   ├── pyproject.toml         # Package metadata
│   ├── MANIFEST.in            # Package file manifest
│   ├── init.sh                # Install script
│   ├── Makefile               # Manual build script
│   ├── src/                   # C++ core implementation
│   └── test_sampler.py        # Test script
├── activity_trace/            # Activity Trace module
│   ├── setup.py               # Package build script
│   ├── pyproject.toml         # Package metadata
│   ├── MANIFEST.in            # Package file manifest
│   ├── init.sh                # Install script
│   ├── Makefile               # Manual build script
│   ├── src/                   # C++ core implementation
│   └── test_activity_trace.py # Test script
└── README.md                  # This document
```

### Internal Configuration & Limitations

Several parameters are currently hard-coded in the C++ core and are not yet exposed through the Python API. Knowing these limits helps you plan your profiling strategy.

#### AutoRange Profiling
*   **Max Ranges Per Pass**: `100`
    *   At most 100 ranges (kernels) are collected per pass. If your model launches more than 100 kernels, the later ones may not be profiled.
*   **Max Launches Per Pass**: `100`
    *   At most 100 kernel launches are supported per pass.
*   **Replay Mode**: `CUPTI_KernelReplay`
    *   Kernel Replay mode is always used, so each kernel is executed multiple times to collect all requested metrics. Make sure your kernels are deterministic and free of side effects; otherwise replay may produce incorrect results.

#### PC Sampling
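*   **Sampling Period**: configurable (5-31)
    *   Set through the constructor's `sampling_period` parameter; the default of -1 uses CUPTI's default sampling frequency.
    *   Sampling period = 2^sampling_period clock cycles.
*   **Max PCs to Collect**: `1000`
    *   The buffer stores at most 1000 distinct PC (program counter) addresses. For very large kernels with scattered hotspots, data for some PCs may be lost.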

#### Activity Trace
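*   **Buffer Size**: `1 MB` (tuned)
    *   Holds roughly 2500+ kernel records (about 400 bytes each), which suits high-frequency scenarios such as autotuning.
    *   If kernels are launched so frequently that the buffer overflows, a warning of the form "Dropped N activity records" is printed.
*   **Activity Type**: `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL`
    *   Only kernel execution is traced; memory copies, API calls, and other activities are not.
*   **Thread Safety**: single-instance limit
    *   Only one `ActivityTracer` instance may be active at a time, because the callbacks are static.
    *   **Recommended practice**: create one instance and reuse it across `start()` → `stop()` cycles instead of creating and destroying instances repeatedly.
*   **Callback Registration**: one-time registration
    *   CUPTI callbacks are registered once in the constructor and cleaned up automatically in the destructor, which avoids repeated registration overhead inside `start()`/`stop()` loops.
*   **Kernel Name Matching**: substring match
    *   Matching uses `std::string::find()`; regular expressions are not supported. Use short keywords (such as "gemm", "sgemm", or "ampere") to match the target kernel.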
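
The buffer figures above allow a quick estimate of whether a planned run fits without dropped records. The sketch below uses the 1 MB buffer size and the approximate 400-byte record size from the list. It assumes that every record captured between `start()` and `stop()` must fit in that one buffer, which the README does not confirm. Both helpers are hypothetical.

```python
BUFFER_BYTES = 1 * 1024 * 1024   # 1 MB buffer, as stated above
BYTES_PER_RECORD = 400           # approximate record size, as stated above


def max_records(buffer_bytes=BUFFER_BYTES, record_bytes=BYTES_PER_RECORD):
    """Upper bound on the number of kernel records the buffer can hold."""
    return buffer_bytes // record_bytes


def fits_in_buffer(kernels_per_iteration, iterations):
    """True if the planned number of kernel launches stays within the bound."""
    return kernels_per_iteration * iterations <= max_records()


print(max_records())
print(fits_in_buffer(kernels_per_iteration=50, iterations=30))
print(fits_in_buffer(kernels_per_iteration=50, iterations=100))
```

If the estimate does not fit, splitting the run into several `start()` → `stop()` cycles keeps each traced region smaller.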

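## FAQ

1.  **`ImportError: No module named ...`**:
    - Make sure `./init.sh` or `./init_all.sh` ran successfully and installed the module.
    - Check the install status: `pip list | grep -E "activity_trace|autorange_profiling|pc_sampling"`
    - If installation failed, try installing manually: `cd <module_dir> && pip install -e . --no-build-isolation`

2.  **`CUPTI_ERROR_NOT_INITIALIZED` or permission errors**:
    - Some CUPTI features require root privileges or specific system configuration.
    - Try running with `sudo`, or consult the "CUPTI Permissions" section of the NVIDIA documentation.
    - Make sure no other profiler (such as Nsight Systems or Nsight Compute) is running at the same time.

3.  **Build error `undefined symbol: NVPW_RawMetricsConfig_AddMetrics`**:
    - This is caused by missing link libraries in the autorange_profiling module.
    - Make sure you are using the latest setup.py, which already includes `-lnvperf_host -lnvperf_target`.
    - Reinstall: `cd autorange_profiling && ./init.sh`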
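
For the first item, the same installation check can be run from Python without importing the modules, which also confirms that the interpreter in use is the one they were installed into. The sketch below uses the standard-library `importlib.util.find_spec`; the helper `installed_modules` is hypothetical.

```python
import importlib.util

MODULES = ("activity_trace", "autorange_profiling", "pc_sampling")


def installed_modules(names=MODULES):
    """Map each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, ok in installed_modules().items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```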