# train_pipeline_template

**Repository Path**: LXP-Never/train_pipeline_template

## Basic Information

- **Project Name**: train_pipeline_template
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-22
- **Last Updated**: 2026-01-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

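# PyTorch Distributed Training Pipeline Template

A general-purpose PyTorch distributed training pipeline template. It provides a complete training framework that works out of the box.

## Features

- ✅ **Distributed training** - multi-GPU parallelism with PyTorch DDP
- ✅ **Mixed precision (AMP)** - enabled by default; up to 3-4x faster training on GPUs with Tensor Cores
- ✅ **Smart code backup** - automatically generates a Git patch and honors `.gitignore` rules
- ✅ **Complete workflow** - training/validation/testing with automatic checkpoint saving
- ✅ **Rich monitoring** - TensorBoard plus gradient, weight, and system monitoring
- ✅ **Modular design** - dataset, model, and loss function are independent and easy to customize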

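## Quick Start (5 minutes)

### 1. Install dependencies

```bash
# Create the environment
conda create -n train_env python=3.9
conda activate train_env

# Install the remaining dependencies
pip install -r requirements.txt
```

### 2. Prepare the data

**Option A: Test with dummy data**

Run: `python create_dummy_data.py`

**Option B: Use real data**

Update the data paths in the config file: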

```yaml
# configs/config.yaml
dataset:
  train:
    data_path: "/path/to/your/train/data"
  val:
    data_path: "/path/to/your/val/data"
```

### 3. Customize your code

#### 3.1 Dataset (`dataset/dataset_template.py`)

```python
def __getitem__(self, idx):
    file_path = self.data_list[idx]

    # TODO: implement your own data loading
    data = np.load(file_path)  # or your_load_function(file_path)
    data = data.flatten()

    # TODO: implement label lookup
    label = self._get_label(file_path)

    return data, label
```
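
To make the `__getitem__` template above concrete, here is a minimal, self-contained sketch of a complete dataset. The class name `NpyDataset`, the one-`.npy`-file-per-sample layout, and the `class<label>_<index>.npy` naming scheme used by `_get_label` are all assumptions for illustration; the actual `BaseDataset` in this repository may be organized differently.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class NpyDataset(Dataset):
    """Hypothetical dataset: one .npy file per sample, label encoded in the file name."""

    def __init__(self, data_path):
        self.data_list = sorted(
            os.path.join(data_path, name)
            for name in os.listdir(data_path)
            if name.endswith(".npy")
        )

    def __len__(self):
        return len(self.data_list)

    def _get_label(self, file_path):
        # Assumed naming scheme: "class<label>_<index>.npy", e.g. "class1_0003.npy".
        stem = os.path.basename(file_path)
        return int(stem.split("_")[0][len("class"):])

    def __getitem__(self, idx):
        file_path = self.data_list[idx]
        # Cast to float32 so the sample matches the default dtype of model weights.
        data = np.load(file_path).astype(np.float32).flatten()
        label = self._get_label(file_path)
        return torch.from_numpy(data), label


# Write a few dummy samples to a temporary directory and read one back.
tmp_dir = tempfile.mkdtemp()
for i in range(4):
    np.save(os.path.join(tmp_dir, f"class{i % 2}_{i:04d}.npy"), np.random.randn(8, 16))

dataset = NpyDataset(tmp_dir)
sample, label = dataset[0]
print(len(dataset), tuple(sample.shape), label)
```

Each 8x16 array is flattened to 128 values, which matches the `input_dim: 128` used in the example configuration.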

#### 3.2 Model (`models/model_template.py`)

Use one of the provided templates (MLP/CNN/RNN/Transformer) or define your own:

```python
import torch.nn as nn
import torch.nn.functional as F


class YourModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```

### 4. Run training

**Single GPU:**
```bash
CUDA_VISIBLE_DEVICES=0 python train_ddp.py \
    --train_tag my_experiment \
    --batch_size 64 \
    --epochs 100
```

**Multi-GPU (4 GPUs):**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 train_ddp.py \
    --train_tag my_experiment \
    --batch_size 256 \
    --epochs 100
```

**Using the script:**
```bash
bash scripts/train.sh
```

### 5. View the results

```bash
# Follow the training log
tail -f train_log/YourModel/my_experiment/train.log

# TensorBoard
tensorboard --logdir=train_log/YourModel/my_experiment/tensorboard --port=6006
```

## Project structure

```
train_pipeline_template/
├── train_ddp.py              # Main training script (AMP and gradient clipping built in)
├── configs/
│   └── config.yaml           # General config (supports GPUs such as the RTX 4090 and H100)
├── dataset/
│   └── dataset_template.py   # Dataset templates (BaseDataset/Train/Val/Test)
├── models/
│   └── model_template.py     # Model templates (MLP/CNN/RNN/Transformer)
├── losses/
│   └── loss_template.py      # Loss functions (10+ common losses)
├── utils/
│   ├── base_trainer.py       # Base trainer (logging/TensorBoard/CodeSaver/monitoring)
│   └── train_utils.py        # Training utilities (learning rate/early stopping/data handling)
└── scripts/
    ├── train.sh              # Training launch script
    └── resume_train.sh       # Resume-training script
```

## Configuration file

### Configuration file structure

```yaml
# Dataset paths
dataset:
  train:
    data_path: "/path/to/train"
  val:
    data_path: "/path/to/val"

# Model parameters
model:
  input_dim: 128
  hidden_dim: 256
  output_dim: 10

# Optimizer
optim_conf:
  lr: 0.001
  weight_decay: 0.0001

# Learning-rate scheduler
lr_scheduler_conf:
  mode: 'max'
  factor: 0.5
  patience: 3

# Training settings
training:
  batch_size: 256     # adjust to your GPU (4090: 256-512, H100: 512-2048)
  epochs: 100
  num_workers: 8
  early_stopping:
    enable: true
    patience: 10

# Mixed-precision training (enabled by default)
amp:
  enable: true
  dtype: 'float16'    # or 'bfloat16' (Ampere and newer GPUs, including the H100)

# torch.compile optimization (PyTorch 2.0+)
compile:
  enable: true
  mode: 'max-autotune'

# GPU optimizations
gpu_optimizations:
  allow_tf32: true              # supported on A100/H100
  cudnn_benchmark: true
  float32_matmul_precision: 'high'
```
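
The keys under `lr_scheduler_conf` (`mode`, `factor`, `patience`) match the parameters of `torch.optim.lr_scheduler.ReduceLROnPlateau`, so the sketch below assumes that is the scheduler they configure. The choice of Adam as the optimizer and the accuracy values are illustrative only; the repository's own wiring is not shown here.

```python
import torch

# Hypothetical config fragments, copied from the YAML structure above.
conf = {
    "optim_conf": {"lr": 0.001, "weight_decay": 0.0001},
    "lr_scheduler_conf": {"mode": "max", "factor": 0.5, "patience": 3},
}

model = torch.nn.Linear(128, 10)
# Assumption: Adam is used; the optimizer type is not specified in the config.
optimizer = torch.optim.Adam(model.parameters(), **conf["optim_conf"])
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, **conf["lr_scheduler_conf"]
)

# Illustrative validation accuracies: improvement stops after the second epoch.
val_acc_history = [0.50, 0.60, 0.60, 0.60, 0.60, 0.60]
for epoch, val_acc in enumerate(val_acc_history):
    scheduler.step(val_acc)  # mode='max': a higher metric counts as an improvement
    current_lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {epoch}: val_acc={val_acc:.2f}, lr={current_lr}")
```

With `mode: 'max'`, the learning rate is multiplied by `factor` once the monitored metric has failed to improve for more than `patience` consecutive epochs.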

## Command-line arguments

| Parameter | Description | Default |
|------|------|--------|
| `--train_tag` | Training tag to distinguish experiments | experiment_v1 |
| `--model_name` | Checkpoint to resume training | None |
| `--batch_size` | Total batch size | 256 |
| `--epochs` | Number of epochs | 100 |
| `--config_path` | Path to config file | ./configs/config.yaml |
| `--log_path` | Path to save logs | ./train_log |
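
Because `--batch_size` is described as the total batch size, each GPU process presumably receives a share of it. The helper below is a sketch of that arithmetic under the assumption of an even split; `per_gpu_batch_size` is a hypothetical name, and how `train_ddp.py` actually divides the batch is not shown in this README.

```python
def per_gpu_batch_size(total_batch_size: int, world_size: int) -> int:
    """Split a total batch size evenly across `world_size` processes (one per GPU)."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if total_batch_size % world_size != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by {world_size} GPUs"
        )
    return total_batch_size // world_size


# The 4-GPU launch command above uses --batch_size 256.
print(per_gpu_batch_size(256, 4))
# The single-GPU command uses --batch_size 64.
print(per_gpu_batch_size(64, 1))
```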

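## Core features

### 1. Smart code backup (CodeSaver)

Every training run automatically backs up the code changes as a standard Git patch.

**Features:**
- ✅ Standard Git unified diff format that can be applied with `git apply`
- ✅ Automatically honors `.gitignore` rules
- ✅ Saves experiment metadata (Git status, config snapshot, experiment info)
- ✅ Maintains an experiment index (the 100 most recent experiments)

**File filtering:**

1. **`.gitignore` rules** (highest priority)
   ```gitignore
   # Python
   __pycache__/
   *.pyc

   # Training artifacts
   train_log/
   checkpoints/
   *.pth
   ```

2. **Whitelist** - only files of the following types are backed up:
   - `.py`, `.sh`
   - `.yaml`, `.yml`, `.json`, `.toml`
   - `.txt`, `.md`, `.ini`, `.cfg`

3. **Always ignored** - `.latest_run_cache/` (the CodeSaver cache)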
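
The three filtering rules above can be sketched as a single predicate. This is an illustration, not CodeSaver's implementation: `should_backup` is a hypothetical name, and the `.gitignore` handling is deliberately simplified to directory patterns and file-name globs (real `.gitignore` semantics also include negation and anchoring).

```python
import fnmatch
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".py", ".sh", ".yaml", ".yml", ".json", ".toml",
                    ".txt", ".md", ".ini", ".cfg"}
FORCE_IGNORED_DIRS = {".latest_run_cache"}


def should_backup(rel_path, ignore_patterns):
    """Return True if `rel_path` (relative, POSIX-style) passes all three filters."""
    path = PurePosixPath(rel_path)
    parts = path.parts

    # Rule 3: anything under the CodeSaver cache directory is always ignored.
    if any(part in FORCE_IGNORED_DIRS for part in parts):
        return False

    # Rule 1: simplified .gitignore matching.
    for pattern in ignore_patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore files located under a directory of that name.
            if pattern.rstrip("/") in parts[:-1]:
                return False
        elif fnmatch.fnmatch(path.name, pattern):
            # File-name glob such as "*.pyc".
            return False

    # Rule 2: whitelist of file extensions.
    return path.suffix in ALLOWED_SUFFIXES


patterns = ["__pycache__/", "*.pyc", "train_log/", "checkpoints/", "*.pth"]
candidates = [
    "train_ddp.py",
    "configs/config.yaml",
    "train_log/run/train.log",
    "checkpoints/10.pth",
    "models/__pycache__/model.cpython-39.pyc",
    ".latest_run_cache/train_ddp.py",
    "figures/loss.png",
]
for candidate in candidates:
    print(candidate, should_backup(candidate, patterns))
```

Only `train_ddp.py` and `configs/config.yaml` pass; the rest are rejected by an ignore pattern, the forced-ignore rule, or the extension whitelist.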

**Generated files:**
```
code_patches/
├── diff_20260120_175100.patch      # Git patch
├── metadata_20260120_175100.json   # Experiment metadata
└── experiments_index.json          # Experiment index
```
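
As an illustration of how an index limited to the 100 most recent experiments could be maintained, the sketch below appends an entry to a JSON file and truncates the list. The function name `update_index` and the entry fields (`run`, `patch`) are hypothetical; the real schema of `experiments_index.json` is not documented here.

```python
import json
import os
import tempfile

MAX_EXPERIMENTS = 100  # the index is described as holding the 100 most recent experiments


def update_index(index_path, entry, max_entries=MAX_EXPERIMENTS):
    """Append `entry` to the JSON index and keep only the newest `max_entries`."""
    experiments = []
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            experiments = json.load(f)
    experiments.append(entry)
    experiments = experiments[-max_entries:]  # drop the oldest entries
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(experiments, f, indent=2)
    return experiments


# Simulate 105 runs against a temporary index file.
index_path = os.path.join(tempfile.mkdtemp(), "experiments_index.json")
for run_id in range(105):
    experiments = update_index(index_path, {"run": run_id, "patch": f"diff_{run_id:03d}.patch"})

print(len(experiments), experiments[0]["run"], experiments[-1]["run"])
```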

**Viewing and applying a patch:**
```bash
# View the patch
cat train_log/YourModel/my_experiment/code_patches/diff_20260120_175100.patch

# Preview which files the patch touches (nothing is applied)
git apply --stat diff_20260120_175100.patch

# Apply the patch
git apply diff_20260120_175100.patch
```

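### 2. Mixed-precision training (AMP)

Enabled by default; it uses the GPU's Tensor Core hardware for acceleration.

**Why does mixed precision speed up training?**

1. **Tensor Core hardware acceleration**
   - FP32: runs on regular CUDA cores
   - FP16: runs on Tensor Cores (**up to 30x higher peak throughput**)
   - End-to-end training speedup is smaller, around 3-4x

2. **Memory bandwidth advantage**
   - FP16 values take half the memory of FP32 values
   - Half as many bytes to move, so CPU↔GPU transfers are about twice as fast
   - A larger batch size fits in memory

3. **Smart mixed precision**
   ```
   Forward pass:   FP16 (fast)
   Loss:           FP32 (accurate)
   Backward pass:  FP16 (fast)
   Weight update:  FP32 (accurate) ← "master weights"
   ```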
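
The reason the weight update stays in FP32 can be shown with a few lines of arithmetic. The sketch below uses only the standard library (`struct` format `e` is IEEE 754 half precision) to round values to FP16; the weight and update values are illustrative. Near 1.0, adjacent FP16 values are about 0.001 apart, so an update of 0.0001 is rounded away entirely.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weight = 1.0
update = 1e-4  # e.g. learning rate times gradient

# FP16 weights: every update is lost to rounding, so the weight never moves.
fp16_weight = to_fp16(weight)
for _ in range(100):
    fp16_weight = to_fp16(fp16_weight + to_fp16(update))

# Higher-precision master weight: the same 100 updates accumulate as expected.
# (Python floats are FP64 rather than FP32, but the effect is the same.)
master_weight = weight
for _ in range(100):
    master_weight += update

print("FP16 weight after 100 updates:  ", fp16_weight)
print("master weight after 100 updates:", round(master_weight, 6))
```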

**Performance gains:**

| GPU | Precision | Speed relative to FP32 |
|-----|-----------|------------------------|
| RTX 4090 | FP16 | 3-4x |
| H100 | FP16 | 3-4x |
| H100 | FP8 | 7-8x |

**H100 only: FP8 training**

```python
# The H100 supports FP8 (requires Transformer Engine)
# pip install git+https://github.com/NVIDIA/TransformerEngine.git

import transformer_engine.pytorch as te

# Only Transformer Engine modules inside the model (e.g. te.Linear) run in FP8
with te.fp8_autocast(enabled=True):
    outputs = model(inputs)
```

**Configuration:**
```yaml
amp:
  enable: true  # enabled by default
  dtype: 'float16'  # or 'bfloat16' (Ampere and newer GPUs, including the H100)
```
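
For reference, a single AMP training step in plain PyTorch looks roughly like the sketch below. It is not taken from `train_ddp.py`: the model, optimizer, and random batch are placeholders, and the fallback to `bfloat16` on CPU exists only so the example also runs without a GPU. Newer PyTorch releases spell the scaler as `torch.amp.GradScaler("cuda")`; the older `torch.cuda.amp.GradScaler` form is used here for compatibility.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
# float16 on GPU, as in the config; bfloat16 is the usual choice for CPU autocast.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
# Loss scaling matters for float16 on CUDA; when disabled the scaler is a pass-through.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

# Placeholder batch: 32 samples with 128 features and 10 classes.
inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, dtype=amp_dtype):
    outputs = model(inputs)             # the matmul runs in reduced precision
    loss = criterion(outputs, targets)  # loss computed inside the autocast region

scaler.scale(loss).backward()  # scale the loss to limit float16 gradient underflow
scaler.step(optimizer)         # unscale gradients; skip the step if they are inf/NaN
scaler.update()                # adjust the scale factor for the next iteration

print(f"loss={loss.item():.4f}, output dtype={outputs.dtype}, weight dtype={model.weight.dtype}")
```

Note that the model's parameters remain `float32` throughout; only the operations inside the `autocast` region run in reduced precision.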

### 3. Advanced monitoring

**Basic monitoring (recommended for everyday use):**
```python
metrics = {
    'train/loss': epoch_loss,
    'train/acc': epoch_acc,
    'train/lr': current_lr
}
self.log_training_metrics(metrics, epoch)
```

**Advanced monitoring (for debugging):**
```python
# Gradient monitoring (diagnose vanishing/exploding gradients)
self.log_model_gradients(self.model, step)

# Weight monitoring
self.log_model_weights(self.model, epoch)

# Confusion matrix
self.log_confusion_matrix(preds, targets, class_names, epoch)

# System monitoring (CPU/GPU/memory)
if epoch % 5 == 0:
    self.log_system_metrics(epoch)
```
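
The internals of `log_model_gradients` are not shown in this README. As a rough idea of what gradient monitoring involves, the sketch below collects the L2 norm of each parameter's gradient; `gradient_norms` and the `grad_norm/` tag prefix are hypothetical names. Very small norms suggest vanishing gradients and very large ones suggest exploding gradients.

```python
import torch
import torch.nn as nn


def gradient_norms(model: nn.Module) -> dict:
    """Return the L2 norm of each parameter's gradient, keyed by parameter name."""
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:  # parameters unused in the forward pass have no grad
            norms[f"grad_norm/{name}"] = param.grad.detach().norm(2).item()
    return norms


# Placeholder model and batch, followed by one backward pass.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
loss = model(torch.randn(4, 16)).sum()
loss.backward()

norms = gradient_norms(model)
for tag, value in norms.items():
    # With a TensorBoard SummaryWriter this would be: writer.add_scalar(tag, value, step)
    print(f"{tag}: {value:.6f}")
```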

**Optional dependencies (required for advanced monitoring):**
```bash
pip install psutil gputil matplotlib seaborn scikit-learn
```

### 4. Early stopping

```yaml
training:
  early_stopping:
    enable: true
    patience: 10     # stop after 10 epochs without improvement
    mode: 'max'      # 'max': a higher metric is better; 'min': a lower metric is better
```
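
The `patience` and `mode` settings can be read as the following logic. This is a minimal sketch, assuming a strict-improvement rule; the `EarlyStopping` class here is illustrative and is not necessarily the one in `utils/train_utils.py`.

```python
class EarlyStopping:
    """Signal a stop once the metric has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=10, mode="max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.patience = patience
        self.mode = mode
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True if training should stop."""
        improved = (
            self.best is None
            or (self.mode == "max" and metric > self.best)
            or (self.mode == "min" and metric < self.best)
        )
        if improved:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative accuracy history with a short patience so the stop is visible.
stopper = EarlyStopping(patience=3, mode="max")
history = [0.70, 0.75, 0.74, 0.75, 0.73, 0.72]
for epoch, acc in enumerate(history):
    if stopper.step(acc):
        print(f"stopping at epoch {epoch}, best={stopper.best}")
        break
```

In this run the best value (0.75) is reached at epoch 1; epochs 2, 3, and 4 bring no strict improvement, so training stops at epoch 4.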

## Common tasks

### Resuming training from a checkpoint

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 train_ddp.py \
    --train_tag my_experiment \
    --model_name 10.pth \
    --batch_size 256 \
    --epochs 100
```

Or use the script:
```bash
bash scripts/resume_train.sh
```

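### Adding data augmentation

In the `_augment()` method of `dataset/dataset_template.py`:

```python
def _augment(self, data):
    # Work on a copy so the caller's array is never modified in place
    data = data.copy()

    # Add Gaussian noise (cast so the input dtype, e.g. float32, is preserved)
    if np.random.random() < 0.5:
        noise = np.random.randn(*data.shape) * 0.01
        data = data + noise.astype(data.dtype)

    # Random masking: zero out a contiguous span covering 10% of the data
    if np.random.random() < 0.3:
        mask_size = int(len(data) * 0.1)
        # +1 because randint's upper bound is exclusive; also avoids an empty range
        mask_start = np.random.randint(0, len(data) - mask_size + 1)
        data[mask_start:mask_start+mask_size] = 0

    return data
```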

### Using a different loss function

The template provides 10+ loss functions; choose one in `train_ddp.py`:

```python
import torch.nn as nn

# Focal Loss (handles class imbalance)
from losses import FocalLoss
self.criterion = FocalLoss(alpha=1, gamma=2)

# Triplet Loss (metric learning)
from losses import TripletLoss
self.criterion = TripletLoss(margin=1.0)

# Combined loss
from losses import CombinedLoss
self.criterion = CombinedLoss([
    {'loss': nn.CrossEntropyLoss(), 'weight': 1.0},
    {'loss': FocalLoss(), 'weight': 0.5}
])
```

Available loss functions:
- **Classification**: CrossEntropyLoss, FocalLoss, LabelSmoothingLoss
- **Metric learning**: TripletLoss, ContrastiveLoss
- **Regression**: SmoothL1Loss, HuberLoss
- **Segmentation**: DiceLoss
- **Combined**: CombinedLoss
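
For readers unfamiliar with focal loss, the sketch below shows the standard multi-class formulation, `alpha * (1 - p_t)^gamma * CE`, which down-weights examples the model already classifies confidently. The class name `FocalLossSketch` is hypothetical, and the `FocalLoss` in `losses/loss_template.py` may differ in detail (for example, per-class `alpha` or a different reduction).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLossSketch(nn.Module):
    """Multi-class focal loss: alpha * (1 - p_t)^gamma * CE, averaged over the batch."""

    def __init__(self, alpha=1.0, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t) per sample
        p_t = torch.exp(-ce)                                     # probability of the true class
        return (self.alpha * (1.0 - p_t) ** self.gamma * ce).mean()


# One confidently correct sample and one uncertain sample, both with true class 0.
logits = torch.tensor([[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]])
targets = torch.tensor([0, 0])

ce_loss = F.cross_entropy(logits, targets)
focal_loss = FocalLossSketch(alpha=1.0, gamma=2.0)(logits, targets)
print(f"cross entropy: {ce_loss.item():.4f}")
print(f"focal loss:    {focal_loss.item():.4f}")
```

Setting `gamma=0` removes the modulating factor, so the result reduces to ordinary cross entropy scaled by `alpha`.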

### Adding validation metrics

Add more metrics in `val_epoch()`:

```python
import torch
from sklearn.metrics import f1_score, precision_score, recall_score

@torch.no_grad()
def val_epoch(self, epoch):
    all_preds = []
    all_targets = []

    for batch_data in self.val_dataloader:
        # Assumes each batch is an (inputs, targets) pair, as the dataset template returns
        inputs, targets = batch_data
        inputs = inputs.cuda(self.local_rank, non_blocking=True)

        # Forward pass
        outputs = self.model(inputs)
        pred = outputs.argmax(dim=1)

        all_preds.append(pred.cpu())
        all_targets.append(targets.cpu())

    # Gather the results from all GPUs
    all_preds = self.merge_list(all_preds)
    all_targets = self.merge_list(all_targets)

    # Compute the metrics
    if self.local_rank == 0:
        f1 = f1_score(all_targets, all_preds, average='macro')
        precision = precision_score(all_targets, all_preds, average='macro')
        recall = recall_score(all_targets, all_preds, average='macro')

        self.logger.info(f"F1: {f1:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")
```

## FAQ

### Q: What should I do about out-of-memory (OOM) errors?

1. Reduce `batch_size`
2. Use gradient accumulation:
   ```python
   accumulation_steps = 4
   loss = loss / accumulation_steps
   loss.backward()
   if (batch_idx + 1) % accumulation_steps == 0:
       optimizer.step()
       optimizer.zero_grad()
   ```
3. Use mixed-precision training (enabled by default)
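
The gradient-accumulation fragment above divides each loss by `accumulation_steps` so that the summed gradients match those of one large batch. The sketch below checks that equivalence on a toy model; the model, data, and sizes are placeholders, and the equality relies on a mean-reduced loss with equally sized micro-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
criterion = nn.MSELoss()  # mean reduction, which is what makes the division correct
inputs, targets = torch.randn(16, 8), torch.randn(16, 1)

# Reference: gradient of one full batch of 16 samples.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 4 samples, each loss divided by the number of steps.
accumulation_steps = 4
model.zero_grad()
for micro_inputs, micro_targets in zip(inputs.chunk(accumulation_steps),
                                       targets.chunk(accumulation_steps)):
    loss = criterion(model(micro_inputs), micro_targets) / accumulation_steps
    loss.backward()  # gradients add up in .grad across backward calls
accumulated_grad = model.weight.grad.clone()

# An optimizer step would happen here, once per group of micro-batches.
print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Only one micro-batch of activations is held in memory at a time, which is why this lowers peak memory while keeping the effective batch size.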

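### Q: Does the H100 need any special configuration?

No. The training script already integrates AMP and the GPU optimizations, so the H100 uses its Tensor Cores automatically. You only need to:
- Use a larger `batch_size` to take full advantage of the 80 GB of GPU memory
- Make sure you are on PyTorch 2.0+ for the best performance
- Install Transformer Engine if you need FP8 training (Transformer models only)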
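
The `gpu_optimizations` keys in the config correspond naturally to a handful of PyTorch global settings. The sketch below shows one plausible mapping; `apply_gpu_optimizations` is a hypothetical helper, and the training script may apply these settings differently. The calls are safe on a machine without a GPU, where they simply have no effect.

```python
import torch

# Mirrors the `gpu_optimizations` section of the config file.
gpu_optimizations = {
    "allow_tf32": True,
    "cudnn_benchmark": True,
    "float32_matmul_precision": "high",
}


def apply_gpu_optimizations(conf: dict) -> None:
    """Apply TF32, cuDNN autotuning, and matmul precision settings from a config dict."""
    allow_tf32 = bool(conf.get("allow_tf32", False))
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32  # TF32 for matrix multiplications
    torch.backends.cudnn.allow_tf32 = allow_tf32        # TF32 for cuDNN convolutions
    # Let cuDNN benchmark convolution algorithms; helps most when input shapes are fixed.
    torch.backends.cudnn.benchmark = bool(conf.get("cudnn_benchmark", False))
    torch.set_float32_matmul_precision(conf.get("float32_matmul_precision", "highest"))


apply_gpu_optimizations(gpu_optimizations)
print(torch.backends.cudnn.benchmark, torch.get_float32_matmul_precision())
```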

### Utilities provided by BaseTrainer

```python
# Distributed utilities
self.reduce_mean(tensor)     # average across all GPUs
self.reduce_sum(tensor)      # sum across all GPUs
self.merge_list(tensor_list) # gather tensors from all GPUs

# Logging utilities
self.logger.info(message)    # text log
self.writer.add_scalar()     # TensorBoard scalar
self.log_training_metrics()  # unified metric logging

# Monitoring utilities
self.log_model_gradients()   # gradient monitoring
self.log_model_weights()     # weight monitoring
self.log_confusion_matrix()  # confusion matrix
self.log_system_metrics()    # system resource monitoring
```
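
A helper like `reduce_mean` is typically built on `torch.distributed.all_reduce`. The standalone function below is a sketch of that pattern, not the `BaseTrainer` method itself; the fallback for the non-distributed case is an added assumption so the example also runs in a single process.

```python
import torch
import torch.distributed as dist


def reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
    """Average `tensor` across all processes; return it unchanged when not distributed."""
    if not (dist.is_available() and dist.is_initialized()):
        return tensor
    reduced = tensor.clone()                        # leave the caller's tensor untouched
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)  # every rank receives the global sum
    return reduced / dist.get_world_size()


# Outside torchrun no process group exists, so the value passes through unchanged.
loss = torch.tensor(0.25)
print(reduce_mean(loss))
```

Under `torchrun`, each rank would pass in its local loss and get back the same global average, which keeps logged metrics consistent across GPUs.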

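## Use cases

This template is suitable for:
- ✅ Image classification/detection/segmentation
- ✅ Speech recognition/classification
- ✅ Text classification/NLP tasks
- ✅ Time-series analysis
- ✅ Any deep learning task that needs distributed training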

## References

- [PyTorch official documentation](https://pytorch.org/docs/stable/index.html)
- [PyTorch DDP tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
- [TensorBoard documentation](https://www.tensorflow.org/tensorboard)
- [Mixed Precision Training](https://pytorch.org/docs/stable/amp.html)

## Contributing

Suggestions for improvement and bug reports are welcome!

---

**Happy training!** 🚀