# gpu-up

**Repository Path**: aixiaozao/gpu-up

## Basic Information

- **Project Name**: gpu-up
- **Description**: Dynamically checks GPU utilization and, when it is too low, launches a script to drive GPU utilization up
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-03
- **Last Updated**: 2025-09-03

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

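# GPU Monitor Server - GPU Monitoring System for Linux Servers

## Overview

This is a GPU utilization monitoring system designed for Linux servers. It provides:

- **Real-time monitoring** of GPU usage
- **Automatic high-load tasks** started when GPU utilization drops too low
- **Multi-GPU support** for monitoring several GPU devices at once
- **Background operation** as a systemd service
- **Temperature protection** to keep GPUs from overheating
- **Detailed logging** with complete run logs and statistics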

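## Quick Install (Recommended)

### Automated install script

```bash
# Install Python dependencies
conda install numpy matplotlib psutil -y
python3 -m pip install -r requirements.txt
python3 -m pip install torch torchvision --no-cache-dir

# 1. Make the install script executable (run from the directory that contains it)
chmod +x install_gpu_monitor.sh

# 2. Run the install script with root privileges
sudo ./install_gpu_monitor.sh
```

The install script automatically:
- Checks for the NVIDIA driver and GPUs
- Installs Python dependencies
- Creates the system user and directories
- Installs and starts the systemd service
- Creates the management command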

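### Management commands

After installation, manage the service with the following commands:

```bash
# Show service status
gpu-monitor status

# Follow the live log
gpu-monitor logs

# Show GPU statistics
gpu-monitor stats

# Edit the configuration file
gpu-monitor config

# Restart the service
gpu-monitor restart
```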

## Manual Install

### 1. Requirements

- Linux operating system
- NVIDIA GPU and driver
- Python 3.7+
- The `nvidia-smi` command available on `PATH`
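
The requirements above can be checked before installing. The following is a minimal sketch, not part of the project: `check_environment` is a hypothetical helper that only verifies the Python version and that the tool is on `PATH`. It does not confirm that the driver actually works.

```python
import shutil
import sys


def check_environment(tool="nvidia-smi", min_python=(3, 7)):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # sys.version_info compares element-wise against a (major, minor) tuple.
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which(tool) is None:
        problems.append(f"{tool} was not found on PATH")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If the list is empty, running `nvidia-smi` by hand is still worthwhile to confirm that the driver responds.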

### 2. Install dependencies

```bash
# Install the base dependencies
pip3 install nvidia-ml-py3 psutil

# Optional: install a GPU compute library for better workload support
pip3 install torch torchvision  # PyTorch
# or
pip3 install tensorflow         # TensorFlow
# or
pip3 install cupy-cuda11x       # CuPy (CUDA 11.x)
```

### 3. Run the program

```bash
# Run in the foreground (for testing)
python3 gpu_monitor_server.py

# Run in the background
python3 gpu_monitor_server.py --daemon

# Use a custom configuration file
python3 gpu_monitor_server.py --config /path/to/config.json
```

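## Configuration

Main parameters in `config.json`. The `//` comments below are annotations for this README; standard JSON does not allow comments, so leave them out of the actual file:

```json
{
    "monitoring_interval": 5,        // Monitoring interval (seconds)
    "low_usage_threshold": 20,       // Low-utilization threshold (%)
    "high_usage_threshold": 80,      // High-utilization threshold (%)
    "consecutive_low_readings": 3,   // Consecutive low readings required to trigger
    "workload_duration": 60,         // Workload duration (seconds)
    "max_temperature": 85,           // Maximum allowed temperature (°C)
    "min_workload_interval": 300,    // Minimum interval between workloads (seconds)
    "workload_backend": "auto",      // Workload backend (auto/pytorch/tensorflow/cupy)
    "log_file": "/var/log/gpu_monitor.log",
    "enable_logging": true
}
```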
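
As an illustration of how such a file might be consumed, here is a minimal sketch that overlays user settings on the defaults listed above and rejects inconsistent thresholds. `merge_config` and `load_config` are hypothetical names; how the real `gpu_monitor_server.py` loads its configuration is not documented here.

```python
import json
from pathlib import Path

# Defaults copied from the example config.json in this README.
DEFAULTS = {
    "monitoring_interval": 5,
    "low_usage_threshold": 20,
    "high_usage_threshold": 80,
    "consecutive_low_readings": 3,
    "workload_duration": 60,
    "max_temperature": 85,
    "min_workload_interval": 300,
    "workload_backend": "auto",
    "log_file": "/var/log/gpu_monitor.log",
    "enable_logging": True,
}


def merge_config(text):
    """Overlay JSON text on the defaults and sanity-check the thresholds."""
    config = dict(DEFAULTS)
    # json.loads raises json.JSONDecodeError on malformed input, including // comments.
    config.update(json.loads(text))
    low, high = config["low_usage_threshold"], config["high_usage_threshold"]
    if not 0 <= low < high <= 100:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 100")
    return config


def load_config(path):
    """Read a config file; fall back to the defaults if the file does not exist."""
    file = Path(path)
    if not file.exists():
        return dict(DEFAULTS)
    return merge_config(file.read_text(encoding="utf-8"))
```

Merging onto defaults means a file that sets only one or two keys, like the scenario examples later in this README, still yields a complete configuration.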

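### Key parameters

- **low_usage_threshold**: A workload is triggered when GPU utilization falls below this value
- **consecutive_low_readings**: Number of consecutive low-utilization readings required before a workload starts
- **max_temperature**: The workload is stopped when the GPU temperature exceeds this value
- **min_workload_interval**: Minimum time between two workload starts, which prevents frequent restarts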
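
The interaction between `low_usage_threshold`, `consecutive_low_readings`, and `min_workload_interval` can be sketched as a small state machine. This is an illustrative reading of the parameter descriptions above, not the project's actual code; the class name `LowUsageTrigger` and the choice to reset the streak after each start are assumptions.

```python
import time


class LowUsageTrigger:
    """Decides when to start a workload from a stream of utilization readings."""

    def __init__(self, low_threshold=20, consecutive_needed=3, min_interval=300):
        self.low_threshold = low_threshold
        self.consecutive_needed = consecutive_needed
        self.min_interval = min_interval
        self.low_count = 0
        self.last_start = None  # timestamp of the most recent workload start

    def update(self, utilization, now=None):
        """Record one reading; return True if a workload should start now."""
        now = time.monotonic() if now is None else now
        if utilization < self.low_threshold:
            self.low_count += 1
        else:
            self.low_count = 0  # any busy reading resets the streak
        if self.low_count < self.consecutive_needed:
            return False
        if self.last_start is not None and now - self.last_start < self.min_interval:
            return False  # still inside the minimum interval between workloads
        self.last_start = now
        self.low_count = 0  # assumption: start counting afresh after each trigger
        return True
```

With the default values and one reading every 5 seconds, the first workload would start on the third consecutive low reading, and no further workload would start until 300 seconds after that.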

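## Usage Scenarios

### 1. Basic monitoring

The default configuration suits most scenarios:
- A workload starts when GPU utilization stays below 20% for 3 consecutive readings
- The workload stops when GPU utilization rises above 80%
- The workload stops when the GPU temperature exceeds 85°C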
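
The stop rules above can be written as a single check. The sketch below is an assumption about how they combine: `should_stop_workload` is a hypothetical function, the `workload_duration` limit is included because the configuration defines it, and the README does not say how the service separates load from its own workload from load caused by other processes.

```python
def should_stop_workload(utilization, temperature, elapsed,
                         high_threshold=80, max_temperature=85, duration=60):
    """Return the reason to stop the workload, or None to keep it running."""
    if temperature > max_temperature:
        return "temperature"  # thermal protection takes priority
    if utilization > high_threshold:
        return "high_usage"   # the GPU is busy enough without extra load
    if elapsed >= duration:
        return "duration"     # the workload has run for its configured time
    return None


# Example: 90°C exceeds the 85°C limit regardless of utilization.
print(should_stop_workload(utilization=50, temperature=90, elapsed=10))
```

Checking temperature first means an overheating GPU is always reported as the stop reason, even when other conditions also hold.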

### 2. High-performance computing environments

For environments that need higher GPU utilization:

```json
{
    "low_usage_threshold": 10,
    "high_usage_threshold": 90,
    "consecutive_low_readings": 2,
    "workload_duration": 120,
    "max_temperature": 80
}
```

### 3. Power-saving mode

For environments that need to save power:

```json
{
    "low_usage_threshold": 30,
    "high_usage_threshold": 70,
    "consecutive_low_readings": 5,
    "workload_duration": 30,
    "min_workload_interval": 600
}
```

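## Monitoring Data

The system reports detailed monitoring data:

### GPU statistics
- GPU utilization percentage
- GPU memory usage
- GPU temperature
- Power draw
- Device name and index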
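
The fields listed above can all be read from `nvidia-smi`, which the requirements say must be available. The following is a minimal sketch of one way to collect them; whether the service itself uses `nvidia-smi` or the NVML Python bindings is not stated in this README, and `parse_gpu_csv` and `query_gpus` are hypothetical names.

```python
import subprocess

# Query fields supported by `nvidia-smi --query-gpu`.
FIELDS = ["index", "name", "utilization.gpu", "memory.used",
          "memory.total", "temperature.gpu", "power.draw"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the requested fields
        gpus.append(dict(zip(FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi once; values stay strings because some GPUs report [N/A]."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on captured output from a machine without a GPU.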

### System information
- CPU utilization
- Memory utilization
- System load
- Hostname and uptime

## Logs and Statistics

### Log file locations
- **Service log**: `/var/log/gpu-monitor/gpu_monitor.log`
- **Statistics**: `/var/log/gpu-monitor/gpu_stats.json`
- **System journal**: `journalctl -u gpu-monitor.service`

### Viewing logs
```bash
# Follow the service log in real time
gpu-monitor logs

# Follow the system journal
journalctl -u gpu-monitor.service -f

# Show statistics
gpu-monitor stats
```

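## Troubleshooting

### Common problems

1. **nvidia-smi is unavailable**
   ```bash
   # Check the NVIDIA driver
   nvidia-smi

   # If this fails, reinstall the NVIDIA driver
   ```

2. **Permission problems**
   ```bash
   # Confirm the service is running as the expected user
   sudo systemctl status gpu-monitor.service
   ```

3. **Workload does not start**
   ```bash
   # Check the configuration file
   gpu-monitor config

   # Inspect the detailed log
   gpu-monitor logs
   ```

4. **GPU temperature too high**
    - Check the cooling system
    - Lower the `max_temperature` setting so workloads stop earlier
    - Reduce `workload_duration`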

### Debug mode

```bash
# Stop the service
sudo systemctl stop gpu-monitor.service

# Run in the foreground to see detailed output
cd /opt/gpu-monitor
sudo -u gpu-monitor python3 gpu_monitor_server.py --config /etc/gpu-monitor/config.json
```

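## Performance Tuning

### 1. Choose a suitable backend

- **PyTorch**: The most recommended option, with rich GPU operation support
- **TensorFlow**: Suited to systems that already have a TensorFlow environment
- **CuPy**: Lightweight, suited to simple computation
- **Simple**: The basic backend, with the best compatibility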
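
One plausible way to resolve `"workload_backend": "auto"` is to probe for each library in the order of preference listed above. The sketch below is an assumption, not the project's implementation: `resolve_backend` is a hypothetical name, and both the probing order and the `"simple"` fallback name are inferred from this list.

```python
import importlib.util

# Assumed preference order; the second item is each library's import name.
CANDIDATES = [("pytorch", "torch"), ("tensorflow", "tensorflow"), ("cupy", "cupy")]


def _installed(module_name):
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_backend(requested="auto", is_installed=_installed):
    """Map the `workload_backend` setting to a concrete backend name."""
    if requested != "auto":
        return requested  # an explicit choice is passed through unchanged
    for backend, module in CANDIDATES:
        if is_installed(module):
            return backend
    return "simple"  # fallback when no GPU library is importable


print(resolve_backend("auto"))
```

Using `find_spec` avoids importing heavy libraries just to detect them. It only shows that a package is installed, not that it can actually reach a GPU.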

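### 2. Adjust the monitoring interval

- High-frequency monitoring (1-3 seconds): for scenarios that need fast response
- Standard monitoring (5 seconds): balances overhead and response time
- Low-frequency monitoring (10-30 seconds): for resource-constrained environments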
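
To make the trade-off concrete, here is a minimal sketch of a polling loop driven by `monitoring_interval`. The function `monitor` and its parameters are hypothetical; the `sleep` argument is injectable only so the loop can be exercised without waiting.

```python
import time


def monitor(read_utilization, on_reading, interval=5.0,
            max_iterations=None, sleep=time.sleep):
    """Call read_utilization once per interval and pass each result to on_reading."""
    count = 0
    while max_iterations is None or count < max_iterations:
        started = time.monotonic()
        on_reading(read_utilization())
        count += 1
        # Subtract the time spent reading so the period stays close to `interval`.
        sleep(max(0.0, interval - (time.monotonic() - started)))
    return count


# Example: three readings from a stand-in sensor, without real delays.
readings = []
monitor(lambda: 42, readings.append, interval=5.0,
        max_iterations=3, sleep=lambda seconds: None)
print(readings)
```

A shorter interval reacts faster to idle GPUs but runs the reader more often, which is the overhead the list above refers to.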

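### 3. Memory management

The system manages GPU memory automatically to avoid memory leaks:
- Periodically clears the GPU cache
- Limits the size of stored history
- Automatically restarts failed processes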
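
Limiting the size of stored history is commonly done with a fixed-length buffer. The sketch below shows that idea with `collections.deque`; the class `ReadingHistory` and the default size of 1000 are assumptions, since the README does not describe how the service stores its history.

```python
from collections import deque


class ReadingHistory:
    """Keeps only the most recent `max_size` readings; older ones are dropped."""

    def __init__(self, max_size=1000):
        # A deque with maxlen discards the oldest item automatically when full.
        self._readings = deque(maxlen=max_size)

    def add(self, utilization):
        self._readings.append(utilization)

    def average(self):
        if not self._readings:
            return 0.0
        return sum(self._readings) / len(self._readings)

    def __len__(self):
        return len(self._readings)


history = ReadingHistory(max_size=3)
for value in (10, 20, 30, 40):
    history.add(value)
print(len(history), history.average())  # the reading 10 has been dropped
```

Because the buffer has a fixed upper bound, memory use stays constant no matter how long the monitor runs.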

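## Security Considerations

1. **Service user**: The service runs as a dedicated `gpu-monitor` user
2. **File permissions**: Access to the configuration and log files is restricted
3. **Resource limits**: systemd limits memory and file-handle usage
4. **Network isolation**: The service opens no network ports by default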

## Updates and Maintenance

### Updating the configuration

```bash
# Edit the configuration
gpu-monitor config

# Restart the service to apply the configuration
gpu-monitor restart
```

### Checking the version

```bash
python3 /opt/gpu-monitor/gpu_monitor_server.py --version
```

### Uninstalling the service

```bash
# Stop and disable the service
sudo systemctl stop gpu-monitor.service
sudo systemctl disable gpu-monitor.service

# Remove the service file
sudo rm /etc/systemd/system/gpu-monitor.service
sudo systemctl daemon-reload

# Remove installed files (optional)
sudo rm -rf /opt/gpu-monitor
sudo rm -rf /etc/gpu-monitor
sudo rm -rf /var/log/gpu-monitor
sudo userdel gpu-monitor
```

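## Support

If you run into problems:

1. Check the log files to identify the cause
2. Check the GPU driver and hardware status
3. Verify that the configuration file is correctly formatted
4. Make sure enough system resources are available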

---

## English Quick Start

For English users, here's a quick start guide:

### Installation
```bash
sudo ./install_gpu_monitor.sh
```

### Management
```bash
gpu-monitor status    # Check service status
gpu-monitor logs      # View live logs
gpu-monitor config    # Edit configuration
gpu-monitor restart   # Restart service
```

### Configuration
The system monitors GPU usage and automatically starts high-utilization workloads when usage falls below the configured threshold (default: 20%). Key settings in `/etc/gpu-monitor/config.json`:

- `low_usage_threshold`: Trigger workload below this GPU usage %
- `high_usage_threshold`: Stop workload above this GPU usage %
- `max_temperature`: Stop workload above this temperature (°C)
- `workload_duration`: How long to run workload (seconds)

The system supports multiple GPU computation backends (PyTorch, TensorFlow, CuPy) and includes comprehensive logging and monitoring features.