# ddl-platform-iwqos

**Repository Path**: yzluo2023/ddl-platform-iwqos

## Basic Information

- **Project Name**: ddl-platform-iwqos
- **Description**: Resource-scheduling platform for distributed training jobs
- **Primary Language**: Python
- **License**: GPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 3
- **Created**: 2024-12-28
- **Last Updated**: 2024-12-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Work on yzluo Branch

## Mostly focused on the pair-job training section

1. Both `mono and twin data` are stored on this branch, in .../utils/mono_data/ and .../utils/twin_data/
2. GPU memory occupation of different jobs at varying batch sizes, `gmem_info`, is in .../utils/gmem_info/
3. $\xi$ estimation program `get_xi.py`, in .../utils/get_xi.py
4. Mono and twin log-file parser `log_parser.py`, in .../utils/log_parser.py
5. nvidia-smi command & feedback collector `nvidia_checker.py`, in .../utils/nvidia_checker.py (a sketch of such a collector appears at the end of this README)
6. `pair_job_train.py`, the Python control script that launches 2 jobs concurrently on a given number of GPUs
7. `rename_tool.py`, used to rename pair-train log files; now deprecated
8. `xi_fixer.py`, which fixes negative $\xi$ values produced by `get_xi.py` by retesting

# Work on dev Branch

## 1. resource_manager policies added

1. In `resource_manager.py`, two new resource-allocation policies were added (a sketch of both appears at the end of this README):
   - All Shared policy, also known as **SRTF-ARS**: unconditionally allows a job to share its GPUs with, and only with, 1 new job
   - Cond Shared policy, a.k.a. **SRTF-WRS**: chooses a job to share with by priority; if not enough shared GPUs are available, the new job does not start

## 2. Simulator updated

1. Enabled output of Pollux's pending_time and restart information
2. Simulates with trace data from the 2080Ti cluster; workloads were adapted to the traces, e.g. capping GPU counts as `min(gpu_num, 16)`, since the largest configuration measured on the 2080Ti cluster is 16 GPUs
3. Uses the $\xi$ values generated on the `yzluo` branch both in the sharing-policy decision and in the simulator's job-progress calculation

## 3. Interpolation Problems

1. The query point can fall outside the domain covered by the interpolation data (extrapolation was used before, but it easily triggers problem 3, so averaging is now used instead)
2. SciPy's LinearNDInterpolator supports multiple independent variables, but in practice two or more of them can lack corresponding data at the same time, which makes the interpolation fail
3. The $\xi$ data itself still has some problems, which makes extrapolation go wrong: extrapolation is currently based only on linear interpolation, while the $\xi$ data is far from linear. LinearNDInterpolator (piecewise linear interpolation) works reasonably well within the domain, but extrapolating with it easily produces zero-slope errors
4. Regarding point 2: suppose the independent variables are `(gpu count, job0's bs, job1's bs)` and we want the $\xi_0$ value of job0, but job1's bs exceeds the limit of the test data; then it can only be clamped to the maximum value present in the test data
   1. On top of that, if the GPU count is also an irregular value such as 3 or 5, and job0's bs is itself absent from the test data, piecewise linear interpolation no longer applies; we must manually pick lower and upper bounds for the GPU count (e.g. upper bound 4, lower bound 1), run the piecewise linear interpolation once for each bound, and average the two results (see the interpolation sketch at the end of this README)
5. Building on point 4: if both jobs' batch sizes exceed the limits of the test data at the same time, interpolation is impossible
6. If the two jobs are of the same type, e.g. two Cifar10 jobs, the $\xi$ table may give different values when their batch sizes are swapped (e.g. bs0=1024, bs1=512 yields a different result from bs0=512, bs1=1024). This can break the sharing decision, and the computed slowdown of the actual jobs as well
7. For piecewise linear interpolation, every component of the independent variables must take at least two distinct values in the data; otherwise the interpolation cannot proceed
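# Appendix: Illustrative Sketches

The sketches below illustrate the mechanisms described above; none of them reproduces the repository's actual code. First, a minimal nvidia-smi collector in the spirit of `nvidia_checker.py` (item 5 of the yzluo list). The function name `query_gpus` and the chosen query fields are assumptions, not taken from the repository:

```python
# Sketch of an nvidia-smi collector; not the actual nvidia_checker.py.
import subprocess

def query_gpus():
    """Run nvidia-smi in CSV mode and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    gpus = []
    for line in out.strip().splitlines():
        idx, used, total, util = [x.strip() for x in line.split(",")]
        gpus.append({"index": int(idx),
                     "mem_used_mib": int(used),
                     "mem_total_mib": int(total),
                     "util_pct": int(util)})
    return gpus

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```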
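Next, a minimal sketch of the two sharing policies added to `resource_manager.py` (dev section 1). The `Job` structure, its field names, and the `score` priority function are hypothetical stand-ins for the real interfaces:

```python
# Sketch of SRTF-ARS and SRTF-WRS; Job and score are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    remaining_time: float                 # SRTF orders jobs by this
    gpu_demand: int = 1
    gpus: List[int] = field(default_factory=list)
    sharing_with: Optional["Job"] = None

def srtf_ars(running: List[Job], new_job: Job) -> Optional[Job]:
    """All Shared (SRTF-ARS): unconditionally pair the new job with a
    running job that has no partner yet (at most one new job per host)."""
    candidates = [j for j in running if j.sharing_with is None]
    if not candidates:
        return None
    host = min(candidates, key=lambda j: j.remaining_time)
    host.sharing_with = new_job
    new_job.gpus = list(host.gpus)        # run on the host job's GPUs
    return host

def srtf_wrs(running: List[Job], new_job: Job, score) -> Optional[Job]:
    """Cond Shared (SRTF-WRS): pick a partner by the priority `score`;
    if no candidate offers enough shared GPUs, the new job stays pending."""
    candidates = [j for j in running
                  if j.sharing_with is None and len(j.gpus) >= new_job.gpu_demand]
    if not candidates:
        return None                       # not enough shared GPUs
    host = max(candidates, key=score)
    host.sharing_with = new_job
    new_job.gpus = host.gpus[:new_job.gpu_demand]
    return host

if __name__ == "__main__":
    running = [Job("a", 30.0, gpus=[0, 1]), Job("b", 10.0, gpus=[2, 3])]
    new = Job("c", 5.0, gpu_demand=2)
    host = srtf_wrs(running, new, score=lambda j: -j.remaining_time)
    print(host.name if host else "pending", new.gpus)
```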
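Finally, a sketch of the out-of-range handling from "Interpolation Problems" (clamping batch sizes to the tested range per point 4, and averaging over GPU-count bounds per point 4.1). The sample `points`/`xi_values` arrays are made up for illustration and do not come from the measured $\xi$ data:

```python
# Sketch of clamping + GPU-bound averaging around LinearNDInterpolator.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Toy samples of (gpu_num, bs0, bs1) -> xi_0; every component takes at
# least two distinct values, as point 7 requires.
points = np.array([
    [1, 256, 256], [1, 512, 256], [1, 256, 512], [1, 512, 512],
    [4, 256, 256], [4, 512, 256], [4, 256, 512], [4, 512, 512],
], dtype=float)
xi_values = np.array([0.90, 0.80, 0.85, 0.70, 0.95, 0.88, 0.90, 0.80])

interp = LinearNDInterpolator(points, xi_values)

def estimate_xi(gpu_num, bs0, bs1):
    """Clamp out-of-range inputs, and average two interpolations when
    gpu_num is not a tested value (e.g. 3 or 5)."""
    tested_gpus = np.unique(points[:, 0])
    # Clamp each component into the tested range (point 4).
    gpu_num = float(np.clip(gpu_num, tested_gpus.min(), tested_gpus.max()))
    bs0 = float(np.clip(bs0, points[:, 1].min(), points[:, 1].max()))
    bs1 = float(np.clip(bs1, points[:, 2].min(), points[:, 2].max()))
    if gpu_num in tested_gpus:
        return float(interp(gpu_num, bs0, bs1))
    # Irregular GPU count: interpolate at the nearest tested bounds
    # and average the two results (point 4.1).
    lo = tested_gpus[tested_gpus < gpu_num].max()
    hi = tested_gpus[tested_gpus > gpu_num].min()
    return float((interp(lo, bs0, bs1) + interp(hi, bs0, bs1)) / 2)

# gpu_num=3 is untested (averaged over 1 and 4); bs0=1024 is clamped to 512.
print(estimate_xi(3, 1024, 384))
```

Note that, per point 5, if both batch sizes exceed the tested limits at once, this clamping scheme degenerates to evaluating at a corner of the data grid, which is exactly the case the README flags as not interpolatable.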