# ddl-platform-iwqos

**Repository Path**: yzluo2023/ddl-platform-iwqos

## Basic Information

- **Project Name**: ddl-platform-iwqos
- **Description**: Resource-scheduling platform for distributed training jobs
- **Primary Language**: Python
- **License**: GPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 3
- **Created**: 2024-12-28
- **Last Updated**: 2024-12-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Work on yzluo Branch

## Mostly focused on the pair-job training section

1. Both `mono and twin data` are stored on this branch, in .../utils/mono_data/ and .../utils/twin_data/
2. GPU memory occupation of different jobs at varying batch sizes, `gmem_info`, is in .../utils/gmem_info/
3. $\xi$ estimation program `get_xi.py`, in .../utils/get_xi.py
4. Mono and twin log-file parser `log_parser.py`, in .../utils/log_parser.py
5. nvidia-smi command & feedback collector `nvidia_checker.py`, in .../utils/nvidia_checker.py (a sketch of such a collector appears at the end of this README)
6. `pair_job_train.py`, the Python control script that launches 2 jobs concurrently on a given number of GPUs
7. `rename_tool.py`, used to rename pair-train log files; now deprecated
8. `xi_fixer.py`, which fixes negative $\xi$ values produced by `get_xi.py` by retesting

# Work on dev Branch

## 1. resource_manager policies added

1. In `resource_manager.py`, two new resource-allocation policies were added (a sketch of both appears at the end of this README):
   - All Shared policy, also known as **SRTF-ARS**: unconditionally allows a job to share its GPUs with, and only with, 1 new job
   - Cond Shared policy, a.k.a. **SRTF-WRS**: chooses a job to share with by priority; if not enough shared GPUs are available, the new job does not start

## 2. Simulator updated

1. Enabled output of Pollux's pending_time and restart information
2. Simulates with trace data from the 2080Ti cluster; workloads were adapted to the traces, e.g. capping GPU counts as `min(gpu_num, 16)`, since the largest configuration measured on the 2080Ti cluster is 16 GPUs
3. Uses the $\xi$ values generated on the `yzluo` branch both in the sharing-policy decision and in the simulator's job-progress calculation

## 3. Interpolation Problems

1. The query point can fall outside the domain covered by the interpolation data (extrapolation was used before, but it easily triggers problem 3, so averaging is now used instead)
2. SciPy's LinearNDInterpolator supports multiple independent variables, but in practice two or more of them can lack corresponding data at the same time, which makes the interpolation fail
3. The $\xi$ data itself still has some problems, which makes extrapolation go wrong: extrapolation is currently based only on linear interpolation, while the $\xi$ data is far from linear. LinearNDInterpolator (piecewise linear interpolation) works reasonably well within the domain, but extrapolating with it easily produces zero-slope errors
4. Regarding point 2: suppose the independent variables are `(gpu count, job0's bs, job1's bs)` and we want the $\xi_0$ value of job0, but job1's bs exceeds the limit of the test data; then it can only be clamped to the maximum value present in the test data
   1. On top of that, if the GPU count is also an irregular value such as 3 or 5, and job0's bs is itself absent from the test data, piecewise linear interpolation no longer applies; we must manually pick lower and upper bounds for the GPU count (e.g. upper bound 4, lower bound 1), run the piecewise linear interpolation once for each bound, and average the two results (see the interpolation sketch at the end of this README)
5. Building on point 4: if both jobs' batch sizes exceed the limits of the test data at the same time, interpolation is impossible
6. If the two jobs are of the same type, e.g. two Cifar10 jobs, the $\xi$ table may give different values when their batch sizes are swapped (e.g. bs0=1024, bs1=512 yields a different result from bs0=512, bs1=1024). This can break the sharing decision, and the computed slowdown of the actual jobs as well
7. For piecewise linear interpolation, every component of the independent variables must take at least two distinct values in the data; otherwise the interpolation cannot proceed
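# Appendix: Illustrative Sketches

The sketches below illustrate the mechanisms described above; none of them reproduces the repository's actual code. First, a minimal nvidia-smi collector in the spirit of `nvidia_checker.py` (item 5 of the yzluo list). The function name `query_gpus` and the chosen query fields are assumptions, not taken from the repository:

```python
# Sketch of an nvidia-smi collector; not the actual nvidia_checker.py.
import subprocess

def query_gpus():
    """Run nvidia-smi in CSV mode and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    gpus = []
    for line in out.strip().splitlines():
        idx, used, total, util = [x.strip() for x in line.split(",")]
        gpus.append({"index": int(idx),
                     "mem_used_mib": int(used),
                     "mem_total_mib": int(total),
                     "util_pct": int(util)})
    return gpus

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```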
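Next, a minimal sketch of the two sharing policies added to `resource_manager.py` (dev section 1). The `Job` structure, its field names, and the `score` priority function are hypothetical stand-ins for the real interfaces:

```python
# Sketch of SRTF-ARS and SRTF-WRS; Job and score are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    remaining_time: float                 # SRTF orders jobs by this
    gpu_demand: int = 1
    gpus: List[int] = field(default_factory=list)
    sharing_with: Optional["Job"] = None

def srtf_ars(running: List[Job], new_job: Job) -> Optional[Job]:
    """All Shared (SRTF-ARS): unconditionally pair the new job with a
    running job that has no partner yet (at most one new job per host)."""
    candidates = [j for j in running if j.sharing_with is None]
    if not candidates:
        return None
    host = min(candidates, key=lambda j: j.remaining_time)
    host.sharing_with = new_job
    new_job.gpus = list(host.gpus)        # run on the host job's GPUs
    return host

def srtf_wrs(running: List[Job], new_job: Job, score) -> Optional[Job]:
    """Cond Shared (SRTF-WRS): pick a partner by the priority `score`;
    if no candidate offers enough shared GPUs, the new job stays pending."""
    candidates = [j for j in running
                  if j.sharing_with is None and len(j.gpus) >= new_job.gpu_demand]
    if not candidates:
        return None                       # not enough shared GPUs
    host = max(candidates, key=score)
    host.sharing_with = new_job
    new_job.gpus = host.gpus[:new_job.gpu_demand]
    return host

if __name__ == "__main__":
    running = [Job("a", 30.0, gpus=[0, 1]), Job("b", 10.0, gpus=[2, 3])]
    new = Job("c", 5.0, gpu_demand=2)
    host = srtf_wrs(running, new, score=lambda j: -j.remaining_time)
    print(host.name if host else "pending", new.gpus)
```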
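Finally, a sketch of the out-of-range handling from "Interpolation Problems" (clamping batch sizes to the tested range per point 4, and averaging over GPU-count bounds per point 4.1). The sample `points`/`xi_values` arrays are made up for illustration and do not come from the measured $\xi$ data:

```python
# Sketch of clamping + GPU-bound averaging around LinearNDInterpolator.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Toy samples of (gpu_num, bs0, bs1) -> xi_0; every component takes at
# least two distinct values, as point 7 requires.
points = np.array([
    [1, 256, 256], [1, 512, 256], [1, 256, 512], [1, 512, 512],
    [4, 256, 256], [4, 512, 256], [4, 256, 512], [4, 512, 512],
], dtype=float)
xi_values = np.array([0.90, 0.80, 0.85, 0.70, 0.95, 0.88, 0.90, 0.80])

interp = LinearNDInterpolator(points, xi_values)

def estimate_xi(gpu_num, bs0, bs1):
    """Clamp out-of-range inputs, and average two interpolations when
    gpu_num is not a tested value (e.g. 3 or 5)."""
    tested_gpus = np.unique(points[:, 0])
    # Clamp each component into the tested range (point 4).
    gpu_num = float(np.clip(gpu_num, tested_gpus.min(), tested_gpus.max()))
    bs0 = float(np.clip(bs0, points[:, 1].min(), points[:, 1].max()))
    bs1 = float(np.clip(bs1, points[:, 2].min(), points[:, 2].max()))
    if gpu_num in tested_gpus:
        return float(interp(gpu_num, bs0, bs1))
    # Irregular GPU count: interpolate at the nearest tested bounds
    # and average the two results (point 4.1).
    lo = tested_gpus[tested_gpus < gpu_num].max()
    hi = tested_gpus[tested_gpus > gpu_num].min()
    return float((interp(lo, bs0, bs1) + interp(hi, bs0, bs1)) / 2)

# gpu_num=3 is untested (averaged over 1 and 4); bs0=1024 is clamped to 512.
print(estimate_xi(3, 1024, 384))
```

Note that, per point 5, if both batch sizes exceed the tested limits at once, this clamping scheme degenerates to evaluating at a corner of the data grid, which is exactly the case the README flags as not interpolatable.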