MindSpore / community

sunbeilei authored 2020-10-28 12:31 . !73 adaptive distributed training MEP
title authors owning-sig participating-sigs status creation-date reviewers approvers stage milestone
MEP-ADAPTIVE @SunnyBeike adaptivetraining adaptivetraining provisional 2020-10-27 TBD beta beta : "v1.0"

MEP-ADAPTIVE: Adaptive Distributed Training System

Summary

The Adaptive Distributed Training System aims to train neural networks with elastic resources.

Motivation

Improving the resource utilization of a deep learning cluster is a paramount concern for many AI practitioners. A promising approach is to use elastic deep learning systems. These systems allow users to dynamically change the number of training resources allocated to training jobs. Hence, practitioners can pack a large number of training jobs into a cluster, significantly improving cluster utilization.

Though promising, elastic deep learning systems are difficult to deploy in practice. State-of-the-art data-parallel elastic deep learning systems couple the number of training resources with a critical learning hyper-parameter: the SGD batch size. Any scaling decision made by the cluster scheduler therefore alters the SGD batch size, which affects training results and can even cause training to fail to converge.
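The coupling described above can be illustrated with a small arithmetic sketch. It assumes the common data-parallel setup in which every worker processes a fixed per-worker batch; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def global_batch_size(per_worker_batch: int, num_workers: int) -> int:
    """Effective SGD batch size when each worker keeps a fixed local batch."""
    if per_worker_batch <= 0 or num_workers <= 0:
        raise ValueError("batch size and worker count must be positive")
    return per_worker_batch * num_workers


if __name__ == "__main__":
    # Hypothetical scaling decisions made by a cluster scheduler.
    for workers in (2, 4, 8):
        print(f"{workers} workers -> global batch {global_batch_size(32, workers)}")
```

Under this assumption, every scaling decision changes the effective batch size that SGD sees, which is the hyper-parameter change the paragraph above warns about.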

Goals

In this project, we will enable the cluster scheduler to dynamically scale a training job without affecting its SGD batch size. To achieve this, we will explore a novel method that decouples the SGD batch size from the number of training resources, so that a change in training resources does not affect convergence.
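This proposal does not specify the decoupling method. As a point of reference only, the sketch below shows one simple way to hold the global batch size constant while the worker count changes: the fixed global batch is re-partitioned across however many workers are currently available. The function name is hypothetical, and the sketch ignores per-worker memory limits and the need to weight gradient averaging by shard size.

```python
from typing import List


def split_global_batch(global_batch: int, num_workers: int) -> List[int]:
    """Partition a fixed global batch into per-worker shard sizes.

    Illustrative only: this is not the method proposed by this MEP.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if global_batch < num_workers:
        raise ValueError("global batch must give every worker at least one sample")
    base, extra = divmod(global_batch, num_workers)
    # The first `extra` workers take one additional sample each.
    return [base + 1 if rank < extra else base for rank in range(num_workers)]


if __name__ == "__main__":
    # The global batch stays at 128 regardless of the worker count.
    for workers in (2, 3, 8):
        shards = split_global_batch(128, workers)
        print(workers, shards, sum(shards))
```

Because the shard sizes always sum to the same global batch, the scheduler can add or remove workers in this sketch without changing the batch size used by SGD.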

Non-Goals

  • None