.. Committed by 宦晓玲 on 2024-07-01 14:18 +08:00: modify links 2.3.0

Fault Recovery
==============

.. toctree::
  :maxdepth: 1
  :hidden:

  disaster_recover
  fault_recover

During distributed parallel training, MindSpore provides three recovery methods for problems such as compute node failures or communication interruptions:

- Model Reloading: During training, configuring the parameters to be merged and saved produces a complete model parameter file for each card, which can be loaded directly to resume from a checkpoint. See Model loading in Model saving and loading for details.
- Disaster Recovery in Dynamic Cluster Scenarios: In the dynamic cluster startup scenario, if a process fails, the other processes enter a waiting state. Training can then continue once the failed process is restarted, without restarting the whole cluster (currently supported only on GPU hardware platforms).
- Fault Recovery Based on Redundant Information: In large model training, devices that differ only along the data-parallel dimension hold identical model parameters. This redundant parameter information can serve as a backup: if one node fails, another node holding the same parameters can be used to recover it.
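
The idea behind recovery from redundant information can be sketched in plain Python. This is only an illustration, not MindSpore's implementation: the function names ``redundant_peers`` and ``pick_backup`` are hypothetical, and the rank layout (``rank = dp_index * mp_size + mp_index``) is an assumption made for the example. Under that assumption, ranks sharing the same model-parallel index hold the same parameter slice, so any healthy one of them could act as a backup for a failed rank.

```python
from typing import List, Optional, Set


def redundant_peers(rank: int, dp_size: int, mp_size: int) -> List[int]:
    """Return the other ranks assumed to hold the same parameters as `rank`.

    Assumed (hypothetical) layout: rank = dp_index * mp_size + mp_index.
    Ranks with the same mp_index differ only in the data they process,
    so their model parameters are identical.
    """
    if not 0 <= rank < dp_size * mp_size:
        raise ValueError("rank out of range")
    mp_index = rank % mp_size
    peers = []
    for dp_index in range(dp_size):
        candidate = dp_index * mp_size + mp_index
        if candidate != rank:
            peers.append(candidate)
    return peers


def pick_backup(failed_rank: int, dp_size: int, mp_size: int,
                healthy: Set[int]) -> Optional[int]:
    """Pick a healthy rank that can restore `failed_rank`, or None if none exists."""
    for peer in redundant_peers(failed_rank, dp_size, mp_size):
        if peer in healthy:
            return peer
    return None
```

With ``dp_size=4`` and ``mp_size=2``, for instance, rank 5 shares its parameters with ranks 1, 3 and 7 under this layout. If the data-parallel size is 1, no redundant copy exists and ``pick_backup`` returns ``None``, in which case this recovery method cannot be applied.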