Fault Recovery
==============

View Source on Gitee
.. toctree::
  :maxdepth: 1
  :hidden:

  disaster_recover
  fault_recover

When distributed parallel training encounters problems such as compute node failures or communication interruptions, MindSpore provides three recovery methods:

- Model Reloading: during training, configure merged saving so that each card saves a checkpoint file containing the complete model parameters, which can be loaded directly to resume training. See Model loading in Model saving and loading for details; a minimal save/load sketch follows this list.
- Disaster Recovery in Dynamic Cluster Scenarios: in the dynamic cluster startup scenario, if a process fails, the other processes enter a waiting state, and the training task is resumed by restarting the failed process, without restarting the cluster (currently supported only on the GPU hardware platform); see the configuration sketch after this list.
- Fault Recovery Based on Redundant Information: in large model training, the devices along the data-parallel dimension hold identical copies of the model parameters. This redundancy can be used as a backup: when one node fails, another node holding the same parameters can be used to recover it, as in the checkpoint-transformation sketch after this list.
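
A minimal sketch of the model reloading method, assuming a standard ``mindspore.train.Model`` training script; the placeholder network, prefix, and checkpoint paths are illustrative only. ``integrated_save=True`` in ``CheckpointConfig`` merges parameter slices in (semi-)automatic parallel mode so that each card writes a complete checkpoint:

.. code-block:: python

    import mindspore as ms
    from mindspore import nn
    from mindspore.train import CheckpointConfig, ModelCheckpoint

    net = nn.Dense(16, 10)  # placeholder network; use the real model here

    # integrated_save=True merges parameter slices so every card saves
    # a checkpoint containing the full model parameters.
    config = CheckpointConfig(save_checkpoint_steps=100, integrated_save=True)
    ckpt_cb = ModelCheckpoint(prefix="demo", directory="./ckpt", config=config)
    # Pass the callback to training: model.train(epochs, dataset, callbacks=[ckpt_cb])

    # Recovery: rebuild the network, then load the complete checkpoint directly.
    param_dict = ms.load_checkpoint("./ckpt/demo-1_100.ckpt")
    ms.load_param_into_net(net, param_dict)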
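
For disaster recovery in the dynamic cluster scenario, the disaster_recover page describes enabling recovery through environment variables. A minimal sketch, assuming the ``MS_ENABLE_RECOVERY`` and ``MS_RECOVERY_PATH`` variables from that page and an illustrative persistence path; they must be set before each scheduler and worker process starts:

.. code-block:: python

    import os

    # Set in every process of the dynamically started cluster, before
    # MindSpore initializes, so that surviving processes wait for the
    # failed one to be pulled up instead of exiting.
    os.environ["MS_ENABLE_RECOVERY"] = "1"            # turn on disaster recovery
    os.environ["MS_RECOVERY_PATH"] = "/tmp/recovery"  # path for persistent state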
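
For fault recovery based on redundant information, a minimal sketch of rebuilding a failed rank's checkpoint from the identical copies held by its data-parallel peers, using ``mindspore.rank_list_for_transform`` and ``mindspore.transform_checkpoint_by_rank``; the failed rank number, strategy files, and checkpoint paths are hypothetical. See fault_recover for the full procedure:

.. code-block:: python

    import mindspore as ms

    # Strategy files are produced during training, e.g. via
    # ms.set_auto_parallel_context(strategy_ckpt_save_file=...).
    src_strategy = "./src_strategy.ckpt"
    dst_strategy = "./dst_strategy.ckpt"
    failed_rank = 2  # hypothetical failed device

    # Which ranks hold the parameter slices needed to rebuild the failed rank?
    rank_list = ms.rank_list_for_transform(failed_rank, src_strategy, dst_strategy)
    ckpt_files = {r: f"./ckpt/rank_{r}/model.ckpt" for r in rank_list}

    # Rebuild the failed rank's checkpoint from the redundant copies.
    ms.transform_checkpoint_by_rank(failed_rank, ckpt_files,
                                    f"./ckpt/rank_{failed_rank}/recovered.ckpt",
                                    src_strategy, dst_strategy)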