Fault Recovery
==============

View Source on Gitee
.. toctree::
  :maxdepth: 1
  :hidden:

  disaster_recover
  fault_recover

When distributed parallel training encounters problems such as compute node failures or communication interruptions, MindSpore provides three recovery methods:

- Model Reloading: during training, configure merged saving so that each card saves a checkpoint file containing the complete model parameters, which can be loaded directly to resume training. See Model loading in Model saving and loading for details; a minimal save/load sketch follows this list.
- Disaster Recovery in Dynamic Cluster Scenarios: in the dynamic cluster startup scenario, if a process fails, the other processes enter a waiting state, and the training task is resumed by restarting the failed process, without restarting the cluster (currently supported only on the GPU hardware platform); see the configuration sketch after this list.
- Fault Recovery Based on Redundant Information: in large model training, the devices along the data-parallel dimension hold identical copies of the model parameters. This redundancy can be used as a backup: when one node fails, another node holding the same parameters can be used to recover it, as in the checkpoint-transformation sketch after this list.
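
A minimal sketch of the model reloading method, assuming a standard ``mindspore.train.Model`` training script; the placeholder network, prefix, and checkpoint paths are illustrative only. ``integrated_save=True`` in ``CheckpointConfig`` merges parameter slices in (semi-)automatic parallel mode so that each card writes a complete checkpoint:

.. code-block:: python

    import mindspore as ms
    from mindspore import nn
    from mindspore.train import CheckpointConfig, ModelCheckpoint

    net = nn.Dense(16, 10)  # placeholder network; use the real model here

    # integrated_save=True merges parameter slices so every card saves
    # a checkpoint containing the full model parameters.
    config = CheckpointConfig(save_checkpoint_steps=100, integrated_save=True)
    ckpt_cb = ModelCheckpoint(prefix="demo", directory="./ckpt", config=config)
    # Pass the callback to training: model.train(epochs, dataset, callbacks=[ckpt_cb])

    # Recovery: rebuild the network, then load the complete checkpoint directly.
    param_dict = ms.load_checkpoint("./ckpt/demo-1_100.ckpt")
    ms.load_param_into_net(net, param_dict)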
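
For disaster recovery in the dynamic cluster scenario, the disaster_recover page describes enabling recovery through environment variables. A minimal sketch, assuming the ``MS_ENABLE_RECOVERY`` and ``MS_RECOVERY_PATH`` variables from that page and an illustrative persistence path; they must be set before each scheduler and worker process starts:

.. code-block:: python

    import os

    # Set in every process of the dynamically started cluster, before
    # MindSpore initializes, so that surviving processes wait for the
    # failed one to be pulled up instead of exiting.
    os.environ["MS_ENABLE_RECOVERY"] = "1"            # turn on disaster recovery
    os.environ["MS_RECOVERY_PATH"] = "/tmp/recovery"  # path for persistent state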
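
For fault recovery based on redundant information, a minimal sketch of rebuilding a failed rank's checkpoint from the identical copies held by its data-parallel peers, using ``mindspore.rank_list_for_transform`` and ``mindspore.transform_checkpoint_by_rank``; the failed rank number, strategy files, and checkpoint paths are hypothetical. See fault_recover for the full procedure:

.. code-block:: python

    import mindspore as ms

    # Strategy files are produced during training, e.g. via
    # ms.set_auto_parallel_context(strategy_ckpt_save_file=...).
    src_strategy = "./src_strategy.ckpt"
    dst_strategy = "./dst_strategy.ckpt"
    failed_rank = 2  # hypothetical failed device

    # Which ranks hold the parameter slices needed to rebuild the failed rank?
    rank_list = ms.rank_list_for_transform(failed_rank, src_strategy, dst_strategy)
    ckpt_files = {r: f"./ckpt/rank_{r}/model.ckpt" for r in rank_list}

    # Rebuild the failed rank's checkpoint from the redundant copies.
    ms.transform_checkpoint_by_rank(failed_rank, ckpt_files,
                                    f"./ckpt/rank_{failed_rank}/recovered.ckpt",
                                    src_strategy, dst_strategy)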