name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
With enable_parallel_optimizer=True combined with a supported optimizer and without allreduce group fusion, 8p BERT training fails with a GE-related error
Hardware Environment (Ascend/GPU/CPU): Ascend
/device ascend
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
MindSpore version: commit_id = '[sha1]:2f410aa8,[branch]:(HEAD,origin/master,origin/HEAD,master)'
Run package: runpkg_version:Milan_C18/20240517
Execute Mode (Mandatory) (PyNative/Graph): Graph
/mode graph
Test case: test_ms_optimizer_automodelparallel_normal_008
source /home/miniconda3/bin/activate feature_39
export TRAIN_MODE=GRAPH_MODE
export DEVICE_TYPE=Ascend910B_Arm
export ENV_DEVICE=0
source solution_test/env_set.source -e ascend
cd solution_test/remaining/test_scripts/mindspore/features/automodelparallel/optimizer_parallel
pytest -s test_ms_optimizer_automodelparallel.py::test_ms_optimizer_automodelparallel_normal_008
Expected result: training completes normally and the test case passes.
Traceback (most recent call last):
File "/data/jenkins_workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/features/automodelparallel/optimizer_parallel/test_ms_optimizer_bert_without_allreduce_base_loss_8p/run_pretrain.py", line 288, in <module>
run_pretrain()
File "/data/jenkins_workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/features/automodelparallel/optimizer_parallel/test_ms_optimizer_bert_without_allreduce_base_loss_8p/src/model_utils/moxing_adapter.py", line 109, in wrapped_func
run_func(*args, **kwargs)
File "/data/jenkins_workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/features/automodelparallel/optimizer_parallel/test_ms_optimizer_bert_without_allreduce_base_loss_8p/run_pretrain.py", line 282, in run_pretrain
model.train(new_repeat_count, ds, callbacks=callback,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 1082, in train
self._train(epoch,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
func(self, *args, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 636, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 696, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1014, in compile_and_run
self.compile(*args, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 997, in compile
_cell_graph_executor.compile(self, *self._compile_args, phase=self.phase,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/api.py", line 1643, in compile
result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Compile graph kernel_graph1 failed.
E19999: Inner Error!
E19999: 2024-05-18-21:40:40.383.389 Node[Default/network/Switch-op1_kernel_graph2/Default/network/optimizer/Broadcast-op1]input offset [14820334080]should equal to output offset[14820334592]with ref in[23]to output[23][FUNC:CheckRefNodeOffset][FILE:graph_mem_assigner.cc][LINE:2272]
TraceBack (most recent call last):
[Call][PreRun] Failed, graph_id:2, session_id:0.[FUNC:CompileGraph][FILE:graph_manager.cc][LINE:4542]
[Compile][Graph]Compile graph failed, error code:1343225857, session_id:0, graph_id:2.[FUNC:CompileGraph][FILE:ge_api.cc][LINE:1178]
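The key detail in the GE message above is the mismatch between the input and output offsets of the ref node. As a rough triage aid, the sketch below pulls those fields out of the message and computes the gap. The regular expression and the helper name `parse_ref_offset_error` are assumptions written for this single message format; they are not part of MindSpore or CANN, and other GE messages may be laid out differently.

```python
import re

# Error text copied from the GE log in this report.
GE_ERROR = (
    "Node[Default/network/Switch-op1_kernel_graph2/Default/network/optimizer/Broadcast-op1]"
    "input offset [14820334080]should equal to output offset[14820334592]"
    "with ref in[23]to output[23]"
)

# Hypothetical pattern, written only for the CheckRefNodeOffset message shown above.
PATTERN = re.compile(
    r"Node\[(?P<node>[^\]]+)\]\s*input offset\s*\[(?P<in_off>\d+)\]"
    r"\s*should equal to output offset\s*\[(?P<out_off>\d+)\]"
    r"\s*with ref in\s*\[(?P<in_idx>\d+)\]\s*to output\s*\[(?P<out_idx>\d+)\]"
)


def parse_ref_offset_error(text):
    """Return the node name, ref indices, offsets, and offset gap, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    in_off = int(match.group("in_off"))
    out_off = int(match.group("out_off"))
    return {
        "node": match.group("node"),
        "in_idx": int(match.group("in_idx")),
        "out_idx": int(match.group("out_idx")),
        "in_off": in_off,
        "out_off": out_off,
        "delta": out_off - in_off,  # gap in bytes between the two offsets
    }


info = parse_ref_offset_error(GE_ERROR)
# Print the operator name (last path segment) and the offset gap.
print(info["node"].rsplit("/", 1)[-1], info["delta"])  # Broadcast-op1 512
```

For this log, the Broadcast operator in the optimizer has its ref input 23 and output 23 placed 512 bytes apart, whereas GE expects a ref pair to share the same offset. The report does not state why the two offsets differ.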
Assign to 俞超杰
Related DTS ticket: https://dts-szv.clouddragon.huawei.com/DTSPortal/ticket/DTS2024052020303
@wenli This is a CANN issue; confirmed with the tester that it has been resolved.