| name | about | labels |
| --- | --- | --- |
| Bug Report | Use this template for reporting a bug | kind/bug |
llama2-175B on 480 NPUs: error when loading the compile cache
Hardware Environment / 硬件环境 (Mandatory / 必填): Ascend
/device ascend
Software Environment / 软件环境 (Mandatory / 必填):
-- MindSpore version: 2.2.11 (from the conda environment in the traceback below)
-- Python version: Python 3.9
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Execution Mode / 执行模式 (Mandatory / 必填): Graph
/mode graph
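For context, the compile cache involved in this report is typically enabled through environment variables before launching training. A minimal sketch (the cache directory here is an illustrative assumption, not the path used in this run):

```python
import os

# Enable the MindSpore compile cache so a second launch can load the
# graphs compiled by the first launch instead of recompiling them.
os.environ["MS_COMPILER_CACHE_ENABLE"] = "1"

# Directory where the compiled-graph cache is stored (illustrative path).
os.environ["MS_COMPILER_CACHE_PATH"] = "./compile_cache"

print(os.environ["MS_COMPILER_CACHE_ENABLE"])
```

The error below occurs on the second launch, when the cached graphs are loaded rather than rebuilt.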
[ERROR] GE_ADPT(96479,fffc9effd1e0,python):2024-04-08-17:31:49.581.897 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:397] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
[CRITICAL] DEVICE(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.641 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:994] RunGraphRefMode] Exec graph failed
E19999: Inner Error!
E19999 Assert ((GetModelInputBaseicAddrByIndex(static_cast<uint32_t>(i), inputs_pls[i], basic_addr)) == ge::SUCCESS) failed[FUNC:UpdateKnownNodeArgs][FILE:davinci_model.cc][LINE:3950]
TraceBack (most recent call last):
Update all node args failed.[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]
Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]
Assert ((ExecuteModel(model_id, stream, async_mode, input_buffer, input_desc, output_buffer, output_desc)) == ge::SUCCESS) failed[FUNC:ExecuteModel][FILE:model_manager.cc][LINE:1480]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
[Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:823]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
[INFO] DEBUG(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.699 [mindspore/ccsrc/pipeline/jit/ps/debug/trace.cc:117] TraceGraphEval] Length of analysis graph stack is empty.
[INFO] DEBUG(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.727 [mindspore/ccsrc/pipeline/jit/ps/debug/trace.cc:418] GetEvalStackInfo] Get graph analysis information begin
[INFO] DEBUG(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.747 [mindspore/ccsrc/pipeline/jit/ps/debug/trace.cc:421] GetEvalStackInfo] Length of analysis information stack is empty.
[INFO] PIPELINE(96479,ffffa0f111c0,python):2024-04-08-17:32:49.584.269 [mindspore/ccsrc/include/common/utils/python_utils.h:368] HandleExceptionRethrow] Caught exception: Exec graph failed
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999 Assert ((GetModelInputBaseicAddrByIndex(static_cast<uint32_t>(i), inputs_pls[i], basic_addr)) == ge::SUCCESS) failed[FUNC:UpdateKnownNodeArgs][FILE:davinci_model.cc][LINE:3950]
TraceBack (most recent call last):
Update all node args failed.[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]
Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]
Assert ((ExecuteModel(model_id, stream, async_mode, input_buffer, input_desc, output_buffer, output_desc)) == ge::SUCCESS) failed[FUNC:ExecuteModel][FILE:model_manager.cc][LINE:1480]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
[Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:823]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:994 RunGraphRefMode
2024-04-08 17:32:49,589 - mindformers[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
File "/home/worker_479/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
result = run_func(*args, **kwargs)
File "/home/worker_479/run_mindformer.py", line 143, in main
create_task_trainer(config)
File "/home/worker_479/run_mindformer.py", line 85, in create_task_trainer
trainer.train(config, is_full_config=True)
File "/home/worker_479/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 97, in train
self.training_process(
File "/home/worker_479/mindformers/trainer/base_trainer.py", line 772, in training_process
model.train(config.runner_config.epochs, dataset,
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1080, in train
self._train(epoch,
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
func(self, *args, **kwargs)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 634, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback,
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 680, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1589, in __call__
return self.run(obj, *args, phase=phase)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1628, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 121, in wrapper
results = fn(*arg, **kwargs)
File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1608, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Exec graph failed
[INFO] PIPELINE(96479,ffffa0f111c0,python):2024-04-08-17:32:50.948.968 [mindspore/ccsrc/pipeline/jit/ps/init.cc:531] operator()] Start releasing dataset handles...
[INFO] MD(96479,ffffa0f111c0,python):2024-04-08-17:32:50.949.124 [mindspore/ccsrc/minddata/dataset/engine/python_runtime_context.cc:22] Terminate] Terminating a Dataset PythonRuntime.
[INFO] MD(96479,ffffa0f111c0,python):2024-04-08-17:32:50.951.701 [mindspore/ccsrc/minddata/dataset/engine/python_runtime_context.cc:22] Terminate] Terminating a Dataset PythonRuntime.
[WARNING] MD(96479,ffffa0f111c0,python):2024-04-08-17:32:50.951.829 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:163] ~DataQueueOp]
preprocess_batch: 100;
batch_queue: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
push_start_time -> push_end_time
2024-04-08-17:28:51.790.383 -> 2024-04-08-17:28:51.791.597
2024-04-08-17:28:51.800.480 -> 2024-04-08-17:28:51.801.471
2024-04-08-17:28:51.810.400 -> 2024-04-08-17:28:51.811.462
2024-04-08-17:28:51.820.747 -> 2024-04-08-17:28:51.821.687
2024-04-08-17:28:51.831.351 -> 2024-04-08-17:28:51.832.537
2024-04-08-17:28:51.841.474 -> 2024-04-08-17:28:51.842.334
2024-04-08-17:28:51.851.906 -> 2024-04-08-17:28:51.853.441
2024-04-08-17:28:51.862.491 -> 2024-04-08-17:28:51.863.570
2024-04-08-17:28:51.872.971 -> 2024-04-08-17:28:51.874.221
2024-04-08-17:28:51.883.176 -> 2024-04-08-17:28:51.884.237
For more details, please refer to the FAQ at https://www.mindspore.cn/docs/en/master/faq/data_processing.html.
Please assign a maintainer to check this issue.
@duanjiali
Thanks for the question. You can comment //mindspore-assistant to get help faster:
root cause: the compile cache contains FlashAttention (FA), which older GE versions do not support when loading the cache
fix solution: upgrade the CANN version
No UT/ST test cases needed.
Cause: with rank_table startup, HCCL identifies a communication domain by group_name, whereas with dynamic cluster startup it identifies the domain by commHandle. The commHandle saved during the first compilation cannot be mapped by HCCL to the new commHandle, so on the second compilation the stale commHandle is a dangling pointer, and HCCL core dumps.
Solution (formal fix): for dynamic cluster startup, HCCL adds an interface that maps a group name to a commHandle; MindSpore passes the group name to GE, and HCCL resolves the new commHandle from the group name.
No UT/ST needed.
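The mapping problem can be sketched with a toy model (all names below are illustrative stand-ins, not the real HCCL API): a raw handle cached by the first process is meaningless in the second process, while a lookup keyed by group name always resolves to the live handle.

```python
# Toy model of the stale-handle problem; HcclRuntime and its methods are
# hypothetical, used only to illustrate the failure mode and the fix.
class HcclRuntime:
    def __init__(self, base_addr):
        # base_addr mimics the fact that each process launch allocates
        # communicator objects at different addresses.
        self._handles = {}
        self._next = base_addr

    def init_comm(self, group_name):
        """Allocate a fresh communicator handle for a group."""
        handle = self._next
        self._next += 8
        self._handles[group_name] = handle
        return handle

    def lookup(self, group_name):
        """The fix: resolve the *current* handle from the group name."""
        return self._handles[group_name]

# First run: the compile cache effectively stores the raw handle value.
run1 = HcclRuntime(base_addr=0x1000)
cached_handle = run1.init_comm("hccl_world_group")

# Second run (loading the cache): handles are re-allocated elsewhere,
# so the cached value now points at nothing -- a dangling reference.
run2 = HcclRuntime(base_addr=0x2000)
fresh_handle = run2.init_comm("hccl_world_group")
assert cached_handle != fresh_handle

# Resolving by group name instead always yields the live handle.
assert run2.lookup("hccl_world_group") == fresh_handle
```

This is why passing the group name through to GE, and letting HCCL do the handle lookup at load time, removes the dangling-pointer crash.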
Verified with llama2-57b trained on 112 NPUs: compile time dropped from 51 minutes to 17 minutes, with identical loss.
First compilation:
Second run, loading the compile cache: