MindSpore / mindspore

[ST][MS] llama2-175B reports an error after loading the compile cache

DONE
Bug-Report (member)
Created 2024-04-08 20:23
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

llama2-175B on 480 cards reports an error when loading the compile cache.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/GPU/CPU/kirin/other chips

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

Steps to reproduce the issue (Mandatory)

Describe the expected behavior (Mandatory)

Related log / screenshot (Mandatory)

[ERROR] GE_ADPT(96479,fffc9effd1e0,python):2024-04-08-17:31:49.581.897 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:397] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
[CRITICAL] DEVICE(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.641 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:994] RunGraphRefMode] Exec graph failed
E19999: Inner Error!
E19999  Assert ((GetModelInputBaseicAddrByIndex(static_cast<uint32_t>(i), inputs_pls[i], basic_addr)) == ge::SUCCESS) failed[FUNC:UpdateKnownNodeArgs][FILE:davinci_model.cc][LINE:3950]
       TraceBack (most recent call last):
       Update all node args failed.[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]
       Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]
       Assert ((ExecuteModel(model_id, stream, async_mode, input_buffer, input_desc, output_buffer, output_desc)) == ge::SUCCESS) failed[FUNC:ExecuteModel][FILE:model_manager.cc][LINE:1480]
       GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
       [Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:823]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
[INFO] DEBUG(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.699 [mindspore/ccsrc/pipeline/jit/ps/debug/trace.cc:117] TraceGraphEval] Length of analysis graph stack is empty.
[INFO] DEBUG(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.727 [mindspore/ccsrc/pipeline/jit/ps/debug/trace.cc:418] GetEvalStackInfo] Get graph analysis information begin
[INFO] DEBUG(96479,fffc9effd1e0,python):2024-04-08-17:31:49.582.747 [mindspore/ccsrc/pipeline/jit/ps/debug/trace.cc:421] GetEvalStackInfo] Length of analysis information stack is empty.
[INFO] PIPELINE(96479,ffffa0f111c0,python):2024-04-08-17:32:49.584.269 [mindspore/ccsrc/include/common/utils/python_utils.h:368] HandleExceptionRethrow] Caught exception: Exec graph failed

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999  Assert ((GetModelInputBaseicAddrByIndex(static_cast<uint32_t>(i), inputs_pls[i], basic_addr)) == ge::SUCCESS) failed[FUNC:UpdateKnownNodeArgs][FILE:davinci_model.cc][LINE:3950]
       TraceBack (most recent call last):
       Update all node args failed.[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]
       Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]
       Assert ((ExecuteModel(model_id, stream, async_mode, input_buffer, input_desc, output_buffer, output_desc)) == ge::SUCCESS) failed[FUNC:ExecuteModel][FILE:model_manager.cc][LINE:1480]
       GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
       [Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:823]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:994 RunGraphRefMode

2024-04-08 17:32:49,589 - mindformers[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
 File "/home/worker_479/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "/home/worker_479/run_mindformer.py", line 143, in main
   create_task_trainer(config)
 File "/home/worker_479/run_mindformer.py", line 85, in create_task_trainer
   trainer.train(config, is_full_config=True)
 File "/home/worker_479/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 97, in train
   self.training_process(
 File "/home/worker_479/mindformers/trainer/base_trainer.py", line 772, in training_process
   model.train(config.runner_config.epochs, dataset,
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1080, in train
   self._train(epoch,
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
   func(self, *args, **kwargs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 634, in _train
   self._train_dataset_sink_process(epoch, train_dataset, list_callback,
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
   outputs = train_network(*inputs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 680, in __call__
   out = self.compile_and_run(*args, **kwargs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run
   return _cell_graph_executor(self, *new_args, phase=self.phase)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1589, in __call__
   return self.run(obj, *args, phase=phase)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1628, in run
   return self._exec_pip(obj, *args, phase=phase_real)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 121, in wrapper
   results = fn(*arg, **kwargs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1608, in _exec_pip
   return self._graph_executor(args, phase)
RuntimeError: Exec graph failed

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999  Assert ((GetModelInputBaseicAddrByIndex(static_cast<uint32_t>(i), inputs_pls[i], basic_addr)) == ge::SUCCESS) failed[FUNC:UpdateKnownNodeArgs][FILE:davinci_model.cc][LINE:3950]
       TraceBack (most recent call last):
       Update all node args failed.[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]
       Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]
       Assert ((ExecuteModel(model_id, stream, async_mode, input_buffer, input_desc, output_buffer, output_desc)) == ge::SUCCESS) failed[FUNC:ExecuteModel][FILE:model_manager.cc][LINE:1480]
       GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
       [Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:823]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:994 RunGraphRefMode


Traceback (most recent call last):
 File "/home/worker_479/run_mindformer.py", line 365, in <module>
   main(config_)
 File "/home/worker_479/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
   raise exc
 File "/home/worker_479/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "/home/worker_479/run_mindformer.py", line 143, in main
   create_task_trainer(config)
 File "/home/worker_479/run_mindformer.py", line 85, in create_task_trainer
   trainer.train(config, is_full_config=True)
 File "/home/worker_479/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 97, in train
   self.training_process(
 File "/home/worker_479/mindformers/trainer/base_trainer.py", line 772, in training_process
   model.train(config.runner_config.epochs, dataset,
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1080, in train
   self._train(epoch,
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
   func(self, *args, **kwargs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 634, in _train
   self._train_dataset_sink_process(epoch, train_dataset, list_callback,
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
   outputs = train_network(*inputs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 680, in __call__
   out = self.compile_and_run(*args, **kwargs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run
   return _cell_graph_executor(self, *new_args, phase=self.phase)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1589, in __call__
   return self.run(obj, *args, phase=phase)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1628, in run
   return self._exec_pip(obj, *args, phase=phase_real)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 121, in wrapper
   results = fn(*arg, **kwargs)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/common/api.py", line 1608, in _exec_pip
   return self._graph_executor(args, phase)
RuntimeError: Exec graph failed

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999  Assert ((GetModelInputBaseicAddrByIndex(static_cast<uint32_t>(i), inputs_pls[i], basic_addr)) == ge::SUCCESS) failed[FUNC:UpdateKnownNodeArgs][FILE:davinci_model.cc][LINE:3950]
       TraceBack (most recent call last):
       Update all node args failed.[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]
       Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]
       Assert ((ExecuteModel(model_id, stream, async_mode, input_buffer, input_desc, output_buffer, output_desc)) == ge::SUCCESS) failed[FUNC:ExecuteModel][FILE:model_manager.cc][LINE:1480]
       GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
       [Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 2, stream = 0x47302520.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:823]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:994 RunGraphRefMode

[INFO] PIPELINE(96479,ffffa0f111c0,python):2024-04-08-17:32:50.948.968 [mindspore/ccsrc/pipeline/jit/ps/init.cc:531] operator()] Start releasing dataset handles...
[INFO] MD(96479,ffffa0f111c0,python):2024-04-08-17:32:50.949.124 [mindspore/ccsrc/minddata/dataset/engine/python_runtime_context.cc:22] Terminate] Terminating a Dataset PythonRuntime.
[INFO] MD(96479,ffffa0f111c0,python):2024-04-08-17:32:50.951.701 [mindspore/ccsrc/minddata/dataset/engine/python_runtime_context.cc:22] Terminate] Terminating a Dataset PythonRuntime.
[WARNING] MD(96479,ffffa0f111c0,python):2024-04-08-17:32:50.951.829 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:163] ~DataQueueOp] 
preprocess_batch: 100;
batch_queue: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
           push_start_time -> push_end_time
2024-04-08-17:28:51.790.383 -> 2024-04-08-17:28:51.791.597
2024-04-08-17:28:51.800.480 -> 2024-04-08-17:28:51.801.471
2024-04-08-17:28:51.810.400 -> 2024-04-08-17:28:51.811.462
2024-04-08-17:28:51.820.747 -> 2024-04-08-17:28:51.821.687
2024-04-08-17:28:51.831.351 -> 2024-04-08-17:28:51.832.537
2024-04-08-17:28:51.841.474 -> 2024-04-08-17:28:51.842.334
2024-04-08-17:28:51.851.906 -> 2024-04-08-17:28:51.853.441
2024-04-08-17:28:51.862.491 -> 2024-04-08-17:28:51.863.570
2024-04-08-17:28:51.872.971 -> 2024-04-08-17:28:51.874.221
2024-04-08-17:28:51.883.176 -> 2024-04-08-17:28:51.884.237
For more details, please refer to the FAQ at https://www.mindspore.cn/docs/en/master/faq/data_processing.html.
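
A reading aid for the log above, offered as a minimal sketch rather than an official tool. Every frame in the Ascend error message ends in a [FUNC:...][FILE:...][LINE:...] suffix, and the two numeric codes are easier to compare in other forms: 4294967295 is 0xFFFFFFFF (that is, -1 as a signed 32-bit value) and 1343225857 is 0x50100001. The helper names parse_frames and as_signed32 are made up for this example, and no claim is made here about what GE means by either code.

```python
import re

# Matches the "[FUNC:name][FILE:file][LINE:number]" suffix seen on each frame above.
FRAME_RE = re.compile(
    r"\[FUNC:(?P<func>[^\]]+)\]\[FILE:(?P<file>[^\]]+)\]\[LINE:(?P<line>\d+)\]"
)


def parse_frames(text):
    """Return (function, file, line) tuples in the order they appear in text."""
    return [(m["func"], m["file"], int(m["line"])) for m in FRAME_RE.finditer(text)]


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Two frames copied from the log above.
sample = (
    "Update all node args failed."
    "[FUNC:UpdateAllNodeArgs][FILE:davinci_model.cc][LINE:4681]\n"
    "Assert ((UpdateAllNodeArgs(input_data, output_data)) == ge::SUCCESS) failed"
    "[FUNC:CopyModelData][FILE:davinci_model.cc][LINE:4584]\n"
)

frames = parse_frames(sample)
ret_signed = as_signed32(4294967295)  # from "ret is: 4294967295"
error_hex = hex(1343225857)           # from "error code = 1343225857"

for func, file, line in frames:
    print(f"{file}:{line} {func}")
print(ret_signed, error_hex)
```

Pasting the full error message into sample lists the frames from innermost (UpdateKnownNodeArgs) to outermost (RunGraphWithStreamAsync in ge_api.cc), which is the order in which the log prints them.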

Special notes for this issue (Optional)

Comments (7)

duanjiali created the Bug-Report
duanjiali added the kind/bug label
duanjiali added the v2.2.14 label
duanjiali added the sig/parallel label
duanjiali added the attr/function label

Please assign a maintainer to check this issue.
@duanjiali

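Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic-graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to see the error stack and help locate the fault.
  2. For model accuracy tuning problems, refer to the tuning guide on the official website.
  3. If you are reporting a framework bug, make sure the issue provides the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch command that reproduces the error.
  4. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open-source community. We will review it as soon as possible.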

root cause: a compile cache that includes FA (FlashAttention) is not supported by older GE versions
fix solution: upgrade the CANN version

xiaoyao changed the milestone from B-SIG-Parallel to B-SolutionTest
xiaoyao added the rct/cann label
xiaoyao added the rca/others label
xiaoyao added the ctl/solutiontest label
xiaoyao changed the task status from TODO to VALIDATION

No UT/ST test cases need to be added.

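xiaoyao added collaborator xiaoyao
xiaoyao changed the assignee from xiaoyao to duanjiali
duanjiali added collaborator duanjiali
duanjiali changed the assignee from duanjiali to baimz
duanjiali changed the assignee from baimz to duanjiali
duanjiali removed collaborator duanjiali
duanjiali added collaborator baimz
i-robot added the foruda label
i-robot added the foruda label
i-robot added the foruda label
duanjiali changed the task status from VALIDATION to WIP
duanjiali changed the milestone from B-SolutionTest to B-SIG-Parallel
duanjiali added collaborator duanjiali
duanjiali changed the assignee from duanjiali to xiaoyao
duanjiali removed collaborator xiaoyao
zhunaipan added the r2.2 label
xiaoyao changed the milestone from B-SIG-Parallel to B-SolutionTest
xiaoyao added collaborator xiaoyao
xiaoyao changed the assignee from xiaoyao to duanjiali
xiaoyao removed collaborator duanjiali
xiaoyao changed the task status from WIP to VALIDATION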

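Cause: In rank_table mode, HCCL identifies a communication domain by group_name. In dynamic networking mode, it identifies the domain by commHandle. The compile cache keeps the commHandle saved during the first compilation, and HCCL cannot map that old commHandle to the new commHandle. On the second compilation the old commHandle is therefore a dangling pointer, which causes an HCCL core dump.
Fix: The formal solution for dynamic networking is that HCCL adds an interface that maps group name to commHandle. MindSpore passes the group name to GE, and HCCL uses the group name to find the new commHandle.
No UT/ST needs to be added.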
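
The cause and fix described in this comment can be illustrated with a toy model. The sketch below is plain Python and does not call HCCL, GE, or MindSpore. ToyCommRuntime, create_group, handle_for, and is_live are hypothetical names invented for the illustration, and integers stand in for commHandle pointers. It only shows why a cached handle goes stale across launches while a cached group name can still be resolved.

```python
import itertools


class ToyCommRuntime:
    """Toy stand-in for one launch of a communication library (hypothetical)."""

    # Shared counter so that every launch hands out different handle values.
    _next_handle = itertools.count(0x1000, 0x10)

    def __init__(self):
        self._handle_by_group = {}
        self._live_handles = set()

    def create_group(self, group_name):
        """Create a communication domain and return its handle for this launch."""
        handle = next(ToyCommRuntime._next_handle)
        self._handle_by_group[group_name] = handle
        self._live_handles.add(handle)
        return handle

    def handle_for(self, group_name):
        """Look up the current handle by group name (the mapping the fix adds)."""
        return self._handle_by_group[group_name]

    def is_live(self, handle):
        """Report whether a handle belongs to this launch."""
        return handle in self._live_handles


# First launch: compile and write the cache.
run1 = ToyCommRuntime()
old_handle = run1.create_group("group_0")
cache_with_handle = {"comm": old_handle}  # what the failing setup stored
cache_with_name = {"comm": "group_0"}     # what the fix stores instead

# Second launch: a new runtime recreates the group and gets a new handle.
run2 = ToyCommRuntime()
new_handle = run2.create_group("group_0")

stale_is_live = run2.is_live(cache_with_handle["comm"])  # False: old handle is unknown
resolved = run2.handle_for(cache_with_name["comm"])      # name resolves to new handle

print(stale_is_live, resolved == new_handle)
```

In the real system, using the stale handle is not a clean lookup failure as it is here. According to the comment above, it is a dangling pointer and leads to a core dump.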

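llama2-57b training on 112 cards: compile time dropped from 51 minutes to 17 minutes, and the loss is unchanged.
First compilation:
[image]
Second run, loading the compile cache:
[image]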

duanjiali changed the task status from VALIDATION to DONE
