[Zhongzhi] [Shanghai Jiao Tong University] [RandLa-Net] [ID0850] NPU training fails with GeOp9_0GEOP::::DoRunAsync Failed

DONE
Bug-Report
Created on 2021-09-14 19:35

I. Symptom (with error log context):
After migrating the model to the NPU, training fails with the error below. The exact cause is unclear.

WARNING:tensorflow:From /cache/user-job-dir/code/RandLANet.py:140: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
****EPOCH 0****
Traceback (most recent call last):
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp9_0GEOP::::DoRunAsync Failed
Error Message is : 
EE9999: Inner Error!
        [driver interface] halMemAlloc failed: device_id=0, size=33554432, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:757]
        [driver interface] halMemAlloc failed: size=33554432, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:792]
        DevMemAlloc huge page failed: deviceId=0, type=2, size=33554432, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:867]
        Device malloc failed, size=33554432, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:335]
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:26]
        Call rtMalloc fail, purpose:Memory for caching., size:33554432, device_id:0[FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:52]
        [driver interface] halMemAlloc failed: device_id=0, size=134217728, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:757]
        [driver interface] halMemAlloc failed: size=134217728, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:792]
        DevMemAlloc huge page failed: deviceId=0, type=2, size=134217728, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:867]
        Device malloc failed, size=134217728, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:335]
        Call rtMalloc fail, purpose:Memory for caching., size:134217728, device_id:0[FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:52]
        FindFreeBlock fail, size:125829632, device_id:0[FUNC:Malloc][FILE:graph_caching_allocator.cc][LINE:151]
        malloc memory failed, device_id = 0, size = 125829184[FUNC:Allocate][FILE:npu_memory_allocator.cc][LINE:97]
        allocate failed, size = 125829152.[FUNC:Create][FILE:tensor_value.cc][LINE:47]
        Param:buffer is nullptr, check invalid[FUNC:AllocateTensor][FILE:task_context.cc][LINE:229]
        [Root-Graph] Error:1343225857 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:214]
        [Root-Graph] Error occurs while launching tasks. quit from preparing nodes.[FUNC:NodeEnqueue][FILE:subgraph_executor.cc][LINE:229]
        failed to execute graph. model_id = 2[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:219]
	 [[{{node GeOp9_0}}]]
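The allocation sizes in the log above are raw byte counts, which makes them hard to read at a glance. As a rough aid, the sketch below pulls the `size=` value out of each `halMemAlloc failed` line and converts it to MiB. The helper names `failed_alloc_sizes` and `to_mib` are made up for illustration, and the sample text is a trimmed copy of lines from this log.

```python
import re

MIB = 1024 * 1024


def failed_alloc_sizes(log_text):
    """Return distinct byte sizes from 'halMemAlloc failed' lines, in first-seen order."""
    sizes = []
    for line in log_text.splitlines():
        if "halMemAlloc failed" not in line:
            continue
        match = re.search(r"\bsize=(\d+)", line)
        if match:
            size = int(match.group(1))
            if size not in sizes:
                sizes.append(size)
    return sizes


def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB


# Trimmed copies of lines from the traceback in this issue.
sample = """\
[driver interface] halMemAlloc failed: device_id=0, size=33554432, type=2, env_type=3, drvRetCode=6!
[driver interface] halMemAlloc failed: size=134217728, deviceId=0, type=2, env_type=3, drvRetCode=6!
FindFreeBlock fail, size:125829632, device_id:0
"""

for size in failed_alloc_sizes(sample):
    print(f"{size} bytes = {to_mib(size):.0f} MiB")
```

Read this way, the driver rejected both a 32 MiB and a 128 MiB request, and the log itself gives the reason as `driver error:out of memory`.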

II. Software versions:
-- CANN version: 5.0.2
-- TensorFlow version: 1.15
-- Python version: 3.7
-- OS version: Ascend: 1 * Ascend 910, CPU: 24 vCPUs, 96 GiB

III. Steps to reproduce:
Run run_sh_S3DIS.py directly; training fails with the error described above.

IV. Log information:
Location of the logs and code

Comments (12)

张烜 created the Bug-Report
张烜 set the linked repository to Ascend/modelzoo
zhujianpeng changed the task status from TODO to Analysing
zhujianpeng set the assignee to 张晓龙

The download link above is broken. Replacement: code and logs

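1. After changing batch_size from 6 to 3 (search for S3DIS in helper_tool.py to find the batch_size setting), the program starts and no longer reports the error described above. However, it then hangs. The console showed the resource usage below: after a few minutes it stops making progress, the process does not exit, and the weights are not updated.
2. With batch_size still at 3 and the weight-update code commented out (line 211 of RandLANet.py), the program runs to completion and prints log output. Comparing the two tests shows that the program hangs at the weight update. Also, even without the weight update, it still runs very slowly.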
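For anyone repeating the batch-size test, here is a minimal sketch of overriding the value without editing the file each time. `ConfigS3DIS` below is a stand-in for the S3DIS config in helper_tool.py (its real contents are assumed, not copied), and `with_batch_size` is a hypothetical helper.

```python
class ConfigS3DIS:
    """Stand-in for the S3DIS config in helper_tool.py (assumed layout)."""
    batch_size = 6  # value used before the change described in this issue


def with_batch_size(cfg_cls, batch_size):
    """Return a subclass of cfg_cls with batch_size overridden; cfg_cls is left untouched."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return type(cfg_cls.__name__, (cfg_cls,), {"batch_size": batch_size})


small_cfg = with_batch_size(ConfigS3DIS, 3)
print(ConfigS3DIS.batch_size, small_cfg.batch_size)
```

Keeping the original class unchanged makes it easy to switch between bs=6 and bs=3 runs when comparing behaviour.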

Attachment: debugging code and console output

王位 added the OpenMind label
颜亚文 edited the description
颜亚文 edited the title

Logs with Debugger enabled
Extraction code: qwe123
OBS link: obs://randla-net/train_out_npu/MA-new-RandLa-Net-master_npu_for_TensorFlow-10-12-15-14/

Latest console output:
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/code/RandLANet.py:132: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

EPOCH 0
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
E19999: Inner Error!
GetInputIndexByName failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.[FUNC:ParseDependencies][FILE:hybrid_model_builder.cc][LINE:379]

 [[{{node GeOp9_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main_S3DIS.py", line 291, in
model.train(dataset)
File "/home/ma-user/modelarts/user-job-dir/code/RandLANet.py", line 191, in train
_, _, summary, l_out, probs, labels, acc = self.sess.run(ops, {self.is_training: True})
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
E19999: Inner Error!
GetInputIndexByName failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.[FUNC:ParseDependencies][FILE:hybrid_model_builder.cc][LINE:379]

 [[{{node GeOp9_0}}]]

Backend error log:
[INFO] GE(98,python):2021-10-12-15:29:39.948.610 [node_item.cc:434]737 SetCtrlSend:Node[trans_TransData_3202] will control node[PartitionedCall_132]
[INFO] GE(98,python):2021-10-12-15:29:39.948.635 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/Sum_14_grad/Tile] will control node[optimizer/gradients/layers/mul_9_grad/Mul]
[INFO] GE(98,python):2021-10-12-15:29:39.948.641 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/Sum_14_grad/Tile] will control node[optimizer/gradients/layers/mul_9_grad/Mul_1]
[INFO] GE(98,python):2021-10-12-15:29:39.948.680 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/mul_9_grad/Mul] will control node[optimizer/gradients/AddN_34]
[INFO] GE(98,python):2021-10-12-15:29:39.948.714 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/mul_9_grad/Mul_1] will control node[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose]
[ERROR] GE(98,python):2021-10-12-15:29:39.948.758 [hybrid_model_builder.cc:378]737 ParseDependencies: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD][Get][InputIndex]failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.
[ERROR] GE(98,python):2021-10-12-15:29:39.948.804 [hybrid_model_builder.cc:479]737 ParseDependentInputNodes: ErrorNo: 4294967295(failed) [LOAD][LOAD]
[ERROR] GE(98,python):2021-10-12-15:29:39.948.809 [hybrid_model_builder.cc:264]737 BuildNodeItem: ErrorNo: 4294967295(failed) [LOAD][LOAD][Invoke][ParseDependentInputNodes]failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)].
[ERROR] GE(98,python):2021-10-12-15:29:39.949.110 [hybrid_model_builder.cc:949]737 LoadGraph: ErrorNo: 4294967295(failed) [LOAD][LOAD][Invoke][LoadDynamicSubgraph]Failed to load root graph, model_name_:.
[ERROR] GE(98,python):2021-10-12-15:29:39.949.117 [hybrid_model_builder.cc:169]737 Build: ErrorNo: 4294967295(failed) [LOAD][LOAD][Invoke][LoadGraph] failed, model_name_:[]
[ERROR] GE(98,python):2021-10-12-15:29:39.949.911 [hybrid_model.cc:49]737 Init: ErrorNo: 4294967295(failed) [LOAD][LOAD][Build][HybridModel] failed.
[ERROR] GE(98,python):2021-10-12-15:29:39.949.918 [hybrid_davinci_model.cc:38]737 Init: ErrorNo: 4294967295(failed) [LOAD][LOAD][Init][HybridModel] failed.
[ERROR] GE(98,python):2021-10-12-15:29:39.949.923 [model_manager.cc:304]737 DoLoadHybridModelOnline: ErrorNo: 4294967295(failed) [LOAD][LOAD][Init][HybridModel] failed. model_id = 2
[INFO] GE(98,python):2021-10-12-15:29:39.949.937 [npu_memory_allocator.cc:107]737 Deallocate:To deallocating buffer, addr = 0x108040636c00
[INFO] GE(98,python):2021-10-12-15:29:39.949.941 [npu_memory_allocator.cc:109]737 Deallocate:Deallocating buffer successfully. addr = 0x108040636c00
[INFO] GE(98,python):2021-10-12-15:29:39.949.950 [graph_caching_allocator.cc:159]737 Free:Free device id = 0
[INFO] GE(98,python):2021-10-12-15:29:39.949.955 [graph_caching_allocator.cc:184]737 FreeBlock:Free block size = 1024
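In the backend log above, the later `[ERROR]` lines look like the same failure propagating up the call chain, so the first one is usually the most useful. The sketch below finds it with a regular expression. The field layout is inferred only from the lines in this issue, the meaning of the number after the bracket (called `tid` here) is a guess, and `first_error` is a hypothetical helper.

```python
import re

# Layout inferred from this issue's log:
# [LEVEL] MODULE(pid,process):timestamp [file:line]tid Function: message
GE_LINE = re.compile(
    r"^\[(?P<level>\w+)\] (?P<module>\w+)\((?P<pid>\d+),(?P<proc>\w+)\):(?P<ts>\S+) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+)\](?P<tid>\d+) (?P<func>\w+):\s*(?P<msg>.*)$"
)


def first_error(log_lines):
    """Return the parsed fields of the first ERROR line, or None if there is none."""
    for raw in log_lines:
        match = GE_LINE.match(raw.strip())
        if match and match.group("level") == "ERROR":
            return match.groupdict()
    return None


# Lines copied from the backend log in this issue.
sample = [
    "[INFO] GE(98,python):2021-10-12-15:29:39.948.610 [node_item.cc:434]737 "
    "SetCtrlSend:Node[trans_TransData_3202] will control node[PartitionedCall_132]",
    "[ERROR] GE(98,python):2021-10-12-15:29:39.948.758 [hybrid_model_builder.cc:378]737 "
    "ParseDependencies: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD][Get][InputIndex]failed, "
    "node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] "
    "inputname: axes.",
    "[ERROR] GE(98,python):2021-10-12-15:29:39.948.804 [hybrid_model_builder.cc:479]737 "
    "ParseDependentInputNodes: ErrorNo: 4294967295(failed) [LOAD][LOAD]",
]

err = first_error(sample)
print(err["file"], err["line"], err["func"])
print(err["msg"])
```

For this log, the first error points at `ParseDependencies` in hybrid_model_builder.cc and names the `ConfusionTranspose` node whose `axes` input could not be found.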

颜亚文 changed the assignee from 张晓龙 to unassigned
颜亚文 set the assignee to 张晓龙

@张晓龙 Forwarding this issue to GE for functional analysis.

张晓龙 added 张晓龙 as a collaborator
张晓龙 changed the assignee from 张晓龙 to 梁昊
张晓龙 removed 张晓龙 as a collaborator

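Hello. Judging from the current error, this is a functional failure. We need to collect Debug-level logs. The procedure is described here:
https://support.huaweicloud.com/tfmigr-cann503alpha2training/atlasma_13_0004.html#section4
Select the Debugger tab and then run the training job.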

Image Path:ascend-share/5.0.3.alpha005_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1019
Log information
Extraction code: 111111

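For the GPU baseline, please verify on an offline ECS server whether bs=6 can run. The ModelArts GPU and the V100 on the offline ECS have different specifications; ModelArts has 64 GB of GPU memory.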

OK, received.

Hello, bs=6 runs normally on the offline ECS server.

Try setting the environment variable export ENABLE_FORCE_V2_CONTROL=1 and rerunning. It converts V1 control operators to V2, which may improve memory reuse.
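If the job is started from a Python launcher rather than a shell, the same variable can be passed through the child process environment. The sketch below assumes the variable has to be present before the training process starts; `npu_training_env` is a hypothetical helper, and the launch command in the comment is only an example because the real arguments are not given in this thread.

```python
import os


def npu_training_env(base_env=None):
    """Return a copy of the environment with ENABLE_FORCE_V2_CONTROL set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["ENABLE_FORCE_V2_CONTROL"] = "1"
    return env


env = npu_training_env()
print(env["ENABLE_FORCE_V2_CONTROL"])

# Example launch (arguments omitted because they are not given here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "main_S3DIS.py"], env=env, check=True)
```

Building a copy instead of mutating `os.environ` keeps the launcher's own environment unchanged, which makes runs with and without the flag easier to compare.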

The BS=6 execution failure has been resolved.

颜亚文 changed the task status from Analysing to DONE
吴定远 changed the linked repository from Ascend/modelzoo-his to Ascend/modelzoo
