1. Symptom (with error log context):
After migrating to the NPU, training fails with the error below. The root cause is unclear.
WARNING:tensorflow:From /cache/user-job-dir/code/RandLANet.py:140: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
****EPOCH 0****
Traceback (most recent call last):
File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
EE9999: Inner Error!
[driver interface] halMemAlloc failed: device_id=0, size=33554432, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:757]
[driver interface] halMemAlloc failed: size=33554432, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:792]
DevMemAlloc huge page failed: deviceId=0, type=2, size=33554432, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:867]
Device malloc failed, size=33554432, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:335]
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:26]
Call rtMalloc fail, purpose:Memory for caching., size:33554432, device_id:0[FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:52]
[driver interface] halMemAlloc failed: device_id=0, size=134217728, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:757]
[driver interface] halMemAlloc failed: size=134217728, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:792]
DevMemAlloc huge page failed: deviceId=0, type=2, size=134217728, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:867]
Device malloc failed, size=134217728, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:335]
Call rtMalloc fail, purpose:Memory for caching., size:134217728, device_id:0[FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:52]
FindFreeBlock fail, size:125829632, device_id:0[FUNC:Malloc][FILE:graph_caching_allocator.cc][LINE:151]
malloc memory failed, device_id = 0, size = 125829184[FUNC:Allocate][FILE:npu_memory_allocator.cc][LINE:97]
allocate failed, size = 125829152.[FUNC:Create][FILE:tensor_value.cc][LINE:47]
Param:buffer is nullptr, check invalid[FUNC:AllocateTensor][FILE:task_context.cc][LINE:229]
[Root-Graph] Error:1343225857 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:214]
[Root-Graph] Error occurs while launching tasks. quit from preparing nodes.[FUNC:NodeEnqueue][FILE:subgraph_executor.cc][LINE:229]
failed to execute graph. model_id = 2[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:219]
[[{{node GeOp9_0}}]]
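A minimal sketch for reading the allocation failures in the log above. It assumes the logged size values are byte counts (the log does not state the unit, though 33554432 and 134217728 are exact powers of two). The helper name failed_sizes_mib is hypothetical, and the sample lines are copied from the log, trimmed to the relevant fields.

```python
import re

# Sample lines copied from the log above, trimmed to the relevant fields.
LOG_LINES = [
    "[driver interface] halMemAlloc failed: device_id=0, size=33554432, type=2, env_type=3, drvRetCode=6!",
    "[driver interface] halMemAlloc failed: device_id=0, size=134217728, type=2, env_type=3, drvRetCode=6!",
    "FindFreeBlock fail, size:125829632, device_id:0",
]

# The log uses both "size=" and "size:" so accept either separator.
SIZE_RE = re.compile(r"size\s*[=:]\s*(\d+)")


def failed_sizes_mib(lines):
    """Return each requested size in MiB, assuming the logged values are bytes."""
    sizes = []
    for line in lines:
        match = SIZE_RE.search(line)
        if match:
            sizes.append(int(match.group(1)) / (1024 * 1024))
    return sizes


if __name__ == "__main__":
    for mib in failed_sizes_mib(LOG_LINES):
        print(f"{mib:.2f} MiB")
```

Under that assumption, the failing requests are small (tens to low hundreds of MiB), which would be consistent with device memory already being nearly exhausted rather than with one oversized tensor.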
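2. Software versions:
-- CANN version: 5.0.2
-- TensorFlow version: 1.15
-- Python version: 3.7
-- OS version: Ascend: 1 * Ascend 910; CPU: 24 vCPUs, 96 GiB
3. Steps to reproduce:
Run run_sh_S3DIS.py directly; training fails with the error described above.
4. Log information:
Log and code location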
Logs with the debugger enabled
Extraction code: qwe123
OBS link: obs://randla-net/train_out_npu/MA-new-RandLa-Net-master_npu_for_TensorFlow-10-12-15-14/
Latest console output:
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/code/RandLANet.py:132: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
EPOCH 0
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
E19999: Inner Error!
GetInputIndexByName failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.[FUNC:ParseDependencies][FILE:hybrid_model_builder.cc][LINE:379]
[[{{node GeOp9_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main_S3DIS.py", line 291, in <module>
model.train(dataset)
File "/home/ma-user/modelarts/user-job-dir/code/RandLANet.py", line 191, in train
_, _, summary, l_out, probs, labels, acc = self.sess.run(ops, {self.is_training: True})
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
E19999: Inner Error!
GetInputIndexByName failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.[FUNC:ParseDependencies][FILE:hybrid_model_builder.cc][LINE:379]
[[{{node GeOp9_0}}]]
Backend error log:
[INFO] GE(98,python):2021-10-12-15:29:39.948.610 [node_item.cc:434]737 SetCtrlSend:Node[trans_TransData_3202] will control node[PartitionedCall_132]
[INFO] GE(98,python):2021-10-12-15:29:39.948.635 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/Sum_14_grad/Tile] will control node[optimizer/gradients/layers/mul_9_grad/Mul]
[INFO] GE(98,python):2021-10-12-15:29:39.948.641 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/Sum_14_grad/Tile] will control node[optimizer/gradients/layers/mul_9_grad/Mul_1]
[INFO] GE(98,python):2021-10-12-15:29:39.948.680 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/mul_9_grad/Mul] will control node[optimizer/gradients/AddN_34]
[INFO] GE(98,python):2021-10-12-15:29:39.948.714 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/mul_9_grad/Mul_1] will control node[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose]
[ERROR] GE(98,python):2021-10-12-15:29:39.948.758 [hybrid_model_builder.cc:378]737 ParseDependencies: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD][Get][InputIndex]failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.
[ERROR] GE(98,python):2021-10-12-15:29:39.948.804 [hybrid_model_builder.cc:479]737 ParseDependentInputNodes: ErrorNo: 4294967295(failed) [LOAD][LOAD]
[ERROR] GE(98,python):2021-10-12-15:29:39.948.809 [hybrid_model_builder.cc:264]737 BuildNodeItem: ErrorNo: 4294967295(failed) [LOAD][LOAD][Invoke][ParseDependentInputNodes]failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)].
[ERROR] GE(98,python):2021-10-12-15:29:39.949.110 [hybrid_model_builder.cc:949]737 LoadGraph: ErrorNo: 4294967295(failed) [LOAD][LOAD][Invoke][LoadDynamicSubgraph]Failed to load root graph, model_name_:.
[ERROR] GE(98,python):2021-10-12-15:29:39.949.117 [hybrid_model_builder.cc:169]737 Build: ErrorNo: 4294967295(failed) [LOAD][LOAD][Invoke][LoadGraph] failed, model_name_:[]
[ERROR] GE(98,python):2021-10-12-15:29:39.949.911 [hybrid_model.cc:49]737 Init: ErrorNo: 4294967295(failed) [LOAD][LOAD][Build][HybridModel] failed.
[ERROR] GE(98,python):2021-10-12-15:29:39.949.918 [hybrid_davinci_model.cc:38]737 Init: ErrorNo: 4294967295(failed) [LOAD][LOAD][Init][HybridModel] failed.
[ERROR] GE(98,python):2021-10-12-15:29:39.949.923 [model_manager.cc:304]737 DoLoadHybridModelOnline: ErrorNo: 4294967295(failed) [LOAD][LOAD][Init][HybridModel] failed. model_id = 2
[INFO] GE(98,python):2021-10-12-15:29:39.949.937 [npu_memory_allocator.cc:107]737 Deallocate:To deallocating buffer, addr = 0x108040636c00
[INFO] GE(98,python):2021-10-12-15:29:39.949.941 [npu_memory_allocator.cc:109]737 Deallocate:Deallocating buffer successfully. addr = 0x108040636c00
[INFO] GE(98,python):2021-10-12-15:29:39.949.950 [graph_caching_allocator.cc:159]737 Free:Free device id = 0
[INFO] GE(98,python):2021-10-12-15:29:39.949.955 [graph_caching_allocator.cc:184]737 FreeBlock:Free block size = 1024
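When scanning a long GE log like the one above, it can help to pull out the first ERROR entry, since the later errors here mostly report the same failure propagating up the call chain. The sketch below is illustrative only: the line format is inferred from the excerpt above, and the names GeLogEntry, parse_line and first_error are hypothetical helpers, not part of CANN.

```python
import re
from typing import Iterable, NamedTuple, Optional


class GeLogEntry(NamedTuple):
    level: str
    source_file: str
    source_line: int
    message: str


# Format inferred from the excerpt:
# [LEVEL] GE(pid,proc):timestamp [file.cc:line]tid message
LINE_RE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+\w+\(\d+,\w+\):\S+\s+"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+)\]\d+\s+(?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[GeLogEntry]:
    """Parse one log line; return None if it does not match the inferred format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return GeLogEntry(
        level=match.group("level"),
        source_file=match.group("file"),
        source_line=int(match.group("line")),
        message=match.group("msg"),
    )


def first_error(lines: Iterable[str]) -> Optional[GeLogEntry]:
    """Return the earliest ERROR entry, a heuristic starting point for triage."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "ERROR":
            return entry
    return None


# Two lines copied from the log above.
SAMPLE = [
    "[INFO] GE(98,python):2021-10-12-15:29:39.948.714 [node_item.cc:415]737 SetDataSend:Node[optimizer/gradients/layers/mul_9_grad/Mul_1] will control node[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose]",
    "[ERROR] GE(98,python):2021-10-12-15:29:39.948.758 [hybrid_model_builder.cc:378]737 ParseDependencies: ErrorNo: 1343225860(Internal errors) [LOAD][LOAD][Get][InputIndex]failed, node:[optimizer/gradients/layers/mul_9_grad/Sum_1/ConfusionTranspose(ConfusionTransposeD)] inputname: axes.",
]

if __name__ == "__main__":
    print(first_error(SAMPLE))
```

Applied to the excerpt, this points at the ParseDependencies failure on the ConfusionTransposeD node, the same node named in the console traceback.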
Hello. Judging from the current error, this is a functional failure. We need to collect debug-level logs. The procedure is as follows:
https://support.huaweicloud.com/tfmigr-cann503alpha2training/atlasma_13_0004.html#section4
Check the Debugger option, then run training.
Image Path:ascend-share/5.0.3.alpha005_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1019
Log information
Extraction code: 111111
Try setting the environment variable with export ENABLE_FORCE_V2_CONTROL=1 and rerunning. It converts V1 control operators to V2, which may improve memory reuse.
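If the job is launched from Python rather than from a shell, the same variable can be passed to the training process through its environment. This is a sketch under assumptions: the helper build_training_env is hypothetical, and it assumes the variable is read by the training process at startup, which the thread does not confirm. The script name run_sh_S3DIS.py is taken from the steps to reproduce.

```python
import os
import subprocess
import sys


def build_training_env(base_env=None):
    """Return a copy of the environment with ENABLE_FORCE_V2_CONTROL set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["ENABLE_FORCE_V2_CONTROL"] = "1"
    return env


if __name__ == "__main__":
    env = build_training_env()
    script = "run_sh_S3DIS.py"  # entry script named in the steps to reproduce
    if os.path.exists(script):
        # Launch training in a child process that inherits the modified environment.
        subprocess.run([sys.executable, script], env=env, check=True)
    else:
        print("ENABLE_FORCE_V2_CONTROL =", env["ENABLE_FORCE_V2_CONTROL"])
```

Passing a copied environment keeps the change scoped to the training process instead of altering the parent shell.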
The execution failure with BS=6 has been resolved.