1. Problem description (with error log context):
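Background: the original code used tf.map_fn. To avoid that function, I replaced it with a for loop, and the modified code runs successfully on GPU.
The training inputs are fixed-size images plus a variable number of face boxes per image. I am not sure whether this counts as dynamic shape, but some functions in the training process do produce tensors whose shape is not fixed.
Error in NPU training: on NPU, the job first spends more than ten minutes loading, then runs one step (the log for the first step is printed to the screen), stalls for a while, and then fails with: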
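A variable number of boxes per image does give the box tensor a variable first dimension. One common way to keep shapes static is to pad every image's box list to a fixed upper bound and carry a validity mask. The following is only a minimal sketch of that idea in plain Python. `pad_boxes`, `MAX_BOXES`, and the `(ymin, xmin, ymax, xmax)` box layout are hypothetical names and assumptions, not taken from the code in this issue, and nothing in this thread confirms that padding resolves the error below.

```python
from typing import List, Tuple

# Assumed box layout: (ymin, xmin, ymax, xmax). The real layout may differ.
Box = Tuple[float, float, float, float]


def pad_boxes(boxes: List[Box], max_boxes: int) -> Tuple[List[Box], List[int]]:
    """Pad (or truncate) a box list to exactly max_boxes entries.

    Returns the padded boxes and a mask with 1 for real boxes, 0 for padding.
    """
    kept = list(boxes[:max_boxes])
    num_valid = len(kept)
    num_pad = max_boxes - num_valid
    padded = kept + [(0.0, 0.0, 0.0, 0.0)] * num_pad
    mask = [1] * num_valid + [0] * num_pad
    return padded, mask


MAX_BOXES = 3  # hypothetical upper bound on faces per image

batch = [
    [(0.1, 0.1, 0.4, 0.4)],                        # image with one face
    [(0.2, 0.2, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)],  # image with two faces
]
padded_batch = [pad_boxes(boxes, MAX_BOXES) for boxes in batch]

for boxes, mask in padded_batch:
    print(len(boxes), mask)
```

With padding like this, every image contributes a box list of the same length, and downstream losses can multiply by the mask to ignore the padded rows. In a tf.data input pipeline, `Dataset.padded_batch` serves a similar purpose.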
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
[Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
[Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
[Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
[Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]
[[{{node GeOp31_0}}]]
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
[Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
[Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
[Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
[Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]
[[{{node GeOp31_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/code/train.py", line 72, in <module>
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 594, in after_run
if self._save(run_context.session, global_step):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 619, in _save
if l.after_save(session, step):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 519, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 539, in _evaluate
self._evaluator.evaluate_and_export())
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 920, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval
return _evaluate()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 511, in _evaluate
output_dir=self.eval_dir(name))
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1619, in _evaluate_run
config=self._session_config)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/evaluation.py", line 272, in _evaluate_once
session.run(eval_ops, feed_dict)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
[Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
[Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
[Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
[Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]
[[{{node GeOp31_0}}]]
[CANN-ZhongZhi] train return failed
[CANN-ZhongZhi] after train - list my work files[/home/ma-user/modelarts/workspace/device0]:
total 568
drwxr-xr-x 3 HwHiAiUser HwHiAiUser 4096 Dec 10 13:28 .
drwxr-xr-x 3 HwHiAiUser HwHiAiUser 4096 Dec 10 13:08 ..
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 53651 Dec 10 13:28 check_result.tf.json
-rw-r----- 1 HwHiAiUser HwHiAiUser 19206 Dec 10 13:28 fusion_result.json
drwxr----- 2 HwHiAiUser HwHiAiUser 491520 Dec 10 13:28 kernel_meta
[CANN-ZhongZhi] after train - list my output files[/home/ma-user/modelarts/outputs/train_url_0/]:
total 55212
drwxr-xr-x 3 HwHiAiUser HwHiAiUser 4096 Dec 10 13:28 .
drwxr-xr-x 3 HwHiAiUser HwHiAiUser 4096 Dec 10 13:07 ..
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 124 Dec 10 13:26 checkpoint
drwxr-xr-x 3 HwHiAiUser HwHiAiUser 4096 Dec 10 13:28 device0
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 20486185 Dec 10 13:26 events.out.tfevents.1639112926.ma-job-7d5a3a1f-2ff6-4d0d-b21e-a5b1abb86f29-worker-0
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 9207930 Dec 10 13:10 graph.pbtxt
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 8070488 Dec 10 13:10 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 10684 Dec 10 13:10 model.ckpt-0.index
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 5318983 Dec 10 13:11 model.ckpt-0.meta
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 8070488 Dec 10 13:26 model.ckpt-2.data-00000-of-00001
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 10684 Dec 10 13:26 model.ckpt-2.index
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 5318983 Dec 10 13:26 model.ckpt-2.meta
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 5782 Dec 10 13:08 my_env.log
[CANN-ZhongZhi] finish run train shell
[ModelArts Service Log]2021-12-10 13:28:50,058 - INFO - Begin destroy training processes
[ModelArts Service Log]2021-12-10 13:28:50,059 - INFO - proc-rank-0-device-0 (pid: 115) has exited
[ModelArts Service Log]2021-12-10 13:28:50,059 - INFO - End destroy training processes
[ModelArts Service Log]exiting...
[ModelArts Service Log]exit with 0
[ModelArts Service Log][sidecar] training is completed
[ModelArts Service Log][sidecar] stop outputs_handler_pid = 63 by signal SIGTERM
[ModelArts Service Log][INFO][2021/12/10 13:28:51,026]: output-handler finalizing due to: [training finished]
[ModelArts Service Log][INFO][2021/12/10 13:28:51,026]: output-handler finalized
[ModelArts Service Log][sidecar] stop toolkit_obs_sync_by_channels_pid = 78 by signal SIGTERM
time="2021-12-10T13:28:51+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-10T13:28:51+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
[ModelArts Service Log][sidecar] outputs_handler_pid = 63 ret_code is 0
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 78 ret_code is 0
[ModelArts Service Log][sidecar] upload_metrics_pid = 1512
[ModelArts Service Log][sidecar] exiting at 2021-12-10-13:30:11
[ModelArts Service Log][sidecar] exit with 0
[ModelArts Service Log][sidecar] stop toolkit_obs_upload_pid = 34 by signal SIGTERM
time="2021-12-10T13:30:31+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
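The operator that failed is named in the `Node[...](...)` part of the error lines in the log above. As a small aid for long logs, the sketch below pulls out those entries with a regular expression. `find_failed_nodes` and `NODE_PATTERN` are hypothetical helpers written for this post, and the pattern is an assumption based only on the message format seen in this log, not a documented CANN log format.

```python
import re
from typing import List, Tuple

# Matches e.g. "Node[target_creation/mul](Mul) tiling failed".
# The format is inferred from the log in this issue only.
NODE_PATTERN = re.compile(
    r"Node\[(?P<name>[^\]]+)\]\((?P<op>[^)]+)\)\s+(?P<what>\w+) failed"
)


def find_failed_nodes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return unique (node name, op type, failed stage) tuples in log order."""
    seen: List[Tuple[str, str, str]] = []
    for match in NODE_PATTERN.finditer(log_text):
        item = (match.group("name"), match.group("op"), match.group("what"))
        if item not in seen:
            seen.append(item)
    return seen


sample = (
    "[Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed!"
    "[FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]\n"
)
# The same message is repeated several times in the log; duplicates are dropped.
failed = find_failed_nodes(sample * 3)
print(failed)
```

For this log, the only entry found is the `Mul` node `target_creation/mul`, which points at a multiplication inside the `target_creation` scope as the place to inspect input shapes.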
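2. Software version:
-- CANN version: ascend-share/5.0.4.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1202
3. Test steps:
Run the training job through the ModelArts plugin in PyCharm.
4. Log information:
Code and logs
Extraction code:
123456
*Valid until: 2022/06/08 14:51:50 GMT+08:00
Is there a good way to resolve this error?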
The error no longer occurs with the new version, so this issue is closed.