[ZhongZhi-Nanjing University] [FaceBoxes] NPU + TensorFlow training job fails: tensorflow.python.framework.errors_impl.InternalError: GeOp13_0GEOP::::DoRunAsync Failed

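1. Problem description (with error log context):
Background: the original code used tf.map_fn. To avoid that function, I reimplemented it with a for loop, and the modified code runs successfully on GPU.
The training inputs are fixed-size images plus a variable number of face boxes per image. I am not sure whether this counts as dynamic shape, but some functions do produce tensors of non-fixed shape during training.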
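To make the "variable number of face boxes" point concrete: one common way to keep tensor shapes static is to pad each image's boxes to a fixed count and carry a mask that marks the real ones. The sketch below is framework-agnostic plain Python. The name `pad_boxes`, the `(ymin, xmin, ymax, xmax)` layout, and the `max_boxes` limit are assumptions made for illustration and are not taken from the FaceBoxes code. It has not been confirmed that padding resolves the error reported here.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for illustration: (ymin, xmin, ymax, xmax).
Box = Tuple[float, float, float, float]


def pad_boxes(boxes: Sequence[Box], max_boxes: int) -> Tuple[List[Box], List[int]]:
    """Pad or truncate one image's boxes to exactly `max_boxes` entries.

    Returns the fixed-length box list and a 0/1 mask where 1 marks a real box.
    """
    kept = list(boxes[:max_boxes])           # drop boxes beyond the limit
    num_pad = max_boxes - len(kept)          # how many dummy boxes to append
    padded = kept + [(0.0, 0.0, 0.0, 0.0)] * num_pad
    mask = [1] * len(kept) + [0] * num_pad
    return padded, mask


# Usage: two images with different box counts end up with the same length.
boxes_a, mask_a = pad_boxes([(0.1, 0.2, 0.3, 0.4)], max_boxes=3)
boxes_b, mask_b = pad_boxes([(0.0, 0.0, 0.5, 0.5), (0.2, 0.2, 0.9, 0.9)], max_boxes=3)
print(len(boxes_a), mask_a)
print(len(boxes_b), mask_b)
```

With the mask, downstream loss code can ignore the dummy boxes instead of slicing tensors to a data-dependent length. In a TensorFlow 1.15 input pipeline, `tf.data.Dataset.padded_batch` serves a similar purpose at the batching stage.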

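Error on NPU: when training on NPU, the job first spends more than ten minutes loading, then runs one step (the log for the first step is printed to the screen), stalls for a while, and then fails with: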
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
[Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
[Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
[Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
[Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

 [[{{node GeOp31_0}}]]

Traceback (most recent call last):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is : 
EZ9999: Inner Error!
        [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
        [Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        [Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
        [Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
        [Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

	 [[{{node GeOp31_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/code/train.py", line 72, in <module>
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 594, in after_run
    if self._save(run_context.session, global_step):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 619, in _save
    if l.after_save(session, step):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 519, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 539, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 920, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
    name=name)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval
    return _evaluate()
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 511, in _evaluate
    output_dir=self.eval_dir(name))
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1619, in _evaluate_run
    config=self._session_config)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/evaluation.py", line 272, in _evaluate_once
    session.run(eval_ops, feed_dict)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is : 
EZ9999: Inner Error!
        [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
        [Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        [Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
        [Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
        [Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

	 [[{{node GeOp31_0}}]]
[CANN-ZhongZhi] train return failed
[CANN-ZhongZhi] after train - list my work files[/home/ma-user/modelarts/workspace/device0]:
total 568
drwxr-xr-x 3 HwHiAiUser HwHiAiUser   4096 Dec 10 13:28 .
drwxr-xr-x 3 HwHiAiUser HwHiAiUser   4096 Dec 10 13:08 ..
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  53651 Dec 10 13:28 check_result.tf.json
-rw-r----- 1 HwHiAiUser HwHiAiUser  19206 Dec 10 13:28 fusion_result.json
drwxr----- 2 HwHiAiUser HwHiAiUser 491520 Dec 10 13:28 kernel_meta

[CANN-ZhongZhi] after train - list my output files[/home/ma-user/modelarts/outputs/train_url_0/]:
total 55212
drwxr-xr-x 3 HwHiAiUser HwHiAiUser     4096 Dec 10 13:28 .
drwxr-xr-x 3 HwHiAiUser HwHiAiUser     4096 Dec 10 13:07 ..
-rw-r--r-- 1 HwHiAiUser HwHiAiUser      124 Dec 10 13:26 checkpoint
drwxr-xr-x 3 HwHiAiUser HwHiAiUser     4096 Dec 10 13:28 device0
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 20486185 Dec 10 13:26 events.out.tfevents.1639112926.ma-job-7d5a3a1f-2ff6-4d0d-b21e-a5b1abb86f29-worker-0
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  9207930 Dec 10 13:10 graph.pbtxt
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  8070488 Dec 10 13:10 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 HwHiAiUser HwHiAiUser    10684 Dec 10 13:10 model.ckpt-0.index
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  5318983 Dec 10 13:11 model.ckpt-0.meta
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  8070488 Dec 10 13:26 model.ckpt-2.data-00000-of-00001
-rw-r--r-- 1 HwHiAiUser HwHiAiUser    10684 Dec 10 13:26 model.ckpt-2.index
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  5318983 Dec 10 13:26 model.ckpt-2.meta
-rw-r--r-- 1 HwHiAiUser HwHiAiUser     5782 Dec 10 13:08 my_env.log

[CANN-ZhongZhi] finish run train shell
[ModelArts Service Log]2021-12-10 13:28:50,058 - INFO - Begin destroy training processes
[ModelArts Service Log]2021-12-10 13:28:50,059 - INFO - proc-rank-0-device-0 (pid: 115) has exited
[ModelArts Service Log]2021-12-10 13:28:50,059 - INFO - End destroy training processes
[ModelArts Service Log]exiting...
[ModelArts Service Log]exit with 0
[ModelArts Service Log][sidecar] training is completed
[ModelArts Service Log][sidecar] stop outputs_handler_pid = 63 by signal SIGTERM
[ModelArts Service Log][INFO][2021/12/10 13:28:51,026]: output-handler finalizing due to: [training finished]
[ModelArts Service Log][INFO][2021/12/10 13:28:51,026]: output-handler finalized
[ModelArts Service Log][sidecar] stop toolkit_obs_sync_by_channels_pid = 78 by signal SIGTERM
time="2021-12-10T13:28:51+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-10T13:28:51+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
[ModelArts Service Log][sidecar] outputs_handler_pid = 63 ret_code is 0
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 78 ret_code is 0
[ModelArts Service Log][sidecar] upload_metrics_pid = 1512
[ModelArts Service Log][sidecar] exiting at 2021-12-10-13:30:11
[ModelArts Service Log][sidecar] exit with 0
[ModelArts Service Log][sidecar] stop toolkit_obs_upload_pid = 34 by signal SIGTERM
time="2021-12-10T13:30:31+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
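
When scanning a long training log like the one above, the operator named in the `tiling failed` line is the most useful piece of information for triage. The following is a minimal sketch for pulling it out of a saved log. The helper `find_tiling_failures` and its regular expression are hypothetical, written for this post, and only match the `Node[<name>](<Op>) tiling failed` form that appears in the traceback above.

```python
import re

# Matches e.g. "Node[target_creation/mul](Mul) tiling failed".
TILING_RE = re.compile(r"Node\[(?P<name>[^\]]+)\]\((?P<op>[^)]+)\)\s+tiling failed")


def find_tiling_failures(log_text):
    """Return unique (node_name, op_type) pairs for 'tiling failed' lines, in order."""
    seen = []
    for match in TILING_RE.finditer(log_text):
        pair = (match.group("name"), match.group("op"))
        if pair not in seen:  # the same error is often printed several times
            seen.append(pair)
    return seen


sample = (
    "EZ9999: Inner Error!\n"
    "        [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed!"
    "[FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]\n"
)
print(find_tiling_failures(sample))  # [('target_creation/mul', 'Mul')]
```

Here the result points at the `Mul` operator under the `target_creation` scope, which is the part of the graph that handles the per-image face boxes.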

2. Software version:
-- CANN version: ascend-share/5.0.4.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1202

3. Test steps:
Run the training job using the ModelArts plugin in PyCharm.

4. Log information:

Code and logs

URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=uBkmRmRznP5F3QkyOs67o8NfWi7yA58V3Wr2x/OQFu0fKMA74TI0tzzjHFbAanCs90HupbwZu/XcrEBu4khSLI7pTwxnpiDvuHzZnKiWvDNbCfeipENFdB16SukE2mrKGJt3ZK+82+JkgitwFf74KbImvclQyLBxMusBm2aU6udMqJbGzKz/fc7MoY3+fCKJNMnk7u3+huqwUxl/etMraWoOq7tL8QLHuBpvKz34WtdZ3DQVimsxrHoay9X3HJJJDmHnBGl04kS7iR6w4nldzcliWp/02OoziLrseDEWoOG0UA9QEkeNmGXa9kcrC/IjKpAWLV/PP3FPJCJ5ayLDPvx9W2v1hntl0zo3DOGBgjY7SyEccGkqN4Tq35pQEn+VWyWlt2RNfMsKZpPMCcaL2srPJ5vYWUnMccjxGQC2uTYBVpHZVikaCLvwrwdPlOqwO0CaEjEswE9o5V4HNE3Oiqc+198D7ryqpZkXMNuZnT2q5dT90FlSLca9vfiK0Y5Hv0n3XpejyRV9lpCpOKRs0FnJDwYEosXPfHUVqSAbDZQbSzlqHXewuQ64eEQwHa/9rR/GqwSySzvRwubQmFdvDXJiNTGgivuYOucoqR/qnxPmdoEelhK7ZWZy0Jo4xF3f0Pk4TCLflKbugarDJWfpEQ==

Extraction code:
123456

*Valid until: 2022/06/08 14:51:50 GMT+08:00

Is there a good way to resolve this error?

Dataset link
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=uBkmRmRznP5F3QkyOs67o8NfWi7yA58V3Wr2x/OQFu0fKMA74TI0tzzjHFbAanCs90HupbwZu/XcrEBu4khSLI7pTwxnpiDvuHzZnKiWvDNbCfeipENFdB16SukE2mrKGJt3ZK+82+JkgitwFf74KbImvclQyLBxMusBm2aU6udMqJbGzKz/fc7MoY3+fCKJNMnk7u3+huqwUxl/etMraR9Qz2fKuCSPYfsQzgBJ5S/Cnz4udnH8rtkD8vqF9EyBiH+BOPdAOINylgrk7IU6TN3GcVSA/t+KcwLiSm9I6sQYIWfcXL6f4skT/baoSLJdeySbzajIxcxQBJusqc4JbGNop7hNXAuW+eCAldN3S1qWjGi1nZnURZoObHd7/o4kFZ+dPsI1b2rIwvKaXO6Uqh70EdVLgmieZdbjRFKrODd7+5a3AQYgYjrZ0nltwrGQDK+tSEWz6OT05OMb4+JrBfFI1d/yoABgk7PedYSupuzd7YRKPzNYQihaNRLkQvXmYh1JapNdippgkEhIeMbF3qP0NljNRR1g/0HjN3isptg=

Extraction code:
123456

*Valid until: 2022/01/14 10:26:36 GMT+08:00

The error no longer occurs with the new version; closing the issue.
