[众智-南大] [FaceBoxes] NPU + TensorFlow training job fails: tensorflow.python.framework.errors_impl.InternalError: GeOp13_0GEOP::::DoRunAsync Failed

DONE
Bug-Report
Created on 2021-12-10 14:52

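1. Symptoms (with error log context):
Background: the original code used tf.map_fn. To avoid that function, I replaced it with a for loop, and this version runs successfully on GPU.
The training input is a fixed-size image plus a variable number of face boxes per image. I am not sure whether this counts as dynamic shape; during training, some functions produce tensors whose shapes are not fixed.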
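
To illustrate the dynamic-shape question above: a per-image list of face boxes has a different length for each image, so any tensor built directly from it changes shape between steps. One common workaround, shown here only as a sketch and not as the fix confirmed in this issue, is to pad each image's boxes to a fixed maximum count and carry a mask that marks the real entries. The names `pad_boxes` and `MAX_BOXES`, the value 4, and the box layout are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Tuple

# Assumed box layout for illustration: (ymin, xmin, ymax, xmax).
Box = Tuple[float, float, float, float]

MAX_BOXES = 4  # hypothetical upper bound on faces per image


def pad_boxes(boxes: List[Box], max_boxes: int = MAX_BOXES) -> Tuple[List[Box], List[int]]:
    """Pad (or truncate) a variable-length box list to a fixed length.

    Returns the padded boxes and a 0/1 mask marking which rows are real.
    """
    kept = boxes[:max_boxes]
    num_pad = max_boxes - len(kept)
    padded = kept + [(0.0, 0.0, 0.0, 0.0)] * num_pad
    mask = [1] * len(kept) + [0] * num_pad
    return padded, mask


# Two images with different numbers of faces now yield same-shaped outputs.
image_a = [(0.1, 0.1, 0.3, 0.3)]
image_b = [(0.2, 0.2, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9), (0.0, 0.0, 0.1, 0.1)]

padded_a, mask_a = pad_boxes(image_a)
padded_b, mask_b = pad_boxes(image_b)
```

With this approach, downstream computations would need to multiply by the mask (or otherwise ignore padded rows) so that the zero boxes do not contribute to matching or to the loss.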

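NPU training error: when training on the NPU, the job first spends more than ten minutes loading, then runs one step (the log for the first step is printed to the screen), then stalls for a while and finally fails with: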
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
[Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
[Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
[Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
[Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

 [[{{node GeOp31_0}}]]
Traceback (most recent call last):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is : 
EZ9999: Inner Error!
        [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
        [Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        [Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
        [Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
        [Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

	 [[{{node GeOp31_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/code/train.py", line 72, in <module>
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 594, in after_run
    if self._save(run_context.session, global_step):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 619, in _save
    if l.after_save(session, step):
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 519, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 539, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 920, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
    name=name)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval
    return _evaluate()
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 511, in _evaluate
    output_dir=self.eval_dir(name))
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1619, in _evaluate_run
    config=self._session_config)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/evaluation.py", line 272, in _evaluate_once
    session.run(eval_ops, feed_dict)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp31_0GEOP::::DoRunAsync Failed
Error Message is : 
EZ9999: Inner Error!
        [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]
        [Root-Graph] Error:-1 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        [Root-Graph] error occurs when push to queue.[FUNC:NodeScheduled][FILE:subgraph_executor.cc][LINE:336]
        [Root-Graph] error occurs when push to queue.[FUNC:operator()][FILE:subgraph_executor.cc][LINE:315]
        [Root-Graph] Error:1343225860 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
        failed to execute graph. model_id = 9[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

	 [[{{node GeOp31_0}}]]
[CANN-ZhongZhi] train return failed
[CANN-ZhongZhi] after train - list my work files[/home/ma-user/modelarts/workspace/device0]:
total 568
drwxr-xr-x 3 HwHiAiUser HwHiAiUser   4096 Dec 10 13:28 .
drwxr-xr-x 3 HwHiAiUser HwHiAiUser   4096 Dec 10 13:08 ..
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  53651 Dec 10 13:28 check_result.tf.json
-rw-r----- 1 HwHiAiUser HwHiAiUser  19206 Dec 10 13:28 fusion_result.json
drwxr----- 2 HwHiAiUser HwHiAiUser 491520 Dec 10 13:28 kernel_meta

[CANN-ZhongZhi] after train - list my output files[/home/ma-user/modelarts/outputs/train_url_0/]:
total 55212
drwxr-xr-x 3 HwHiAiUser HwHiAiUser     4096 Dec 10 13:28 .
drwxr-xr-x 3 HwHiAiUser HwHiAiUser     4096 Dec 10 13:07 ..
-rw-r--r-- 1 HwHiAiUser HwHiAiUser      124 Dec 10 13:26 checkpoint
drwxr-xr-x 3 HwHiAiUser HwHiAiUser     4096 Dec 10 13:28 device0
-rw-r--r-- 1 HwHiAiUser HwHiAiUser 20486185 Dec 10 13:26 events.out.tfevents.1639112926.ma-job-7d5a3a1f-2ff6-4d0d-b21e-a5b1abb86f29-worker-0
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  9207930 Dec 10 13:10 graph.pbtxt
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  8070488 Dec 10 13:10 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 HwHiAiUser HwHiAiUser    10684 Dec 10 13:10 model.ckpt-0.index
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  5318983 Dec 10 13:11 model.ckpt-0.meta
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  8070488 Dec 10 13:26 model.ckpt-2.data-00000-of-00001
-rw-r--r-- 1 HwHiAiUser HwHiAiUser    10684 Dec 10 13:26 model.ckpt-2.index
-rw-r--r-- 1 HwHiAiUser HwHiAiUser  5318983 Dec 10 13:26 model.ckpt-2.meta
-rw-r--r-- 1 HwHiAiUser HwHiAiUser     5782 Dec 10 13:08 my_env.log

[CANN-ZhongZhi] finish run train shell
[ModelArts Service Log]2021-12-10 13:28:50,058 - INFO - Begin destroy training processes
[ModelArts Service Log]2021-12-10 13:28:50,059 - INFO - proc-rank-0-device-0 (pid: 115) has exited
[ModelArts Service Log]2021-12-10 13:28:50,059 - INFO - End destroy training processes
[ModelArts Service Log]exiting...
[ModelArts Service Log]exit with 0
[ModelArts Service Log][sidecar] training is completed
[ModelArts Service Log][sidecar] stop outputs_handler_pid = 63 by signal SIGTERM
[ModelArts Service Log][INFO][2021/12/10 13:28:51,026]: output-handler finalizing due to: [training finished]
[ModelArts Service Log][INFO][2021/12/10 13:28:51,026]: output-handler finalized
[ModelArts Service Log][sidecar] stop toolkit_obs_sync_by_channels_pid = 78 by signal SIGTERM
time="2021-12-10T13:28:51+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-10T13:28:51+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
[ModelArts Service Log][sidecar] outputs_handler_pid = 63 ret_code is 0
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 78 ret_code is 0
[ModelArts Service Log][sidecar] upload_metrics_pid = 1512
[ModelArts Service Log][sidecar] exiting at 2021-12-10-13:30:11
[ModelArts Service Log][sidecar] exit with 0
[ModelArts Service Log][sidecar] stop toolkit_obs_upload_pid = 34 by signal SIGTERM
time="2021-12-10T13:30:31+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
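
The key line in the log above names the operator whose tiling failed: `target_creation/mul`, of type `Mul`. When scanning long ModelArts logs, a small helper along the following lines can pull that information out. `find_failed_node` and `NODE_PATTERN` are hypothetical names, and the regular expression assumes the exact line format seen in this log, which may differ in other CANN versions.

```python
import re
from typing import Optional, Tuple

# Matches lines such as:
#   [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed![FUNC:...]
# The format is taken from the log in this issue and is an assumption elsewhere.
NODE_PATTERN = re.compile(r"Node\[(?P<name>[^\]]+)\]\((?P<op>[^)]+)\) tiling failed")


def find_failed_node(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (node_name, op_type) for the first tiling failure, or None."""
    match = NODE_PATTERN.search(log_text)
    if match is None:
        return None
    return match.group("name"), match.group("op")


sample = (
    "EZ9999: Inner Error!\n"
    "        [Update][Tilingdata] Node[target_creation/mul](Mul) tiling failed!"
    "[FUNC:UpdateTilingData][FILE:aicore_node_executor.cc][LINE:198]\n"
)
failed = find_failed_node(sample)
```

The node name corresponds to the TensorFlow name scope, so in this case it points at a multiplication inside the `target_creation` scope of the model code.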

2. Software version:
-- CANN version: ascend-share/5.0.4.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1202

3. Test steps:
Run the training job with the ModelArts plugin in PyCharm.

4. Log information:

Code and logs

URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=uBkmRmRznP5F3QkyOs67o8NfWi7yA58V3Wr2x/OQFu0fKMA74TI0tzzjHFbAanCs90HupbwZu/XcrEBu4khSLI7pTwxnpiDvuHzZnKiWvDNbCfeipENFdB16SukE2mrKGJt3ZK+82+JkgitwFf74KbImvclQyLBxMusBm2aU6udMqJbGzKz/fc7MoY3+fCKJNMnk7u3+huqwUxl/etMraWoOq7tL8QLHuBpvKz34WtdZ3DQVimsxrHoay9X3HJJJDmHnBGl04kS7iR6w4nldzcliWp/02OoziLrseDEWoOG0UA9QEkeNmGXa9kcrC/IjKpAWLV/PP3FPJCJ5ayLDPvx9W2v1hntl0zo3DOGBgjY7SyEccGkqN4Tq35pQEn+VWyWlt2RNfMsKZpPMCcaL2srPJ5vYWUnMccjxGQC2uTYBVpHZVikaCLvwrwdPlOqwO0CaEjEswE9o5V4HNE3Oiqc+198D7ryqpZkXMNuZnT2q5dT90FlSLca9vfiK0Y5Hv0n3XpejyRV9lpCpOKRs0FnJDwYEosXPfHUVqSAbDZQbSzlqHXewuQ64eEQwHa/9rR/GqwSySzvRwubQmFdvDXJiNTGgivuYOucoqR/qnxPmdoEelhK7ZWZy0Jo4xF3f0Pk4TCLflKbugarDJWfpEQ==

Access code:
123456

*Valid until: 2022/06/08 14:51:50 GMT+08:00

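Comments (3)

Bebad created the Bug-Report
zhujianpeng set the assignee to 张晓龙
zhujianpeng changed the task status from TODO to Analysing

Is there a good way to resolve this error?

张晓龙 changed the assignee from 张晓龙 to unassigned
张晓龙 added 张晓龙 as a collaborator
张晓龙 set the assignee to 宋保强

The error no longer occurs in the new version, so this issue is closed.

张晓龙 changed the task status from Analysing to DONE
吴定远 changed the linked repository from Ascend/modelzoo-his to Ascend/modelzoo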
