Ascend / modelzoo

[Zhongzhi] [Hangzhou Dianzi University] [ID2048] [TFT] Training runs normally on CPU/GPU; after migrating to NPU training on ModelArts, it fails with GeOp343_0GEOP::::DoRunAsync Failed

DONE
Bug-Report
Created on 2021-11-06 20:44

1. Problem description (with error log context):
Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py", line 251, in <module>
use_testing_mode=False) # Change to false to use original default params
File "/home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py", line 142, in main
model.fit()
File "/home/ma-user/modelarts/user-job-dir/code/libs/tft_model.py", line 1164, in fit
workers=self.n_multiprocessing_workers)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
use_multiprocessing=use_multiprocessing)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 675, in fit
steps_name='steps_per_epoch')
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
batch_outs = f(ins_batch)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
run_metadata=self.run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: GeOp343_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
EZ9999 The error from device(4), serial number is 4, there is an aicore error, core id is 27, error code = 0x800000, dump info: pc start: 0x10001080436f0000, current: 0x1080436f02a8, vec error info: 0x1ec5dd71, mte error info: 0x60000a6, ifu error info: 0x3630efb35e600, ccu error info: 0x0, cube error info: 0xc0, biu error info: 0x0, aic error mask: 0x65000200d000288, para base: 0x1080432335d0.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:364]
The device(4), core list[0-0], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:414]
coreId( 0): 0x800000 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:425]
Aicore kernel execute failed, device_id=0, stream_id=1311, report_stream_id=1322, task_id=18, flip_num=0, fault kernel_name=0_611_training/Adam/gradients/clip_by_norm_23/ArithmeticOptimizer/ReplaceMulWithSquare_mul/SquareSumV1/SquareSumV1374, func_name=te_squaresumv1_0c08fd29d007052ed20f4cf2a4d06f1c92602fb3de2e36f93242bd95f9777be5_1__kernel0, program id=3356, hash=18004332741044157744[FUNC:GetError][FILE:stream.cc][LINE:737]
rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
invoke rtStreamSynchronize failed, ret = 507011[FUNC:Synchronize][FILE:hybrid_execution_context.cc][LINE:91]
failed to execute graph. model_id = 62[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:221]

 [[{{node GeOp343_0}}]]

[ModelArts Service Log]2022-03-07 15:44:26,330 - ERROR - proc-rank-0-device-0 (pid: 114) has exited with non-zero code: 1
[ModelArts Service Log]2022-03-07 15:44:26,330 - INFO - Begin destroy training processes
[ModelArts Service Log]2022-03-07 15:44:26,330 - INFO - proc-rank-0-device-0 (pid: 114) has exited
[ModelArts Service Log]2022-03-07 15:44:26,331 - INFO - End destroy training processes
time="2022-03-07T15:44:26+08:00" level=info msg="start and wait python command is exit with 1" file="controller.go:181" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
time="2022-03-07T15:44:26+08:00" level=info msg="run-with-backoff exit with 1" file="controller.go:159" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
[2022-03-07T15:44:26+08:00][ModelArts Service Log]exiting...
[2022-03-07T15:44:26+08:00][ModelArts Service Log]exit with 1
[2022-03-07T15:44:27+08:00][ModelArts Service Log][sidecar] training is completed
time="2022-03-07T15:44:27+08:00" level=warning msg="the "log-preview-size" parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:192" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
[2022-03-07T15:44:27+08:00][ModelArts Service Log][sidecar] stop toolkit_obs_upload_by_channels_pid = 39 by signal SIGTERM
time="2022-03-07T15:44:27+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=train_url
time="2022-03-07T15:44:27+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
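In a log like the one above, the most useful line for triage is the "Aicore kernel execute failed" entry, which names the failing kernel (here a SquareSumV1 kernel under clip_by_norm_23 in the Adam gradients). The following is a minimal sketch for pulling that name out of a saved log. The helper name `find_fault_kernel` and the regular expression are illustrative, not part of any Ascend tool, and they assume the `fault kernel_name=..., func_name=...` layout seen in this log.

```python
import re

# Matches the runtime line that names the failing AI Core kernel.
# Assumes the "fault kernel_name=..., func_name=..." layout seen in this log.
FAULT_RE = re.compile(
    r"fault kernel_name=(?P<kernel_name>[^,]+), func_name=(?P<func_name>[^,\[]+)"
)


def find_fault_kernel(log_text):
    """Return (kernel_name, func_name) for the first match, or None."""
    match = FAULT_RE.search(log_text)
    if match is None:
        return None
    return match.group("kernel_name"), match.group("func_name")


# Line copied from the log in this issue.
SAMPLE = (
    "Aicore kernel execute failed, device_id=0, stream_id=1311, "
    "report_stream_id=1322, task_id=18, flip_num=0, fault kernel_name="
    "0_611_training/Adam/gradients/clip_by_norm_23/ArithmeticOptimizer/"
    "ReplaceMulWithSquare_mul/SquareSumV1/SquareSumV1374, func_name="
    "te_squaresumv1_0c08fd29d007052ed20f4cf2a4d06f1c92602fb3de2e36f93242bd95f9777be5"
    "_1__kernel0, program id=3356, hash=18004332741044157744"
    "[FUNC:GetError][FILE:stream.cc][LINE:737]"
)

kernel_name, func_name = find_fault_kernel(SAMPLE)
print(kernel_name)
print(func_name)
```

The kernel name maps back to a node in the TensorFlow graph, which helps narrow down which part of the model triggers the AI Core error.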

2. Software versions:
ascend-share/5.1.rc1.alpha001_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_0215
CANN 5.0.2
TensorFlow 1.15
Python 3.7
Operating system version: ascend-share/5.1.rc1.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_0303

3. Test steps:
Source code:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=1eJrvePKx7DYA6XCRyr0VyX8DdhakYwODtUvhF6PROpVhQwlK9rA9TMHlYdqhXQrvhBMFSysNuPd6OFX0GfrOQcIjzKpc4TkYKsUrB6ci0FZeZStZ8qD5pZSwuOJCj9mOCdsAiuag89eGk4s25yv8F66X7+sPlq7mEryewKOikZCdM75rRdNpGkg2cdfAGXO99I+QuTAS8AcLpro7aEV6Pd5N8YkVThxrIjBlF0MSAM5VSfY7C304Lj7dd3EkMrR8yEwCi8QATDer4OmH50dcm8fEPl944wzBQ0QFz3jHcJ1wPy4RbwU2ZRknQCCA21h6SqFfB31SKdL6jN2vPp3x5cir4iyeZpKx+Zdm0UR/m2bcCu50sIrVELXbN0qmrs0PXNhuNjJlVgZJK3qAws/Nq1+Ac+0798tSIcerCo4MSlv+d/JBebdUx4T1cKOCH8miLUsCJE8sgSBqbR1eT9bnSFeLNGIJRKUvP7jec0MVug6CL2oHh87YmG3MVF6Rbm4R+zBJbK8G/g3PF5T4Ge+bpjIFg2kfiWpE8oH5BcQxtmc75x+9adbjtle91KqBE4KFp1OoyaF9fAz5JMPrAiF6mh6rGQoMoZoIzsZ9Ox+h9efkbRvjqfy1euqJ4LsKyUarMMO68BF9qr607aFoX7bb9DNb9Qi3EUSGxvKy7ShHas=

Extraction code:
123456

*Valid until: 2023/03/02 15:49:51 GMT+08:00

Script to run: code/script_train_fixed_params.py
Command: python -m script_train_fixed_params (parameters are already set)
Dataset: outputs/data/volatility/

4. Log information:

Logs provided via:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=1eJrvePKx7DYA6XCRyr0VyX8DdhakYwODtUvhF6PROpVhQwlK9rA9TMHlYdqhXQrvhBMFSysNuPd6OFX0GfrOQcIjzKpc4TkYKsUrB6ci0FZeZStZ8qD5pZSwuOJCj9mOCdsAiuag89eGk4s25yv8F66X7+sPlq7mEryewKOikZCdM75rRdNpGkg2cdfAGXO99I+QuTAS8AcLpro7aEV6Pd5N8YkVThxrIjBlF0MSAMOtwj3y6AkuN3dt9jpDd9J2eiZIrSOABrqXJeWfaaCSlxRlldFH0ttqNvhQdZlHbeCRpNwuXg3oTT08iNZC8u6EGMuzjHbrPe3iovUOS0NfssRHTdIvl3yFx7STxgU82kXp30G9iikWzcc2IAO27r9liB9NHnyfV85SO+woFNRCrC+Xemsnh3Q4dRMFEfmRu73MUSe7fWuIihF2eVGUTgsrVyq+MspawOSecfLUOIlMbZ0ExpAIadyzJhhjISrbDCV8nuw1KzzXDZ1g+5kou7ndIBz2owXDavMjYdiMHqE1L4K016aS+nGLqYu9rw1ULnzp1NjHLloV7lh1rBe6nEDl55BAWNKORgzY8dcxcgJEC1w6e8UDCwHGrQXC5CklZZA1mcL5DkwV77n+QQciX2oCXHdm006xV6sULPWPqwRhA==

Extraction code:
123456

*Valid until: 2023/03/02 15:50:25 GMT+08:00

Comments (25)

简 ++ created the Bug-Report
简 ++ edited the description
简 ++ edited the description
简 ++ edited the description
zhujianpeng set the assignee to 张晓龙
zhujianpeng changed the task status from TODO to Analysing

Hello, we have received the issue and will analyze it as soon as possible.

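Hello, judging from the current error, this is a functional failure, so we need the graph. Please collect the debug logs and the graph information as follows:
https://support.huaweicloud.com/tfmigr-cann503alpha2training/atlasma_13_0004.html#section4
Select the Debugger tab, then run the training.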
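Outside the ModelArts Debugger tab, debug logs and graph dumps on CANN are commonly switched on through environment variables. The sketch below is an assumption, not something stated in this thread: the variable names `ASCEND_GLOBAL_LOG_LEVEL`, `ASCEND_SLOG_PRINT_TO_STDOUT` and `DUMP_GE_GRAPH` and their values are recalled from the CANN documentation and should be checked against the version in use. The helper `enable_debug_env` is hypothetical.

```python
import os

# Assumed CANN environment variables (not mentioned in this thread);
# verify names and values against the documentation for your CANN version.
DEBUG_ENV = {
    "ASCEND_GLOBAL_LOG_LEVEL": "0",      # assumed: 0 selects debug-level logs
    "ASCEND_SLOG_PRINT_TO_STDOUT": "1",  # assumed: 1 prints logs to stdout
    "DUMP_GE_GRAPH": "2",                # assumed: 2 dumps graphs without weight data
}


def enable_debug_env(env=os.environ):
    """Copy the debug settings into `env` (a dict-like mapping) and return it."""
    for name, value in DEBUG_ENV.items():
        env[name] = value
    return env


# Dry run on a plain dict; pass nothing to modify the real process environment.
print(enable_debug_env({}))
```

If used, the variables would have to be set before the TensorFlow session is created, for example at the top of the training script.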

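The current error is that we do not support CudnnRNNV2, which is a GPU-only operator.
Hello, please look at the following operator and check whether it can replace the one you are using.
https://support.huaweicloud.com/tfmigr-cann503alpha2training/atlastfadapi_07_0036.html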

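Hello, I noticed something. When I run the unmigrated code locally and deliberately terminate the model while it is still running, the error on the console is exactly the same as the error I get when running the migrated code on the NPU. I wonder whether the cause is that the NPU interrupts my model while it is still running.
Error message from the NPU run
Error message from the local run that I deliberately terminated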

Sorry, I mixed up two issues. The problem with this network is not the CudnnRNNV2 operator.

张晓龙 added collaborator 张晓龙
张晓龙 changed the assignee from 张晓龙 to 梁文杰
张晓龙 removed collaborator 张晓龙
张晓龙 added collaborator 张晓龙

The transpose operator failure is caused by an infershape failure in the upstream TensorArray operator. Please verify with the following version.

ascend-share/5.0.3.alpha005_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-2.0.12_1116

OK, I will give it a try.

I ran it with the new version and got the same error.

简 ++ edited the description
简 ++ edited the title
简 ++ edited the description
简 ++ edited the description

Hello, the aicpu module has changed a lot recently. Please test again with the latest version.

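Hello, I ran it with the latest version and the error code is the same. I have updated the links to both my code and my logs. Could you please take a look?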

Hello, I have now run it with the 3.3 image and updated the error message as well. Could you please take a look?

The download failed. Could you please check? Thanks.

Logs: obs://cann-id2048/output-tft/MA-new-tft-03-10-10-24/log/
Source code: obs://cann-id2048/output-tft/MA-new-tft-03-10-10-24/code/
(image version from March 3)

简 ++ edited the description
简 ++ edited the description
简 ++ edited the title
李想 edited the title

https://support.huaweicloud.com/developmenttg-cann503alpha1training/atlasaicerrtrain_16_0004.html

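@简 ++ So far we can see an AIC ERROR. Please follow the document above to produce the dump data. The main change is adding the following two switches:
custom_op.parameter_map["enable_exception_dump"].i = 1 # Dump the inputs and outputs of the operator that hit the AI Core error. The dump files are written to the directory the script is run from. Dynamic-shape operators cannot be dumped.
custom_op.parameter_map["op_debug_level"].i = 2 # Enable operator debugging.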
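The two switches only take effect once `custom_op` is the NPU optimizer entry of the session config. A minimal sketch of applying them is shown below. The helper `apply_debug_options` and the stub `make_stub_custom_op` are hypothetical names introduced here so that the snippet runs without TensorFlow or an Ascend device. The commented lines describe how `custom_op` is usually obtained in TF 1.15 migration scripts, which is an assumption about the author's code and is not shown in this thread.

```python
from collections import defaultdict
from types import SimpleNamespace

# The two options quoted in the comment above, with the suggested values.
AICORE_DEBUG_OPTIONS = {
    "enable_exception_dump": 1,  # dump inputs/outputs of the failing operator
    "op_debug_level": 2,         # enable operator debugging
}


def apply_debug_options(custom_op, options=None):
    """Set integer options on an NpuOptimizer-style custom_op and return it."""
    for name, value in (options or AICORE_DEBUG_OPTIONS).items():
        custom_op.parameter_map[name].i = value
    return custom_op


def make_stub_custom_op():
    """Hypothetical stand-in for the protobuf entry, for dry runs off-device."""
    return SimpleNamespace(
        name="NpuOptimizer",
        parameter_map=defaultdict(SimpleNamespace),
    )


# On an Ascend environment the real object would typically come from
# (assumption, TF 1.15 style):
#   config = tf.ConfigProto()
#   custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
#   custom_op.name = "NpuOptimizer"
stub = apply_debug_options(make_stub_custom_op())
print({name: attr.i for name, attr in stub.parameter_map.items()})
```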

OK, I will try that.

Hello, I just ran it again and the run failed. Could you please take another look?
Logs: obs://cann-id2048/output-tft/MA-new-tft-03-14-13-47/

@简 ++ To confirm: does running this network not require uploading a dataset?

No, it does not. The dataset is in the obs://cann-id2048/output-tft/MA-new-tft-03-14-13-47/code/outputs/data/volatility folder.

Hello, how do I turn off the fusion rules?

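Hello, I turned off the fusion rules and also fixed the earlier empty-file exception. I then started training; the model ran partway and then failed again with the same error as before: [[{{node GeOp343_0}}]]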


吴定远 changed the associated repository from Ascend/modelzoo-his to Ascend/modelzoo

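The functional error has been confirmed to be caused by fusion rules that match too broadly; with all fusion rules turned off, the function passes. The fusion rules are being optimized internally, and this ISSUE is closed as the project has concluded.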
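The thread does not show how the fusion rules were turned off. One common route on CANN is a fusion switch configuration file passed to the NPU optimizer. The sketch below only writes such a file. The JSON layout (`Switch`, `GraphFusion`, `UBFusion`, `ALL`) and the `fusion_switch_file` option named in the comment are assumptions recalled from the CANN documentation, and the file name and the helper `write_fusion_switch_file` are hypothetical.

```python
import json
import os
import tempfile

# Assumed layout of a fusion switch file that turns every rule off.
ALL_FUSION_OFF = {
    "Switch": {
        "GraphFusion": {"ALL": "off"},
        "UBFusion": {"ALL": "off"},
    }
}


def write_fusion_switch_file(path):
    """Write the all-off fusion switch configuration to `path` and return it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(ALL_FUSION_OFF, handle, indent=4)
    return path


# Demo: write the file into a temporary directory.
demo_path = write_fusion_switch_file(
    os.path.join(tempfile.mkdtemp(), "fusion_switch.cfg")
)
print(demo_path)

# In a TF 1.15 training script the path would then be handed to the NPU
# optimizer, for example (assumption):
#   custom_op.parameter_map["fusion_switch_file"].s = tf.compat.as_bytes(demo_path)
```

Turning every rule off is a diagnostic step; it usually costs performance, so re-enabling rules one group at a time can help locate the rule that misfires.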

颜亚文 changed the task status from Analysing to DONE
