1. Problem symptom (with error log context):
Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py", line 251, in
use_testing_mode=False) # Change to false to use original default params
File "/home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py", line 142, in main
model.fit()
File "/home/ma-user/modelarts/user-job-dir/code/libs/tft_model.py", line 1164, in fit
workers=self.n_multiprocessing_workers)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
use_multiprocessing=use_multiprocessing)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 675, in fit
steps_name='steps_per_epoch')
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
batch_outs = f(ins_batch)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in call
run_metadata=self.run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: GeOp343_0GEOP::::DoRunAsync Failed
Error Message is :
EZ9999: Inner Error!
EZ9999 The error from device(4), serial number is 4, there is an aicore error, core id is 27, error code = 0x800000, dump info: pc start: 0x10001080436f0000, current: 0x1080436f02a8, vec error info: 0x1ec5dd71, mte error info: 0x60000a6, ifu error info: 0x3630efb35e600, ccu error info: 0x0, cube error info: 0xc0, biu error info: 0x0, aic error mask: 0x65000200d000288, para base: 0x1080432335d0.[FUNC:PrintCoreErrorInfo][FILE:device_error_proc.cc][LINE:364]
The device(4), core list[0-0], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:414]
coreId( 0): 0x800000 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:425]
Aicore kernel execute failed, device_id=0, stream_id=1311, report_stream_id=1322, task_id=18, flip_num=0, fault kernel_name=0_611_training/Adam/gradients/clip_by_norm_23/ArithmeticOptimizer/ReplaceMulWithSquare_mul/SquareSumV1/SquareSumV1374, func_name=te_squaresumv1_0c08fd29d007052ed20f4cf2a4d06f1c92602fb3de2e36f93242bd95f9777be5_1__kernel0, program id=3356, hash=18004332741044157744[FUNC:GetError][FILE:stream.cc][LINE:737]
rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
invoke rtStreamSynchronize failed, ret = 507011[FUNC:Synchronize][FILE:hybrid_execution_context.cc][LINE:91]
failed to execute graph. model_id = 62[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:221]
[[{{node GeOp343_0}}]]
[ModelArts Service Log]2022-03-07 15:44:26,330 - ERROR - proc-rank-0-device-0 (pid: 114) has exited with non-zero code: 1
[ModelArts Service Log]2022-03-07 15:44:26,330 - INFO - Begin destroy training processes
[ModelArts Service Log]2022-03-07 15:44:26,330 - INFO - proc-rank-0-device-0 (pid: 114) has exited
[ModelArts Service Log]2022-03-07 15:44:26,331 - INFO - End destroy training processes
time="2022-03-07T15:44:26+08:00" level=info msg="start and wait python command is exit with 1" file="controller.go:181" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
time="2022-03-07T15:44:26+08:00" level=info msg="run-with-backoff exit with 1" file="controller.go:159" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/code/script_train_fixed_params.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
[2022-03-07T15:44:26+08:00][ModelArts Service Log]exiting...
[2022-03-07T15:44:26+08:00][ModelArts Service Log]exit with 1
[2022-03-07T15:44:27+08:00][ModelArts Service Log][sidecar] training is completed
time="2022-03-07T15:44:27+08:00" level=warning msg="the "log-preview-size" parameter exceeds the limit and will be set to the default value 5242880" file="cli.go:192" Command=analyze Component=ma-training-toolkit Platform=ModelArts-Service
[2022-03-07T15:44:27+08:00][ModelArts Service Log][sidecar] stop toolkit_obs_upload_by_channels_pid = 39 by signal SIGTERM
time="2022-03-07T15:44:27+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=train_url
time="2022-03-07T15:44:27+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
2. Software versions:
ascend-share/5.1.rc1.alpha001_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_0215
CANN 5.0.2
TensorFlow 1.15
Python 3.7
OS image version: ascend-share/5.1.rc1.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_0303
Access code: 123456 (valid until 2023/03/02 15:49:51 GMT+08:00)
Run script: code/script_train_fixed_params.py
Command: python -m script_train_fixed_params (parameters already configured)
Dataset: outputs/data/volatility/
4. Log information:
Access code: 123456 (valid until 2023/03/02 15:50:25 GMT+08:00)
Hello, your issue has been received; we will analyze it as soon as possible.
Hello, judging from the current error there is a functional failure, and we need the Graph to investigate. Please collect the Debug log information and graph information as described here:
https://support.huaweicloud.com/tfmigr-cann503alpha2training/atlasma_13_0004.html#section4
Check the Debugger tab and then run training.
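In practice, collecting the debug logs and GE graphs described above usually comes down to a few environment variables set before the session starts. A minimal sketch, assuming the commonly documented CANN switch names (verify each name and value against the linked guide for your CANN version):

```python
import os

# Assumed CANN debug switches; check the names against the guide above.
os.environ["ASCEND_GLOBAL_LOG_LEVEL"] = "0"      # 0 = DEBUG: most verbose device-side logging
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"  # also mirror device logs to stdout
os.environ["DUMP_GE_GRAPH"] = "2"                # dump GE graphs (2 = graphs without weights)
os.environ["DUMP_GRAPH_LEVEL"] = "2"             # restrict which build stages are dumped
```

These must be set before TensorFlow initializes the Ascend session, so placing them at the top of the training script (or in the job's boot file) is the safest option.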
The current error is that CudnnRNNV2 is not supported; it is a GPU-only operator.
Hello, please take a look at the following operator and check whether it can replace the one you are currently using:
https://support.huaweicloud.com/tfmigr-cann503alpha2training/atlastfadapi_07_0036.html
Hello, on my side whether GPU operators are used is configurable. I have now set it to False so that only the CPU implementation is used, but it then reported a different error. I am investigating on my end as well; please help take a look too.
Below is the error information from the new debug run.
Access code: 123456 (valid until 2021/11/24 19:48:31 GMT+08:00)
Sorry, I mixed things up earlier. This network's problem is not the CudnnRNNV2 operator.
The transpose operator exception is caused by an infer-shape exception in the upstream TensorArray operator. Please verify with the following image version:
ascend-share/5.0.3.alpha005_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-2.0.12_1116
Hello, the aicpu module has seen many recent changes; please test again with the latest version.
Hello, I ran with the latest version and the error code is the same. I have updated both the code and the log locations; please take a look.
Hello, I have now run with the March 3 image and updated the error information as well; please take a look.
Log information: obs://cann-id2048/output-tft/MA-new-tft-03-10-10-24/log/
Source code: obs://cann-id2048/output-tft/MA-new-tft-03-10-10-24/code/
(March 3 image version)
https://support.huaweicloud.com/developmenttg-cann503alpha1training/atlasaicerrtrain_16_0004.html
@简 ++ What we can see at the moment is an AIC ERROR. Use the documentation above to produce dump data; mainly, add the following two switches:
custom_op.parameter_map["enable_exception_dump"].i = 1  # Dump the inputs and outputs of the operator that triggered the AI Core Error; the dump files are generated in the script's working directory. Dynamic-shape operators are not supported.
custom_op.parameter_map["op_debug_level"].i = 2  # Enable operator debugging.
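For context, those two parameter_map lines live inside the NpuOptimizer block of the session config. A minimal sketch, assuming the standard TF 1.15 npu_bridge setup shipped in the ModelArts training image (adapt the session wiring to your own script):

```python
import tensorflow as tf
from npu_bridge.npu_init import *  # Ascend TF adapter bundled with the training image

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True        # execute the graph on the Ascend device
custom_op.parameter_map["enable_exception_dump"].i = 1  # dump inputs/outputs of the faulting AI Core op
custom_op.parameter_map["op_debug_level"].i = 2         # enable operator debug info

sess = tf.Session(config=config)  # hand this config to whatever builds the session
```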
The current functional error has been confirmed to be caused by over-generalized fusion rules; with all fusion rules disabled, the function passes. The fusion rules are being optimized internally, so this ISSUE is being closed as resolved.
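For anyone blocked by the same failure before the internal fix ships: fusion rules can be disabled through a fusion switch file handed to the NpuOptimizer. A hedged sketch, assuming the documented fusion_switch_file option and the catch-all "ALL": "off" switch format (rule names and file schema may differ across CANN versions, so check your version's docs):

```python
import json
import tensorflow as tf

# Switch file that turns off every graph-fusion and UB-fusion rule.
switches = {"Switch": {"GraphFusion": {"ALL": "off"}, "UBFusion": {"ALL": "off"}}}
with open("fusion_switch.json", "w") as f:
    json.dump(switches, f)

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["fusion_switch_file"].s = tf.compat.as_bytes("fusion_switch.json")
```

Disabling all fusion costs performance, so it is a workaround for confirming the diagnosis rather than a permanent setting.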