1. Symptom (with error log context):
Operator: AvgPoolV2GradD
The run fails with error code 507015. How should this be investigated?
2. Software versions: 5.0.2
-- CANN version (e.g., CANN 3.0.x, 5.x.x):
-- TensorFlow/PyTorch/MindSpore version:
-- Python version (e.g., Python 3.7.5):
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): Red Hat 4.8.5-36
3. Test steps:
xxxx
4. Log information:
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.318 [engine.cc:922]4011 ReportExceptProc:Task exception! stream_id=552, task_id=12, type=0, retCode=0x26.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.327 [device_error_proc.cc:557] 4011 ProcErrorInfo: Begin to process device error info.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.459 [device_error_proc.cc:503] 4011 ProcessOneElementInRingBuffer: it needs process 1 error messages.
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.476 [device_error_proc.cc:407]4011 ProcessCoreErrorInfo:The error from device(0), serial number is 229, there is a aicore error, core id is 0, error code = 0x800000, error string = The DDR address of the MTE instruction is out of range. dump info: pc start: 0x1000108040026000, current: 0x108040026344, vec error info: 0xffc7b7d, mte error info: 0x60000d0, ifu error info: 0x3fff55f97fe80, ccu error info: 0x0, cube error info: 0x5f, biu error info: 0x0, aic error mask: 0x65000200d000288, para base: 0x108000ad2800.
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.489 [device_error_proc.cc:407]4011 ProcessCoreErrorInfo:The error from device(0), serial number is 229, there is a aicore error, core id is 1, error code = 0x800000, error string = The DDR address of the MTE instruction is out of range. dump info: pc start: 0x1000108040026000, current: 0x108040026344, vec error info: 0xceffaff, mte error info: 0x60000d0, ifu error info: 0x3f733ef402280, ccu error info: 0x0, cube error info: 0x5f, biu error info: 0x0, aic error mask: 0x65000200d000288, para base: 0x108000ad2800.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.573 [device_error_proc.cc:598] 4011 ProcErrorInfo: finished to process device error info, retCode=0.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.585 [engine.cc:933] 4011 ReportExceptProc: excptCallBack_ is null.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.591 [stream.cc:901] 4011 TryDelRecordedTask: del public task from stream, stream_id=552, tailTaskId=12, delTaskId=10, head=2, tail=7
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.600 [logger.cc:1300] 4011 TaskFinished: device_id=0, stream_id=552, task_id=10, task_type=0,task_finish_num=44
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.609 [task.cc:73] 4011 TaskFailCallBack: task ok, stream_id=552, task_id=10, retCode=0
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.620 [stream.cc:901] 4011 TryDelRecordedTask: del public task from stream, stream_id=552, tailTaskId=12, delTaskId=11, head=3, tail=7
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.627 [logger.cc:1300] 4011 TaskFinished: device_id=0, stream_id=552, task_id=11, task_type=0,task_finish_num=45
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.632 [task.cc:73] 4011 TaskFailCallBack: task ok, stream_id=552, task_id=11, retCode=0
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.638 [stream.cc:901] 4011 TryDelRecordedTask: del public task from stream, stream_id=552, tailTaskId=12, delTaskId=12, head=4, tail=7
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.643 [logger.cc:1300] 4011 TaskFinished: device_id=0, stream_id=552, task_id=12, task_type=0,task_finish_num=46
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.649 [task.cc:730]4011 PreCheckTaskErr:Kernel task happen error, retCode=0x26, [aicore exception].
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.470 [task.cc:712]4011 PrintErrorInfo:Aicore kernel execute failed, device_id=0, stream_id=552, report_stream_id=552, task_id=12, fault kernel_name=1/AvgPoolV2GradD_tvmbin, func_name=te_avgpoolv2gradd_2f3e32dac9352c44c510558ee4e2abbbd11759f096460e5253bba90a88323249_30b599a383a17a69_0__kernel0, program id=20, hash=3207780278308406204.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.486 [task.cc:88] 4011 TaskFailCallBack: rtCode=0x7150026,[aicore exception], errorTaskId=12, errorStreamId=552
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.506 [engine.cc:610] 4011 ProcessTaskReport: RTS_DRIVER: report receive, stream_id=552, sq_id=297, task_id=15, sq_head=7.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.516 [engine.cc:610] 4011 ProcessTaskReport: RTS_DRIVER: report receive, stream_id=552, sq_id=297, task_id=15, sq_head=7.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.524 [stream.cc:901] 4011 TryDelRecordedTask: del public task from stream, stream_id=552, tailTaskId=15, delTaskId=13, head=5, tail=7
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.531 [logger.cc:1300] 4011 TaskFinished: device_id=0, stream_id=552, task_id=13, task_type=0,task_finish_num=47
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.536 [task.cc:73] 4011 TaskFailCallBack: task ok, stream_id=552, task_id=13, retCode=0
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.542 [stream.cc:901] 4011 TryDelRecordedTask: del public task from stream, stream_id=552, tailTaskId=15, delTaskId=14, head=6, tail=7
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.548 [logger.cc:1300] 4011 TaskFinished: device_id=0, stream_id=552, task_id=14, task_type=0,task_finish_num=48
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.553 [task.cc:73] 4011 TaskFailCallBack: task ok, stream_id=552, task_id=14, retCode=0
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.558 [stream.cc:901] 4011 TryDelRecordedTask: del public task from stream, stream_id=552, tailTaskId=15, delTaskId=15, head=7, tail=7
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.564 [logger.cc:1300] 4011 TaskFinished: device_id=0, stream_id=552, task_id=15, task_type=2,task_finish_num=49
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.593 [stream.cc:690]3786 GetError:[FINAL][FINAL]Stream Synchronize failed, stream=552 retCode=0x26, [aicore exception].
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.655 [logger.cc:271]3786 StreamSynchronize:[FINAL][FINAL]Stream synchronize failed, stream = 0x46231030
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.661 [npu_driver.cc:380] 4010 CommandOccupy: sqId=286, deviceId=0, tsId=0, command=0xf00011e0140, cmdCount=1.
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.713 [engine.cc:802] 4010 SendingRun: CommandOccupy. sqId=286, cqId=179, deviceId=0, retCode=0.
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.720 [api_c.cc:504]3786 rtStreamSynchronize:[FINAL][FINAL]ErrCode=507015, desc=[aicore exception], InnerCode=0x7150026
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.720 [stream.cc:869] 4010 AddTaskToStream: recorded public task to stream, stream_id=541, task_id=5, task_type=6, head=5, tail=6
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.756 [error_message_manage.cc:26]3786 ReportFuncErrorReason:[FINAL][FINAL]rtStreamSynchronize execute failed, reason=[aicore exception]
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.760 [logger.cc:1289] 4010 TaskLaunchedEx: device_id=0, stream_id=541, task_id=5, task_type=6,task_launched_num=50
[ERROR] ASCENDCL(3786,avgpool2d):2021-07-14-17:06:03.719.799 [/home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/acl/runtime/stream.cpp:69]3786 aclrtSynchronizeStream: synchronize stream failed, runtime result = 507015
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.806 [npu_driver.cc:407] 4010 CommandSend: Command send success, device_id=0, ts_id=0, sq_id=286, reportCount=1, command=0xf00011e0140.
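The runtime log above mixes many INFO lines with a handful of ERROR lines. As a rough triage aid, the following sketch keeps only the ERROR lines and pulls out the failing kernel name and the final error code. The sample text is four lines copied from the log above; the regular expressions and the `summarize` helper are illustrative assumptions based on how these particular lines are formatted, not part of any CANN tool.

```python
import re

# Four lines copied verbatim from the log above: one INFO line and three ERROR lines.
SAMPLE_LOG = """\
[INFO] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.609 [task.cc:73] 4011 TaskFailCallBack: task ok, stream_id=552, task_id=10, retCode=0
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.716.649 [task.cc:730]4011 PreCheckTaskErr:Kernel task happen error, retCode=0x26, [aicore exception].
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.470 [task.cc:712]4011 PrintErrorInfo:Aicore kernel execute failed, device_id=0, stream_id=552, report_stream_id=552, task_id=12, fault kernel_name=1/AvgPoolV2GradD_tvmbin, func_name=te_avgpoolv2gradd_2f3e32dac9352c44c510558ee4e2abbbd11759f096460e5253bba90a88323249_30b599a383a17a69_0__kernel0, program id=20, hash=3207780278308406204.
[ERROR] RUNTIME(3786,avgpool2d):2021-07-14-17:06:03.719.720 [api_c.cc:504]3786 rtStreamSynchronize:[FINAL][FINAL]ErrCode=507015, desc=[aicore exception], InnerCode=0x7150026
"""

# Hypothetical patterns, written to match the two field layouts seen in this log.
KERNEL_RE = re.compile(r"fault kernel_name=([^,]+)")
CODE_RE = re.compile(r"ErrCode=(\d+), desc=\[([^\]]+)\]")


def summarize(log_text):
    """Count ERROR lines and collect failing kernel names and final error codes."""
    errors = [line for line in log_text.splitlines() if line.startswith("[ERROR]")]
    kernels, codes = [], []
    for line in errors:
        kernel_match = KERNEL_RE.search(line)
        if kernel_match:
            kernels.append(kernel_match.group(1))
        code_match = CODE_RE.search(line)
        if code_match:
            codes.append((int(code_match.group(1)), code_match.group(2)))
    return {"error_count": len(errors), "kernels": kernels, "codes": codes}


summary = summarize(SAMPLE_LOG)
print(summary)
```

On a full log, the same function could be fed the contents of the log file to show quickly which kernel the runtime blames and which error code reaches the caller.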
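Please collect log information for your runtime environment as described below. For operator-development issues, we recommend also providing the logs from UT/ST tests and single-operator integration tests.
How to provide logs:
Package the logs and upload them as an attachment. If they exceed the attachment size limit, upload them to an external network drive and share the link.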
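Are you running a single-operator test? Could you describe your usage scenario and tell us which Ascend hardware you are using? If convenient, please share your test project so that we can reproduce and analyze the problem.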
Yes, I am running single-operator tests on an Ascend 910.
Link: https://pan.baidu.com/s/1-U9mZdZscty_dyXiXItoPg Extraction code: 5kgm
Calling AvgPoolV2Grad with
ksize=3,3
strides=1,2
input_shape=1,1,4,5
output_shape=1,1,2,2
produces error 500002. Does this mean asymmetric strides are not supported?
Please take a look, thanks 🙏
Link: https://pan.baidu.com/s/1y3LY45cEdmy-Ay3Baevjsw Extraction code: 1gft
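As a sanity check on the shapes quoted in this comment, the sketch below recomputes the pooled height and width from ksize and strides. It assumes an NCHW layout, no padding, and floor rounding, none of which the comment states; `pool_output_size` is a hypothetical helper, not a CANN API.

```python
def pool_output_size(in_size, ksize, stride, pad=0):
    """Output length along one axis, assuming symmetric padding and floor rounding."""
    return (in_size + 2 * pad - ksize) // stride + 1


# Values from the comment; H and W taken from input_shape=1,1,4,5 (NCHW assumed).
input_hw = (4, 5)
ksize = (3, 3)
strides = (1, 2)

out_hw = tuple(
    pool_output_size(size, k, s) for size, k, s in zip(input_hw, ksize, strides)
)
print(out_hw)  # (2, 2)
```

Under these assumptions the result agrees with output_shape=1,1,2,2, so the asymmetric strides are at least geometrically consistent with the reported shapes. Whether the operator implementation accepts them is a separate question that this check cannot answer.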
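AvgPoolV2Grad is working now, correct?
For Conv2DBackpropFilter, try passing the filter_size input as a Const as well. Some of these operators that are converted to their D variants do not yet support dynamic-shape scenarios, so certain inputs must be Const.
The first input of AvgPoolV2Grad is already a Const. The upper example reports an error; the lower example works.
The filter_size of Conv2DBackpropFilter is also a Const. The upper example works; the lower example reports an error.
All the code is here:
Link: https://pan.baidu.com/s/1y3LY45cEdmy-Ay3Baevjsw Extraction code: 1gft
The debug logs are here; please take a look:
Link: https://pan.baidu.com/s/1dMAt_ysHi9Nt1VLQ-ilSFw Extraction code: 9zia
What is the situation with Conv2DBackpropFilter?
Is it also the case that symmetric strides work and asymmetric strides report an error?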
For Conv2DBackpropFilter, the upper configuration reports an error and the lower configuration works; groups=3.
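The Conv2DBackpropFilter error has largely been root-caused. It occurs because groups=3 is passed to the operator: transdata support for this case has not yet been generalized, so not every shape is guaranteed to work. A requirement covering this is under development, and the problem will be resolved once that support lands.
Hello, when testing, please do not call operators whose names end in D; these operators require an auxiliary matrix.
Thanks for the answer.
Does an operator like Conv2DBackpropFilterD also need an auxiliary matrix? With
input_size=2,3,5,5
output_size=2,3,4,4
filter_size=3,1,2,2
strides=1,1
pads=0,0
groups=3
I also hit the 507015 error.
I did not see any auxiliary-matrix requirement in the header file.
Link: https://pan.baidu.com/s/13UGnOqUcj0dcIUzEdVaRFw Extraction code: tk9y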
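The grouped-convolution parameters in this comment can be checked for internal consistency before suspecting the operator. The sketch below assumes NCHW tensors, a filter laid out as (C_out, C_in / groups, kH, kW), pads read as (pad_h, pad_w) applied symmetrically, and floor rounding. These layouts are assumptions, and `check_grouped_conv` is a hypothetical helper rather than anything from the operator headers.

```python
def conv_output_size(in_size, ksize, stride, pad):
    """Output length along one axis with symmetric padding and floor rounding."""
    return (in_size + 2 * pad - ksize) // stride + 1


def check_grouped_conv(input_size, filter_size, output_size, strides, pads, groups):
    """Return a list of inconsistencies; an empty list means the shapes agree."""
    n, c_in, h_in, w_in = input_size
    c_out, c_in_per_group, k_h, k_w = filter_size
    problems = []
    if c_in % groups or c_out % groups:
        problems.append("channel counts are not divisible by groups")
    elif c_in_per_group != c_in // groups:
        problems.append("filter dim 1 should equal C_in / groups")
    expected = (
        n,
        c_out,
        conv_output_size(h_in, k_h, strides[0], pads[0]),
        conv_output_size(w_in, k_w, strides[1], pads[1]),
    )
    if tuple(output_size) != expected:
        problems.append(f"expected output {expected}, got {tuple(output_size)}")
    return problems


# Values from the comment above.
problems = check_grouped_conv(
    input_size=(2, 3, 5, 5),
    filter_size=(3, 1, 2, 2),
    output_size=(2, 3, 4, 4),
    strides=(1, 1),
    pads=(0, 0),
    groups=3,
)
print(problems)  # []
```

An empty result only shows that the reported shapes describe a valid grouped convolution under the stated assumptions. It does not show that the operator supports them; earlier in the thread the failure is attributed to incomplete transdata support for groups=3.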
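In general, do not use the D-suffixed OPs; use the ones without D. The D variants have been preprocessed on our side: for example, some Const inputs are converted to attributes, or an auxiliary matrix is constructed before the conversion to D, and that matrix must meet specific format requirements. So please use the OPs without D.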
This time I used the non-D operator, Conv2DBackpropFilter.
The upper case runs normally; the lower case reports 507015.
Link: https://pan.baidu.com/s/1X0z5o739fgTFWYGGg5a12g Extraction code: 45a4
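The Conv2DBackpropFilterD problem has been fixed, and the fix is expected to ship in the next community release.
We are closing this issue for now. If you run into further problems, feel free to open another issue and we will respond promptly.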