Yolov7_for_PyTorch training fails with RuntimeError: ACL stream synchronize failed, error code:507018
DONE
#I8EXWC
Training issue
刘金水
Created on 2023-11-08 18:08
1. Symptom (with error log context):

![[image description]](https://foruda.gitee.com/images/1699437985484622913/5d7365c7_1132588.png "screenshot")

2. Software versions:

-- CANN version: 6.3.RC1
-- PyTorch version: 1.11.0
-- Python version: Python 3.7.5
-- Training image version: pytorch-modelzoo:23.0.RC1-1.11.0

3. Test steps:

Follow the steps described at https://gitee.com/ascend/modelzoo-GPL/tree/master/built-in/PyTorch/Official/cv/object_detection/Yolov7_for_PyTorch, then run bash ./test/train_full_1p.sh --data_path=real_data_path to start training.

4. Log:

Image sizes 640 train, 640 test
Using 4 dataloader workers
Logging results to runs/train/yolov710
Starting training for 300 epochs...

Epoch gpu_mem box obj cls total labels img_size
autoanchor: Analyzing anchors... anchors/target = 4.42, Best Possible Recall (BPR) = 0.9912
0%| | 0/29570 [00:00<?, ?it/s].E39999: Inner Error!
E39999 Aicpu kernel execute failed, device_id=0, stream_id=1087, task_id=3236, fault op_name=IndexPut[FUNC:GetError][FILE:stream.cc][LINE:1133]
TraceBack (most recent call last):
rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
synchronize stream failed, runtime result = 507018[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

DEVICE[0] PID[30198]:
EXCEPTION TASK:
Exception info:TGID=3244483, model id=65535, stream id=1087, stream phase=3, task id=3236, task type=aicpu kernel, recently received task id=3241, recently send task id=3235, task phase=RUN
Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
Other info[0]:time=2023-11-08-11:24:01.513.572, function=proc_aicpu_task_done, line=970, error code=0x2a
EXCEPTION TASK:
Exception info:TGID=3244483, model id=65535, stream id=1087, stream phase=3, task id=3236, task type=aicpu kernel, recently received task id=3241, recently send task id=3236, task phase=COMPLETE
Message info[0]:stream_id=63, task_id=3236, task_type=1, result=0x2a, pid=0
Other info[0]:time=2023-11-08-11:24:01.513.610, function=SendTaskReport, line=1161, error code=0x2a

0%| | 0/29570 [00:19<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 693, in <module>
train(hyp, opt, device, tb_writer)
File "train.py", line 427, in train
loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs) # loss scaled by batch_size
File "/home/banglian/AI/Yolov7_for_PyTorch/utils/loss.py", line 986, in __call__
t[self.rangen[fixed_gt_num], selected_tcls] = self.cp
RuntimeError: ACL stream synchronize failed, error code:507018

[ERROR]THPModule_npu_shutdown,torch_npu/csrc/InitNpuBindings.cpp:65:"npuSynchronizeDevice failed err=:npuSynchronizeDevice:/usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUStream.cpp:388 NPU error, error code is 507018
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
EH9999 wait for compute device to finish failed, runtime result = 507018.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
Exception raised from npuSynchronizeDevice at /usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUStream.cpp:388 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x8c (0xffff981b7114 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xa0 (0xffff981b3418 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10_npu::npuSynchronizeDevice() + 0x1a8 (0xffff59b9eab0 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: THPModule_npu_shutdown(_object*) + 0x6c (0xffff90879434 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/_C.cpython-37m-aarch64-linux-gnu.so)
<omitting python frames>
"

[ERROR]THPModule_npu_shutdown,torch_npu/csrc/InitNpuBindings.cpp:73:"NPUCachingAllocator::emptyCache failed err=:npuSynchronizeDevice:/usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUStream.cpp:388 NPU error, error code is 507018
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
EH9999 wait for compute device to finish failed, runtime result = 507018.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
Exception raised from npuSynchronizeDevice at /usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUStream.cpp:388 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x8c (0xffff981b7114 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xa0 (0xffff981b3418 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10_npu::npuSynchronizeDevice() + 0x1a8 (0xffff59b9eab0 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: c10_npu::NPUCachingAllocator::emptyCache() + 0x22c (0xffff59b7e094 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: THPModule_npu_shutdown(_object*) + 0xc0 (0xffff90879488 in /usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/_C.cpython-37m-aarch64-linux-gnu.so)
<omitting python frames>
"
THPModule_npu_shutdown success.
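For triage, the most useful facts in a log like the one above are the failing operator and the runtime result code, which are easy to miss among the stack frames. The following is a minimal sketch, not part of the original report, that pulls those fields out of the text with regular expressions. The function name `summarize_fault` and the patterns are hypothetical and assume the `E39999` line keeps the exact layout shown in this issue; other CANN versions may format it differently.

```python
import re

# Pattern for the "Aicpu kernel execute failed" line as it appears in this issue.
# Assumption: field order and separators match the log quoted above.
FAULT_RE = re.compile(
    r"device_id=(?P<device_id>\d+), stream_id=(?P<stream_id>\d+), "
    r"task_id=(?P<task_id>\d+), fault op_name=(?P<op_name>\w+)"
)
# Pattern for the numeric runtime result reported by the synchronize call.
RESULT_RE = re.compile(r"runtime result = (?P<code>\d+)")


def summarize_fault(log_text):
    """Return a dict with the faulting op and runtime result, if present."""
    summary = {}
    fault = FAULT_RE.search(log_text)
    if fault:
        summary["op_name"] = fault.group("op_name")
        summary["device_id"] = int(fault.group("device_id"))
        summary["stream_id"] = int(fault.group("stream_id"))
        summary["task_id"] = int(fault.group("task_id"))
    result = RESULT_RE.search(log_text)
    if result:
        summary["runtime_result"] = int(result.group("code"))
    return summary


# Two lines copied from the log in this issue.
SAMPLE = (
    "E39999 Aicpu kernel execute failed, device_id=0, stream_id=1087, "
    "task_id=3236, fault op_name=IndexPut[FUNC:GetError][FILE:stream.cc][LINE:1133]\n"
    "synchronize stream failed, runtime result = 507018"
    "[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]\n"
)

if __name__ == "__main__":
    print(summarize_fault(SAMPLE))
```

Run against this log, the sketch reports `IndexPut` as the faulting operator, which matches the indexed assignment `t[self.rangen[fixed_gt_num], selected_tcls] = self.cp` at `utils/loss.py` line 986 in the Python traceback.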