Ascend / modelzoo
pytorch training error
Status: DONE
Issue: #I6PALX
Label: training issue
hushangbin
Created on 2023-03-22 17:45
The customer migrated their code following this reference example: https://www.hiascend.com/document/detail/zh/canncommercial/601/modeldevpt/ptmigr/ptmigr_0072.html and training now fails with an error.

Version info: CANN 6.0.1, PyTorch 1.8.1

Error output:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/intellif/git/imagenet/main.py", line 359, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/intellif/git/imagenet/main.py", line 498, in validate
    top1.all_reduce(args)
  File "/home/intellif/git/imagenet/main.py", line 557, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/hccl_dtype_wraper.py", line 68, in wrapper
    return fn(*args, **kwargs)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/distributed_c10d.py", line 1059, in all_reduce
    work = default_pg.allreduce([tensor], opts)
MemoryError: std::bad_alloc

Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/intellif/git/imagenet/main.py", line 359, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/intellif/git/imagenet/main.py", line 498, in validate
    top1.all_reduce(args)
  File "/home/intellif/git/imagenet/main.py", line 557, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/hccl_dtype_wraper.py", line 68, in wrapper
    return fn(*args, **kwargs)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/distributed_c10d.py", line 1059, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: HCCL error in: /usr1/workspace/FPTA_Daily_Plugin_open_date/Plugin/torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:516
EE1001: The argument is invalid. Reason: rtMemcpyAsync execute failed, reason=[invalid value]
TraceBack (most recent call last):
  Memory async failed, src loc type=1, dst loc type=2, kind=3 is invalid! [FUNC:MemcpyAsyncCheckKindAndLocation][FILE:api_error.cc][LINE:672]
  Memory async failed, check kind and loc, retCode=0x7110001 [FUNC:MemcpyAsync][FILE:api_error.cc][LINE:634]
  The argument is invalid. Reason: rtMemcpyAsync execute failed, reason=[invalid value]

Failing code snippet:

    def all_reduce(self, args):
        if torch_npu.npu.is_available():
            device = 'npu:{}'.format(args.gpu)
        if torch.cuda.is_available():
            device = torch.device("cuda")
        # elif torch.backends.mps.is_available():
        #     device = torch.device("mps")
        else:
            device = torch.device("cpu")
        total = torch.tensor([self.sum, self.count], dtype=torch.float32).to(device)
        dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
        self.sum, self.count = total.tolist()
        self.avg = self.sum / self.count
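A likely culprit in the snippet above is the device-selection branching: because the CUDA check is a second independent `if` rather than an `elif`, a host where CUDA is unavailable falls through to the `else` branch and overwrites `device` with `"cpu"` even after an NPU was detected, so the HCCL all-reduce receives a host tensor and `rtMemcpyAsync` rejects the copy. A minimal sketch of the corrected priority logic follows; the availability flags are passed in explicitly so the branching can be shown and tested without `torch`/`torch_npu` installed, and the function name is illustrative, not part of the reported code:

```python
def pick_device(npu_available: bool, cuda_available: bool, gpu_index: int) -> str:
    """Choose the device string for the all-reduce buffer.

    Mirrors the intended priority of the reported snippet (NPU, then
    CUDA, then CPU). Using elif instead of a second independent if
    prevents the CUDA/CPU branches from clobbering the NPU choice.
    """
    if npu_available:
        # Keep the tensor on the NPU that HCCL will reduce over.
        return 'npu:{}'.format(gpu_index)
    elif cuda_available:
        return 'cuda'
    else:
        return 'cpu'


# On a typical Ascend host (NPU present, no CUDA) the buffer now stays
# on the NPU; the original two-if version would have returned 'cpu' here.
print(pick_device(True, False, 0))   # npu:0
```

With this fix the `total` tensor in `all_reduce` is created on the NPU before `dist.all_reduce` runs, which matches what the HCCL backend expects.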
Comments (0)