Ascend / modelzoo
pytorch training error
Status: DONE
Issue: #I6PALX
Label: training issue
hushangbin
Created on 2023-03-22 17:45
The customer migrated their code following this reference example: https://www.hiascend.com/document/detail/zh/canncommercial/601/modeldevpt/ptmigr/ptmigr_0072.html and training now fails with an error.

Version info: CANN 6.0.1, PyTorch 1.8.1

Error output:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/intellif/git/imagenet/main.py", line 359, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/intellif/git/imagenet/main.py", line 498, in validate
    top1.all_reduce(args)
  File "/home/intellif/git/imagenet/main.py", line 557, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/hccl_dtype_wraper.py", line 68, in wrapper
    return fn(*args, **kwargs)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/distributed_c10d.py", line 1059, in all_reduce
    work = default_pg.allreduce([tensor], opts)
MemoryError: std::bad_alloc

Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/intellif/git/imagenet/main.py", line 359, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/intellif/git/imagenet/main.py", line 498, in validate
    top1.all_reduce(args)
  File "/home/intellif/git/imagenet/main.py", line 557, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/hccl_dtype_wraper.py", line 68, in wrapper
    return fn(*args, **kwargs)
  File "/home/intellif/miniconda3/envs/ascend-pytorch/lib/python3.7/site-packages/torch_npu/distributed/distributed_c10d.py", line 1059, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: HCCL error in: /usr1/workspace/FPTA_Daily_Plugin_open_date/Plugin/torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:516
EE1001: The argument is invalid. Reason: rtMemcpyAsync execute failed, reason=[invalid value]
TraceBack (most recent call last):
  Memory async failed, src loc type=1, dst loc type=2, kind=3 is invalid! [FUNC:MemcpyAsyncCheckKindAndLocation][FILE:api_error.cc][LINE:672]
  Memory async failed, check kind and loc, retCode=0x7110001 [FUNC:MemcpyAsync][FILE:api_error.cc][LINE:634]
  The argument is invalid. Reason: rtMemcpyAsync execute failed, reason=[invalid value]

Failing code snippet:

    def all_reduce(self, args):
        if torch_npu.npu.is_available():
            device = 'npu:{}'.format(args.gpu)
        if torch.cuda.is_available():
            device = torch.device("cuda")
        # elif torch.backends.mps.is_available():
        #     device = torch.device("mps")
        else:
            device = torch.device("cpu")
        total = torch.tensor([self.sum, self.count], dtype=torch.float32).to(device)
        dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
        self.sum, self.count = total.tolist()
        self.avg = self.sum / self.count
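A likely culprit in the snippet above is the device-selection branching: because the CUDA check is a second independent `if` rather than an `elif`, a host where CUDA is unavailable falls through to the `else` branch and overwrites `device` with `"cpu"` even after an NPU was detected, so the HCCL all-reduce receives a host tensor and `rtMemcpyAsync` rejects the copy. A minimal sketch of the corrected priority logic follows; the availability flags are passed in explicitly so the branching can be shown and tested without `torch`/`torch_npu` installed, and the function name is illustrative, not part of the reported code:

```python
def pick_device(npu_available: bool, cuda_available: bool, gpu_index: int) -> str:
    """Choose the device string for the all-reduce buffer.

    Mirrors the intended priority of the reported snippet (NPU, then
    CUDA, then CPU). Using elif instead of a second independent if
    prevents the CUDA/CPU branches from clobbering the NPU choice.
    """
    if npu_available:
        # Keep the tensor on the NPU that HCCL will reduce over.
        return 'npu:{}'.format(gpu_index)
    elif cuda_available:
        return 'cuda'
    else:
        return 'cpu'


# On a typical Ascend host (NPU present, no CUDA) the buffer now stays
# on the NPU; the original two-if version would have returned 'cpu' here.
print(pick_device(True, False, 0))   # npu:0
```

With this fix the `total` tensor in `all_reduce` is created on the NPU before `dist.all_reduce` runs, which matches what the HCCL backend expects.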
Comments (0)