
Ascend/pytorch

Fused optimizer: torch_npu.optim.NpuFusedAdamW throws an error

DONE
Training issue
Created on 2024-08-29 14:52
  1. Replacing torch.optim.AdamW directly with torch_npu.optim.NpuFusedAdamW leads to a noticeable increase in NPU memory usage. Is this expected?
    2.1 After making sure memory was no longer the bottleneck, the direct replacement first raises ValueError: set_to_none is not supported in fused optimizers, corresponding to the line: self.optimizer.zero_grad(set_to_none=True)
    2.2 After commenting that line out, the following error appears: the size of tensor selfRef [557413176] must match the size of tensor other [862615112]
    Could you advise on possible causes, how to localize the problem, or a solution?

There is nothing special about the usage:

self.optimizer.zero_grad(set_to_none=True)
model.train()
self._loss.backward()
self.optimizer.step()
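
For context, the direct replacement mentioned in point 1 presumably amounts to something like the following (a minimal sketch; the learning rate and weight decay values are illustrative assumptions, not taken from the issue):

from torch_npu.optim import NpuFusedAdamW

# drop-in replacement for torch.optim.AdamW; hyperparameters are placeholders
optimizer = NpuFusedAdamW(model.parameters(), lr=1e-4, weight_decay=0.01)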

Error log

--- Logging error ---
Traceback (most recent call last):
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/utils.py", line 95, in train
    self.run_step()
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/trainer_mmdit.py", line 405, in run_step
    self.optimizer.step()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch_npu/optim/npu_fused_optim_base.py", line 123, in step
    self._group_step(i)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch_npu/optim/npu_fused_adamw.py", line 176, in _group_step
    exp_avg.mul_(beta1).add_(combined_grad, alpha=1 - beta1)
RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-08-29-14:21:38.950.285 the size of tensor selfRef [557672824] must match the size of tensor other [563885960].
        TraceBack (most recent call last):
        563885960 and 557672824 cannot broadcast.
        the size of tensor selfRef [557672824] must match the size of tensor other [563885960].

[ERROR] 2024-08-29-14:21:38 (PID:655738, Device:0, RankID:-1) ERR01005 OPS internal error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "<string>", line 1, in <module>
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/pipelines/train_pipeline.py", line 36, in pipeline
    trainer.train()
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/utils.py", line 100, in train
    self.logger.error("Exception during training: ", e)
Message: 'Exception during training: '
Arguments: (RuntimeError('call aclnnInplaceAdd failed, detail:EZ1001: 2024-08-29-14:21:38.950.285 the size of tensor selfRef [557672824] must match the size of tensor other [563885960].\n        TraceBack (most recent call last):\n        563885960 and 557672824 cannot broadcast.\n        the size of tensor selfRef [557672824] must match the size of tensor other [563885960].\n\n[ERROR] 2024-08-29-14:21:38 (PID:655738, Device:0, RankID:-1) ERR01005 OPS internal error'),)
[W compiler_depend.ts:2438] Warning: Tensor not is not allocated by NPUCachingAllocator, skip eraseStream. (function operator())

Comments (1)

sankarea created the training issue 10 months ago
sankarea modified the description 10 months ago
sankarea modified the title 10 months ago

1. An increase in memory usage after swapping the optimizer is possible, because the two optimizers are implemented differently. If you want to see exactly where the difference comes from, you can collect profiling data and take a look (a rough profiling sketch is included after point 2).
2. ValueError: set_to_none is not supported in fused optimizers means set_to_none=True is not supported. There is no need to comment the call out; change the argument to False instead (see the sketch below).
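
As a minimal sketch of the suggested fix, the training step from the issue only changes in the zero_grad argument (everything else is taken verbatim from the snippet above):

self.optimizer.zero_grad(set_to_none=False)  # fused optimizers require set_to_none=False
model.train()
self._loss.backward()
self.optimizer.step()

For the memory comparison in point 1, one rough way to collect profiling data is shown below. This is only an illustrative sketch, not part of the original answer; it assumes the torch_npu.profiler interface, which mirrors torch.profiler, and that a few training steps are representative:

import torch_npu

# Collect a memory profile over a few steps to compare the two optimizers.
with torch_npu.profiler.profile(
        activities=[torch_npu.profiler.ProfilerActivity.CPU,
                    torch_npu.profiler.ProfilerActivity.NPU],
        profile_memory=True,
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiling_result")) as prof:
    for _ in range(5):  # a few representative training steps
        self.run_step()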

huangyunlong changed the task status from TODO to Analysing 9 months ago
huangyunlong changed the task status from Analysing to DONE 9 months ago
