
Ascend/pytorch

Fused optimizer: torch_npu.optim.NpuFusedAdamW throws an error

DONE
Training issue
Created on 2024-08-29 14:52
  1. Replacing torch.optim.AdamW directly with torch_npu.optim.NpuFusedAdamW leads to a noticeable increase in NPU memory usage. Is this expected?
    2.1 After making sure memory was no longer the bottleneck, the direct replacement first raises ValueError: set_to_none is not supported in fused optimizers, corresponding to the line: self.optimizer.zero_grad(set_to_none=True)
    2.2 After commenting that line out, the following error appears: the size of tensor selfRef [557413176] must match the size of tensor other [862615112]
    Could you advise on possible causes, how to localize the problem, or a solution?

There is nothing special about the usage:

self.optimizer.zero_grad(set_to_none=True)
model.train()
self._loss.backward()
self.optimizer.step()
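
For context, the direct replacement mentioned in point 1 presumably amounts to something like the following (a minimal sketch; the learning rate and weight decay values are illustrative assumptions, not taken from the issue):

from torch_npu.optim import NpuFusedAdamW

# drop-in replacement for torch.optim.AdamW; hyperparameters are placeholders
optimizer = NpuFusedAdamW(model.parameters(), lr=1e-4, weight_decay=0.01)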

Error log

--- Logging error ---
Traceback (most recent call last):
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/utils.py", line 95, in train
    self.run_step()
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/trainer_mmdit.py", line 405, in run_step
    self.optimizer.step()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch_npu/optim/npu_fused_optim_base.py", line 123, in step
    self._group_step(i)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch_npu/optim/npu_fused_adamw.py", line 176, in _group_step
    exp_avg.mul_(beta1).add_(combined_grad, alpha=1 - beta1)
RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-08-29-14:21:38.950.285 the size of tensor selfRef [557672824] must match the size of tensor other [563885960].
        TraceBack (most recent call last):
        563885960 and 557672824 cannot broadcast.
        the size of tensor selfRef [557672824] must match the size of tensor other [563885960].

[ERROR] 2024-08-29-14:21:38 (PID:655738, Device:0, RankID:-1) ERR01005 OPS internal error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "<string>", line 1, in <module>
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/pipelines/train_pipeline.py", line 36, in pipeline
    trainer.train()
  File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/utils.py", line 100, in train
    self.logger.error("Exception during training: ", e)
Message: 'Exception during training: '
Arguments: (RuntimeError('call aclnnInplaceAdd failed, detail:EZ1001: 2024-08-29-14:21:38.950.285 the size of tensor selfRef [557672824] must match the size of tensor other [563885960].\n        TraceBack (most recent call last):\n        563885960 and 557672824 cannot broadcast.\n        the size of tensor selfRef [557672824] must match the size of tensor other [563885960].\n\n[ERROR] 2024-08-29-14:21:38 (PID:655738, Device:0, RankID:-1) ERR01005 OPS internal error'),)
[W compiler_depend.ts:2438] Warning: Tensor not is not allocated by NPUCachingAllocator, skip eraseStream. (function operator())

Comments (1)

sankarea created the training issue 10 months ago
sankarea modified the description 10 months ago
sankarea modified the title 10 months ago

1. An increase in memory usage after swapping the optimizer is possible, because the two optimizers are implemented differently. If you want to see exactly where the difference comes from, you can collect profiling data and take a look (a rough profiling sketch is included after point 2).
2. ValueError: set_to_none is not supported in fused optimizers means set_to_none=True is not supported. There is no need to comment the call out; change the argument to False instead (see the sketch below).
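
As a minimal sketch of the suggested fix, the training step from the issue only changes in the zero_grad argument (everything else is taken verbatim from the snippet above):

self.optimizer.zero_grad(set_to_none=False)  # fused optimizers require set_to_none=False
model.train()
self._loss.backward()
self.optimizer.step()

For the memory comparison in point 1, one rough way to collect profiling data is shown below. This is only an illustrative sketch, not part of the original answer; it assumes the torch_npu.profiler interface, which mirrors torch.profiler, and that a few training steps are representative:

import torch_npu

# Collect a memory profile over a few steps to compare the two optimizers.
with torch_npu.profiler.profile(
        activities=[torch_npu.profiler.ProfilerActivity.CPU,
                    torch_npu.profiler.ProfilerActivity.NPU],
        profile_memory=True,
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiling_result")) as prof:
    for _ in range(5):  # a few representative training steps
        self.run_step()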

huangyunlong changed the task status from TODO to Analysing 9 months ago
huangyunlong changed the task status from Analysing to DONE 9 months ago
