self.optimizer.zero_grad(set_to_none=True)
There is nothing special about how it is used:
self.optimizer.zero_grad(set_to_none=True)
model.train()
self._loss.backward()
self.optimizer.step()
Error log
--- Logging error ---
Traceback (most recent call last):
File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/utils.py", line 95, in train
self.run_step()
File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/trainer_mmdit.py", line 405, in run_step
self.optimizer.step()
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch_npu/optim/npu_fused_optim_base.py", line 123, in step
self._group_step(i)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch_npu/optim/npu_fused_adamw.py", line 176, in _group_step
exp_avg.mul_(beta1).add_(combined_grad, alpha=1 - beta1)
RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-08-29-14:21:38.950.285 the size of tensor selfRef [557672824] must match the size of tensor other [563885960].
TraceBack (most recent call last):
563885960 and 557672824 cannot broadcast.
the size of tensor selfRef [557672824] must match the size of tensor other [563885960].
[ERROR] 2024-08-29-14:21:38 (PID:655738, Device:0, RankID:-1) ERR01005 OPS internal error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 1083, in emit
msg = self.format(record)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 927, in format
return fmt.format(record)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 663, in format
record.message = record.getMessage()
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/logging/__init__.py", line 367, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "<string>", line 1, in <module>
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/spawn.py", line 129, in _main
return self._bootstrap(parent_sentinel)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/pipelines/train_pipeline.py", line 36, in pipeline
trainer.train()
File "/home/ma-user/work/code/0823_MFU/MIMO2_dev_bla_128/core/trainers/utils.py", line 100, in train
self.logger.error("Exception during training: ", e)
Message: 'Exception during training: '
Arguments: (RuntimeError('call aclnnInplaceAdd failed, detail:EZ1001: 2024-08-29-14:21:38.950.285 the size of tensor selfRef [557672824] must match the size of tensor other [563885960].\n TraceBack (most recent call last):\n 563885960 and 557672824 cannot broadcast.\n the size of tensor selfRef [557672824] must match the size of tensor other [563885960].\n\n[ERROR] 2024-08-29-14:21:38 (PID:655738, Device:0, RankID:-1) ERR01005 OPS internal error'),)
[W compiler_depend.ts:2438] Warning: Tensor not is not allocated by NPUCachingAllocator, skip eraseStream. (function operator())
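The second traceback in the log above ("--- Logging error ---", ending in TypeError: not all arguments converted during string formatting) is separate from the NPU failure. It comes from the call self.logger.error("Exception during training: ", e) in utils.py, which passes e as a formatting argument while the message contains no placeholder. A minimal sketch of a call that formats cleanly, using only the standard logging module; make_logger, run_and_log and failing_step are hypothetical names for illustration and are not taken from the project:

```python
import io
import logging


def make_logger(stream):
    """Build a small logger that writes to the given stream (illustrative helper)."""
    logger = logging.getLogger("trainer_demo")
    logger.handlers.clear()
    logger.setLevel(logging.ERROR)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run_and_log(step, logger):
    """Run one step; log any exception with a proper %s placeholder."""
    try:
        step()
    except Exception as e:
        # The "%s" placeholder consumes e. Without it, logging fails while
        # formatting the record and prints "--- Logging error ---" instead.
        logger.error("Exception during training: %s", e)
        return False
    return True


def failing_step():
    # Stand-in for the failing optimizer step; the message is shortened.
    raise RuntimeError("call aclnnInplaceAdd failed")


buffer = io.StringIO()
ok = run_and_log(failing_step, make_logger(buffer))
logged = buffer.getvalue()
```

If the full stack trace is wanted in the log as well, logger.exception("Exception during training") inside the except block records it at ERROR level.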
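1. An increase in device memory after swapping the optimizer is possible, because the two optimizers are implemented differently. To see exactly which difference causes it, collect profiling data and inspect it.
2. ValueError: set_to_none is not supported in fused optimizers: the fused optimizer does not support set_to_none=True. There is no need to comment the code out; change the argument to False instead.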
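The advice in point 2 amounts to calling zero_grad(set_to_none=False) whenever the fused optimizer is in use. The following is a minimal sketch of one way to keep a single call site working with both kinds of optimizer. It assumes only that the optimizer raises ValueError when set_to_none=True is rejected, as the message quoted above indicates. zero_grad_compat and StubFusedOptimizer are hypothetical names; the stub merely imitates that one behaviour so the snippet runs without torch or torch_npu.

```python
def zero_grad_compat(optimizer):
    """Clear gradients; fall back to set_to_none=False if True is rejected.

    Returns True when set_to_none=True was accepted, False otherwise.
    """
    try:
        optimizer.zero_grad(set_to_none=True)
        return True
    except ValueError:
        # Per the reply above: keep the call and pass False instead.
        optimizer.zero_grad(set_to_none=False)
        return False


class StubFusedOptimizer:
    """Hypothetical stand-in that only mimics the ValueError from the thread."""

    def __init__(self):
        self.calls = []

    def zero_grad(self, set_to_none=True):
        if set_to_none:
            raise ValueError("set_to_none is not supported in fused optimizers")
        self.calls.append(set_to_none)


opt = StubFusedOptimizer()
used_none = zero_grad_compat(opt)
```

In a trainer that always uses the fused optimizer, hard-coding self.optimizer.zero_grad(set_to_none=False) is simpler and matches the reply directly; the fallback is only useful when the same code path must also serve optimizers that accept set_to_none=True.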