The driver version is 24.1.rc2.b030 and CANN is 8.0.T13, with the matching CANN-kernels package installed. Training fails with the following error:
```
  File "/opt/xuwei/lum/train.py", line 130, in train
    trainer.train()
  File "/root/anaconda3/envs/xuwei/lib/python3.9/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/xuwei/lib/python3.9/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/envs/xuwei/lib/python3.9/site-packages/transformers/trainer.py", line 3045, in training_step
    self.accelerator.backward(loss)
  File "/root/anaconda3/envs/xuwei/lib/python3.9/site-packages/accelerate/accelerator.py", line 2001, in backward
    loss.backward(**kwargs)
  File "/root/anaconda3/envs/xuwei/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/root/anaconda3/envs/xuwei/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: call aclnnNLLLossBackward failed, detail:EZ9999: Inner Error!
EZ9999: 2024-08-06-19:35:30.419.792 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op NLLLossGrad does not has any binary.
Kernel Run failed. opType: 38, NLLLossGrad
launch failed for NLLLossGrad, errno:561000.
[ERROR] 2024-08-06-19:35:30 (PID:13658, Device:0, RankID:-1) ERR01005 OPS internal error
```
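For triage, the key facts in the log above are the failing aclnn API (`aclnnNLLLossBackward`), the operator with no binary kernel (`NLLLossGrad`), and the errno (`561000`). The snippet below is a minimal sketch for pulling those fields out of such a log, for example when comparing several failed runs. The function name `summarize_acl_error` and the regular expressions are my own assumptions based only on the message wording shown here; they are not part of CANN or torch_npu and may not match other log formats.

```python
import re


def summarize_acl_error(log_text):
    """Extract the failing aclnn API, operator name, and errno from an error log.

    Hypothetical helper: the patterns mirror the wording of the log in this
    post and are not an official CANN log format.
    """
    api = re.search(r"call (aclnn\w+) failed", log_text)
    launch = re.search(r"launch failed for (\w+), errno:(\d+)", log_text)
    # The log's own wording is "does not has any binary"; match it verbatim.
    no_binary = re.search(r"Op (\w+) does not has any binary", log_text)
    return {
        "api": api.group(1) if api else None,
        "op": launch.group(1) if launch else None,
        "errno": int(launch.group(2)) if launch else None,
        "missing_binary": no_binary is not None,
    }


# Lines copied from the log in this post.
sample = """RuntimeError: call aclnnNLLLossBackward failed, detail:EZ9999: Inner Error!
Op NLLLossGrad does not has any binary.
launch failed for NLLLossGrad, errno:561000.
"""

summary = summarize_acl_error(sample)
print(summary)
```

A `missing_binary` value of `True` points at the operator's binary kernel not being found at runtime, which is worth checking against the installed CANN-kernels package before looking at the training code itself.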