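1. Problem description (with error log context):
torch.multinomial returns effectively random indices when used for sampling (do_sample) in large-model decoding, so the decoded text comes out as garbage.
Both torch versions listed below were tried, and the error is the same.
CANN 7.0 with torch 2.1.0 does not have this problem.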
2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): 8.0.RC2.alpha002
-- TensorFlow/PyTorch/MindSpore version:
torch==2.1.0
torch-npu==2.1.0.post6
torch==2.3.1
torch-npu==2.3.1
-- Python version (e.g., Python 3.7.5): Python 3.10.14
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): Ubuntu 20.04
3. Steps to reproduce:
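import torch
import torch_npu
# Five large weights followed by 7000 zero weights (7005 entries in total).
weights = torch.tensor([[164.7500, 163.6562, 163.1562, 162.9688, 162.6250] + [0, 0, 0, 0, 0, 0, 0] * 1000])
a = torch.multinomial(weights.npu().half(), num_samples=5)
print(a)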
tensor([[1950, 2822, 6356, 0, 3]], device='npu:0')
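The sampled indices should all be among the first 5 (the only nonzero weights), but indices from the zero-weight tail were drawn instead.
When the array is short, however, the result is as expected; only long arrays fail: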
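# Same five weights followed by only 70 zero weights (75 entries in total).
weights = torch.tensor([[164.7500, 163.6562, 163.1562, 162.9688, 162.6250] + [0, 0, 0, 0, 0, 0, 0] * 10])
a = torch.multinomial(weights.npu().half(), num_samples=5)
print(a)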
tensor([[1, 4, 3, 0, 2]], device='npu:0')
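The expectation stated above (only the five nonzero-weight positions can be drawn) can be checked mechanically. Below is a minimal sketch in plain Python, applied to the two outputs reported above. `zero_weight_hits` is a hypothetical helper name, not part of torch or torch_npu.

```python
def zero_weight_hits(weights, sampled):
    """Return the sampled indices whose weight is zero (hypothetical helper)."""
    return [i for i in sampled if weights[i] == 0]


top5 = [164.7500, 163.6562, 163.1562, 162.9688, 162.6250]
long_weights = top5 + [0] * 7 * 1000   # 7005 entries, as in the failing case
short_weights = top5 + [0] * 7 * 10    # 75 entries, as in the passing case

# Indices copied from the NPU outputs reported above.
bad = zero_weight_hits(long_weights, [1950, 2822, 6356, 0, 3])
ok = zero_weight_hits(short_weights, [1, 4, 3, 0, 2])
print(bad)  # [1950, 2822, 6356]
print(ok)   # []
```

For a live result, the same check could be run on the weight row and `a[0].tolist()`; any non-empty return value means an index with zero probability was sampled.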
##############
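Separately, moving a tensor directly from device 2 to device 1 raises an error; it has to be moved to the CPU first and then to device 1.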
a=torch.tensor([1])
a.to('npu:2').to('npu:1')
It only works when done like this:
a.to('npu:2').cpu().to('npu:1')
RuntimeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 a.to('npu:2').to('npu:1')
RuntimeError: copy_d2d_baseformat_opapi:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:175 NPU function error: c10_npu::acl::AclrtSynchronizeStreamWithTimeout(copy_stream), error code is 507001
[ERROR] 2024-07-12-10:30:30 (PID:1000094, Device:2, RankID:-1) ERR00100 PTA call acl api failed
[Error]: An internal error occurs in the task scheduler module on the device.
Rectify the fault based on the error information in the ascend log.
EI9999: Inner Error!
The error from device(chipId:2, dieId:0), serial number is 1. there is a sdma error, sdma channel is 16, sdmaBlkFsmState=0x7, dfxSdmaBlkFsmOstCnt=0x0, sdmaChFree=0x0, irqStatus=0x220000, cqeStatus=0x3 [FUNC:ProcessStarsSdmaErrorInfo][FILE:device_error_proc.cc][LINE:1281]
EI9999: 2024-07-12-10:30:30.882.697 Memory async copy failed, device_id=2, stream_id=2, task_id=1, flip_num=0, copy_type=2, memcpy_type=0, copy_data_type=0, length=8[FUNC:GetError][FILE:stream.cc][LINE:1502]
TraceBack (most recent call last):
rtStreamSynchronizeWithTimeout execute failed, reason=[task exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 507001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
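The workaround described above (staging the tensor on the CPU between the two devices) can be wrapped in a small helper. This is a minimal sketch; `move_via_cpu` is a hypothetical name, not a torch_npu API, and it only avoids the failing direct copy rather than fixing it.

```python
def move_via_cpu(tensor, target_device):
    """Copy `tensor` to `target_device` through host memory.

    Hypothetical helper: it relies only on the standard Tensor.cpu() and
    Tensor.to() methods, so no direct device-to-device copy is issued.
    """
    return tensor.cpu().to(target_device)


# Usage on the machine from the report (requires torch and torch_npu):
#   a = torch.tensor([1])
#   b = move_via_cpu(a.to('npu:2'), 'npu:1')
```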