1. Problem description:
x.to("npu:0") and x.to("npu:1") transfer data to the same NPU, which one depending on torch.npu.set_device(). This is different from the CUDA behavior, where x.to("cuda:0") and x.to("cuda:1") send data to different GPUs, and only the default x.to("cuda") depends on the configuration of torch.cuda.set_device(). In addition, torch.npu.set_device() cannot be called multiple times during a Python session: only the first call works, and a second call leads to a runtime error. Thus there is no way for a single host process to send data to both npu:0 and npu:1. In CUDA, this is easily doable.
2. Software versions:
-- CANN version: 6.0.1 (C84)
-- PyTorch version: 1.8.1
-- Python version: Python 3.7.10
-- OS version: EulerOS 2.0 (SP8)
3. Steps to reproduce:
Run the script below:
import torch
import torch_npu
print(torch.__version__, torch_npu.__version__) # 1.8.0a0+56b43f4 1.8.1
print(torch.npu.is_available()) # True
print(torch.npu.device_count()) # 2
print(torch.npu.current_device()) # 0
switch_device_before = True  # switching the device at the beginning works fine
if switch_device_before:
    torch.npu.set_device("npu:1")
    print(torch.npu.current_device())  # 1
x = torch.ones(3)
x_0 = x.to("npu:0")
print(x_0.device)  # shows npu:1; shouldn't it be npu:0?
x_1 = x.to("npu:1")
print(x_1.device)  # shows npu:1
print((x_0 + x_1).device)  # shows npu:1
switch_device_after = False  # change to True to reproduce the RuntimeError
if switch_device_after:
    torch.npu.set_device("npu:0")
    print(torch.npu.current_device())  # 0
    x.to("npu:0")  # RuntimeError
Both x_0 and x_1 go to npu:1, ignoring the x_0 = x.to("npu:0") call.
On the other hand, after changing switch_device_before to False, both x_0 and x_1 go to the default npu:0, ignoring the x_1 = x.to("npu:1") statement.
In CUDA, x_0 = x.to("cuda:0") and x_1 = x.to("cuda:1") go to two different GPUs, and x_0 + x_1 raises RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!, which is the correct and expected behavior (see the reference sketch below).
After the x_0 and x_1 tensors have been initialized on npu:1, I cannot switch back to npu:0. Changing switch_device_after to True, and thus calling torch.npu.set_device one more time, makes the host-to-device transfer x.to("npu:0") raise a RuntimeError.
In CUDA, it is fine to call torch.cuda.set_device() multiple times to switch the default device.
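For reference, here is a minimal CUDA sketch of the behavior described above; it assumes a host with at least two visible CUDA GPUs and is not part of the original reproduction script:
import torch
# Reference behavior on CUDA (assumes >= 2 visible GPUs).
x = torch.ones(3)
x_0 = x.to("cuda:0")
x_1 = x.to("cuda:1")
print(x_0.device, x_1.device)  # cuda:0 cuda:1 -- explicit indices are honored
try:
    x_0 + x_1  # mixing devices fails, as expected
except RuntimeError as e:
    print(e)   # Expected all tensors to be on the same device ...
# set_device() can be called repeatedly; only the bare "cuda" target follows it.
torch.cuda.set_device(1)
print(torch.ones(1).to("cuda").device)  # cuda:1
torch.cuda.set_device(0)
print(torch.ones(1).to("cuda").device)  # cuda:0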
4. Log output:
[W OperatorEntry.cpp:121] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_has_compatible_shallow_copy_type(Tensor self, Tensor from) -> (bool)
    registered at /usr1/pytorch/build/aten/src/ATen/RegisterSchema.cpp:20
  dispatch key: Math
  previous kernel: registered at /usr1/pytorch/build/aten/src/ATen/RegisterMath.cpp:5686
       new kernel: registered at /usr1/workspace/FPTA_Daily_Plugin_open_v1.8.1-3.0.tr6/Plugin/torch_npu/csrc/aten/ops/HasCompatibleShallowCopyType.cpp:37 (function registerKernel)
1.8.0a0+56b43f4 1.8.1
True
2
0
1
device(type='npu', index=1)
device(type='npu', index=1)
device(type='npu', index=1)
0
Traceback (most recent call last):
  File "set_device_bug.py", line 30, in <module>
    x.to("npu:0") # RuntimeError
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/utils/device_guard.py", line 35, in wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/utils/tensor_methods.py", line 87, in _to
    return torch_npu._C.to(self, *args, **kwargs)
RuntimeError
terminate called after throwing an instance of 'c10::Error'
what(): 0 INTERNAL ASSERT FAILED at "/usr1/workspace/FPTA_Daily_Plugin_open_v1.8.1-3.0.tr6/Plugin/torch_npu/csrc/core/npu/NPUStream.cpp":142, please report a bug to PyTorch. Could not compute stream ID for 0xffff0315d960 on device (something has gone horribly wrong!)
Exception raised from NPUStream_getStreamId at /usr1/workspace/FPTA_Daily_Plugin_open_v1.8.1-3.0.tr6/Plugin/torch_npu/csrc/core/npu/NPUStream.cpp:142 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x8c (0xffffa9c5ed0c in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x802788 (0xffff02d92788 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #2: c10_npu::getCurrentNPUStream(signed char) + 0x7c (0xffff02d97464 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x816304 (0xffff02da6304 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: c10_npu::NpuSysCtrl::Finalize() + 0xf8 (0xffff02da6b98 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: THPModule_npu_shutdown(_object*) + 0xd8 (0xffff031753b0 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/_C.cpython-37m-aarch64-linux-gnu.so)
<omitting python frames>
frame #16: __libc_start_main + 0xe0 (0xffffabbe7b20 in /lib64/libc.so.6)
Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 268, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
len(cache))
The current NPU does not support the preceding behavior.
Is there any solution?
This is supported in the torch_npu 2.1.0 and later branches, starting from the 6.0.rc1 release.
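Until an upgrade is possible, one conceivable workaround on torch_npu 1.8.1 is to keep one host process per NPU, so that each process calls torch.npu.set_device() exactly once; this is only a sketch and assumes torch.multiprocessing.spawn works with torch_npu in this environment:
import torch
import torch_npu
import torch.multiprocessing as mp

def worker(device_id):
    # Each process binds to exactly one NPU and never calls set_device() again.
    torch.npu.set_device(f"npu:{device_id}")
    x = torch.ones(3).to(f"npu:{device_id}")
    print(device_id, x.device)

if __name__ == "__main__":
    # spawn() passes the process index (0, 1, ...) as the first argument.
    mp.spawn(worker, nprocs=2)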