
Ascend/pytorch

torch.npu.set_device() only works at the beginning of a Python process

DONE
Bug-Report
Created on 2023-04-28 17:42

1. Problem description:

  1. With 2 Ascend cards, both x.to("npu:0") and x.to("npu:1") transfer data to the same NPU, whichever one torch.npu.set_device() selected. This differs from the CUDA behavior, where x.to("cuda:0") and x.to("cuda:1") send data to different GPUs, and only the default x.to("cuda") depends on torch.cuda.set_device() (see the CUDA sketch after this list).
  2. torch.npu.set_device() cannot be called more than once during a Python session. Only the first call works; a second call leads to a runtime error. As a result, a single host process has no way to send data to both npu:0 and npu:1. In CUDA, this is easily doable.
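
For reference, here is the CUDA behavior the two points above compare against (a minimal sketch of my own, assuming a host with at least two CUDA devices):

import torch

x = torch.ones(3)

x_0 = x.to("cuda:0")           # lands on GPU 0 regardless of the current default device
x_1 = x.to("cuda:1")           # lands on GPU 1
print(x_0.device, x_1.device)  # cuda:0 cuda:1

torch.cuda.set_device(1)       # switching the default device repeatedly is fine
torch.cuda.set_device(0)

# Mixing the two devices raises the usual error, confirming they really live on different GPUs:
# x_0 + x_1  # RuntimeError: Expected all tensors to be on the same device, ...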

2. Software versions:
-- CANN version: 6.0.1 (C84)
-- PyTorch version: 1.8.1
-- Python version: 3.7.10
-- OS version: EulerOS 2.0 (SP8)

3. Steps to reproduce:

Run the script below:

import torch
import torch_npu

print(torch.__version__, torch_npu.__version__) # 1.8.0a0+56b43f4 1.8.1

print(torch.npu.is_available())  # True
print(torch.npu.device_count())  # 2
print(torch.npu.current_device())  # 0

switch_device_before = True  # switching the device at the beginning of the process works
if switch_device_before:
    torch.npu.set_device("npu:1")
    print(torch.npu.current_device())  # 1

x = torch.ones(3)

x_0 = x.to("npu:0")
print(x_0.device)  # shows npu:1; shouldn't it be npu:0?

x_1 = x.to("npu:1")
print(x_1.device)  # shows npu:1

print((x_0 + x_1).device)  # shows npu:1

switch_device_after = False  # change to True to reproduce the RuntimeError
if switch_device_after:
    torch.npu.set_device("npu:0")
    print(torch.npu.current_device())  # 0
    
    x.to("npu:0")  # RuntimeError

Both x_0 and x_1 go to npu:1, ignoring the x_0 = x.to("npu:0") call.

On the other hand, if switch_device_before is set to False, both x_0 and x_1 go to the default npu:0, ignoring the x_1 = x.to("npu:1") statement.

In CUDA, x_0 = x.to("cuda:0") and x_1 = x.to("cuda:1") go to two different GPUs, and x_0 + x_1 raises RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!, which is the correct and expected behavior.

Once the x_0 and x_1 tensors have been initialized on npu:1, I cannot switch back to npu:0. Setting switch_device_after to True, which calls torch.npu.set_device one more time, makes the host-device transfer x.to("npu:0") raise a RuntimeError.

In CUDA, it is fine to call torch.cuda.set_device() multiple times to switch the default device.
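
Since torch.npu.set_device() only takes effect at the start of the process, one possible workaround (my own suggestion, not an official recommendation) is to dedicate a separate process to each NPU and call set_device exactly once per process, roughly like this:

import torch
import torch.multiprocessing as mp

def worker(device_id):
    # each worker touches exactly one NPU and calls set_device only once
    import torch_npu  # noqa: F401  # registers the "npu" device type
    torch.npu.set_device(f"npu:{device_id}")
    x = torch.ones(3).to(f"npu:{device_id}")
    print(x.device)

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # avoid inheriting NPU state via fork
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()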

4. Log output:

[W OperatorEntry.cpp:121] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_has_compatible_shallow_copy_type(Tensor self, Tensor from) -> (bool)
    registered at /usr1/pytorch/build/aten/src/ATen/RegisterSchema.cpp:20
  dispatch key: Math
  previous kernel: registered at /usr1/pytorch/build/aten/src/ATen/RegisterMath.cpp:5686
       new kernel: registered at /usr1/workspace/FPTA_Daily_Plugin_open_v1.8.1-3.0.tr6/Plugin/torch_npu/csrc/aten/ops/HasCompatibleShallowCopyType.cpp:37 (function registerKernel)
1.8.0a0+56b43f4 1.8.1
True
2
0
1
device(type='npu', index=1)
device(type='npu', index=1)
device(type='npu', index=1)
0
Traceback (most recent call last):
  File "set_device_bug.py", line 30, in <module>
    x.to("npu:0")  # RuntimeError
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/utils/device_guard.py", line 35, in wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/utils/tensor_methods.py", line 87, in _to
    return torch_npu._C.to(self, *args, **kwargs)
RuntimeError
terminate called after throwing an instance of 'c10::Error'
  what():  0 INTERNAL ASSERT FAILED at "/usr1/workspace/FPTA_Daily_Plugin_open_v1.8.1-3.0.tr6/Plugin/torch_npu/csrc/core/npu/NPUStream.cpp":142, please report a bug to PyTorch. Could not compute stream ID for 0xffff0315d960 on device  (something has gone horribly wrong!)
Exception raised from NPUStream_getStreamId at /usr1/workspace/FPTA_Daily_Plugin_open_v1.8.1-3.0.tr6/Plugin/torch_npu/csrc/core/npu/NPUStream.cpp:142 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x8c (0xffffa9c5ed0c in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x802788 (0xffff02d92788 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #2: c10_npu::getCurrentNPUStream(signed char) + 0x7c (0xffff02d97464 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x816304 (0xffff02da6304 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: c10_npu::NpuSysCtrl::Finalize() + 0xf8 (0xffff02da6b98 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: THPModule_npu_shutdown(_object*) + 0xd8 (0xffff031753b0 in /home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/site-packages/torch_npu/_C.cpython-37m-aarch64-linux-gnu.so)
<omitting python frames>
frame #16: __libc_start_main + 0xe0 (0xffffabbe7b20 in /lib64/libc.so.6)

Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 268, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/ma-user/anaconda3/envs/PyTorch-1.8.1/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))

Comments (5)

佳威 created the Bug-Report 2 years ago

Please add labels; you can also visit https://gitee.com/ascend/community/blob/master/labels.md to find more.
To get your issue reviewed as quickly as possible, please add labels to it; labeled issues can be pushed directly to the responsible person for review.
More labels are listed at https://gitee.com/ascend/community/blob/master/labels.md
Taking model-training code as an example, if you are submitting model-training code you can comment like this:
//train/model
You can also tag this issue with a type, for example a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands; go ahead and add them in a comment below!

佳威 modified the description 2 years ago
佳威 modified the description 2 years ago

The current NPU does not support the preceding behavior.

Is there a workaround or a fix?

Destiny changed the task status from TODO to Analysing 2 years ago

This is supported on torch_npu branches 2.1.0 and later, starting with version 6.0.rc1.
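
Based on that reply, the original reproduction script should behave like CUDA on newer releases. An untested sketch of the expected behavior (the exact version pairing is my assumption from the comment above):

import torch
import torch_npu  # torch_npu >= 2.1.0 with CANN >= 6.0.rc1 (untested assumption)

x = torch.ones(3)
x_0 = x.to("npu:0")
x_1 = x.to("npu:1")
print(x_0.device, x_1.device)  # expected: npu:0 npu:1

torch.npu.set_device("npu:1")  # repeated set_device calls should no longer raise
torch.npu.set_device("npu:0")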

huangyunlong changed the task status from Analysing to DONE 10 months ago
