
Ascend/pytorch

How can NPU tensors be shared between processes?

WIP
Feature request
Created on 2023-11-01 16:44

My requirement: process A computes and obtains a tensor a on the NPU; process B then performs some simple post-processing on tensor a and saves it, while A keeps computing during this time so the two run in parallel.

However, the current version does not seem to support torch.TypedStorage.share_memory_. Is there any other way to achieve this?

I also tested torch.multiprocessing.Queue, which raises AttributeError: 'NpuStorage' object has no attribute 'is_cuda'. This looks like the same underlying issue, so the Queue does not work around it either (a sketch of the test is below).
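Roughly, the Queue test looked like the following (a minimal sketch reconstructed after the fact; the exact script may differ, and out.pt is just a placeholder path):

import torch
import torch_npu
from torch import multiprocessing as mp

def consumer(q):
    # Process B: receive tensor a, post-process it and save the result.
    t = q.get()
    torch.save(t + 1, "out.pt")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    a = torch.tensor([.3, .4, 1.2]).to("npu:0")   # process A's result on the NPU
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    # Passing the NPU tensor through the queue is where the AttributeError appears.
    q.put(a)
    p.join()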

Versions: Python 3.7; PyTorch 1.11; torch_npu 1.11.0.post1

Comments (3)

Nanillll created the issue 2 years ago
Nanillll edited the description 2 years ago

For example, running the following code hangs, regardless of whether a.share_memory_() is set:

import os

import torch
import torch_npu  # needed for its side effect of registering the NPU backend
from torch import multiprocessing

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:0")
# device = torch.device("cpu")

def test(a):
    # Child process: modify the tensor in place and print it.
    a += 1
    print(a)

if __name__ == "__main__":
    a = torch.tensor([.3, .4, 1.2]).to(device)
    # a.share_memory_()  # hangs with or without this line
    p = multiprocessing.Process(target=test, args=(a,))
    p.start()
    p.join()
    print(a)

In addition, I also tested multithreading:

import os
from threading import Thread

import torch
import torch_npu  # needed for its side effect of registering the NPU backend

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:0")
# device = torch.device("cpu")

a = torch.tensor([.3, .4, 1.2]).to(device)

def test(a):
    # Worker thread: modify the tensor in place and print it.
    a += 1
    print(a)

if __name__ == "__main__":
    p = Thread(target=test, args=(a,))
    p.start()
    p.join()
    print(a)

It fails with the following error:

RuntimeError: current_device:/usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUFunctions.h:38 NPU error, error code is 107002
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4239]
        The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

The runtime log shows:

[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.807.999 [api_impl.cc:4239]1834293 GetDevErrMsg:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.012 [api_impl.cc:4239]1834293 GetDevErrMsg:ctx is NULL!
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.057 [api_impl.cc:4295]1834293 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7070001.
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.061 [logger.cc:1557]1834293 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.084 [api_c.cc:3843]1834293 rtGetDevMsg:ErrCode=107002, desc=[context pointer null], InnerCode=0x7070001
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.088 [error_message_manage.cc:48]1834293 FuncErrorReason:report error module_name=EE1001
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.096 [error_message_manage.cc:48]1834293 FuncErrorReason:rtGetDevMsg execute failed, reason=[context pointer null]

Is there something wrong with my setup, or is sharing NPU tensors across threads simply not supported at the moment?

This is not supported yet.
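For anyone hitting the same limitation: one possible interim workaround (an unofficial sketch of my own, not a supported torch_npu feature) is to stage the result through CPU shared memory. Process A copies the NPU tensor to a CPU tensor placed in shared memory, and process B post-processes and saves that copy, so A can continue NPU work in the meantime.

# Unofficial workaround sketch: stage the NPU result through CPU shared memory.
# Assumes the device-to-host copy is acceptable; "out.pt" is a placeholder path.
import torch
import torch_npu
from torch import multiprocessing as mp

def consumer(shared_cpu):
    # Process B: simple post-processing on the shared CPU copy, then save.
    torch.save(shared_cpu + 1, "out.pt")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    a = torch.tensor([.3, .4, 1.2]).to("npu:0")   # process A's result on the NPU
    shared_cpu = a.cpu().share_memory_()          # sharing CPU storage is supported
    p = mp.Process(target=consumer, args=(shared_cpu,))
    p.start()
    # Process A can keep computing on the NPU here while B handles the copy.
    p.join()

The device-to-host copy still happens in process A, but the post-processing and the save run in process B.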

Destiny changed the task status from TODO to WIP 2 years ago
