
Ascend/pytorch

How can NPU tensors be shared between processes?

WIP
Feature request
Created on 2023-11-01 16:44

My requirement: process A computes and obtains a tensor a on the NPU; process B then performs some simple post-processing on tensor a and saves it, while A keeps computing during this time so the two run in parallel.

However, the current version does not seem to support torch.TypedStorage.share_memory_. Is there any other way to achieve this?

I also tested torch.multiprocessing.Queue, which raises AttributeError: 'NpuStorage' object has no attribute 'is_cuda'. This looks like the same underlying issue, so the Queue does not work around it either (a sketch of the test is below).
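Roughly, the Queue test looked like the following (a minimal sketch reconstructed after the fact; the exact script may differ, and out.pt is just a placeholder path):

import torch
import torch_npu
from torch import multiprocessing as mp

def consumer(q):
    # Process B: receive tensor a, post-process it and save the result.
    t = q.get()
    torch.save(t + 1, "out.pt")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    a = torch.tensor([.3, .4, 1.2]).to("npu:0")   # process A's result on the NPU
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    # Passing the NPU tensor through the queue is where the AttributeError appears.
    q.put(a)
    p.join()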

Versions: Python 3.7; PyTorch 1.11; torch_npu 1.11.0.post1

Comments (3)

Nanillll created the issue 2 years ago
Nanillll edited the description 2 years ago

For example, running the following code hangs, regardless of whether a.share_memory_() is set:

import os

import torch
import torch_npu  # needed for its side effect of registering the NPU backend
from torch import multiprocessing

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:0")
# device = torch.device("cpu")

def test(a):
    # Child process: modify the tensor in place and print it.
    a += 1
    print(a)

if __name__ == "__main__":
    a = torch.tensor([.3, .4, 1.2]).to(device)
    # a.share_memory_()  # hangs with or without this line
    p = multiprocessing.Process(target=test, args=(a,))
    p.start()
    p.join()
    print(a)

In addition, I also tested multithreading:

import os
from threading import Thread

import torch
import torch_npu  # needed for its side effect of registering the NPU backend

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:0")
# device = torch.device("cpu")

a = torch.tensor([.3, .4, 1.2]).to(device)

def test(a):
    # Worker thread: modify the tensor in place and print it.
    a += 1
    print(a)

if __name__ == "__main__":
    p = Thread(target=test, args=(a,))
    p.start()
    p.join()
    print(a)

It fails with the following error:

RuntimeError: current_device:/usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUFunctions.h:38 NPU error, error code is 107002
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4239]
        The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

The runtime log shows:

[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.807.999 [api_impl.cc:4239]1834293 GetDevErrMsg:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.012 [api_impl.cc:4239]1834293 GetDevErrMsg:ctx is NULL!
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.057 [api_impl.cc:4295]1834293 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7070001.
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.061 [logger.cc:1557]1834293 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.084 [api_c.cc:3843]1834293 rtGetDevMsg:ErrCode=107002, desc=[context pointer null], InnerCode=0x7070001
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.088 [error_message_manage.cc:48]1834293 FuncErrorReason:report error module_name=EE1001
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.096 [error_message_manage.cc:48]1834293 FuncErrorReason:rtGetDevMsg execute failed, reason=[context pointer null]

Is there something wrong with my setup, or is sharing NPU tensors across threads simply not supported at the moment?

This is not supported yet.
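For anyone hitting the same limitation: one possible interim workaround (an unofficial sketch of my own, not a supported torch_npu feature) is to stage the result through CPU shared memory. Process A copies the NPU tensor to a CPU tensor placed in shared memory, and process B post-processes and saves that copy, so A can continue NPU work in the meantime.

# Unofficial workaround sketch: stage the NPU result through CPU shared memory.
# Assumes the device-to-host copy is acceptable; "out.pt" is a placeholder path.
import torch
import torch_npu
from torch import multiprocessing as mp

def consumer(shared_cpu):
    # Process B: simple post-processing on the shared CPU copy, then save.
    torch.save(shared_cpu + 1, "out.pt")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    a = torch.tensor([.3, .4, 1.2]).to("npu:0")   # process A's result on the NPU
    shared_cpu = a.cpu().share_memory_()          # sharing CPU storage is supported
    p = mp.Process(target=consumer, args=(shared_cpu,))
    p.start()
    # Process A can keep computing on the NPU here while B handles the copy.
    p.join()

The device-to-host copy still happens in process A, but the post-processing and the save run in process B.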

Destiny changed the task status from TODO to WIP 2 years ago
