My requirement: process A produces a tensor a on the NPU, and process B then does some simple post-processing on a and saves it, while A keeps computing, so that the two run in parallel.
However, the current version does not seem to support torch.TypedStorage.share_memory_. Is there any other way to achieve this?
I also tested torch.multiprocessing.Queue, which fails with AttributeError: 'NpuStorage' object has no attribute 'is_cuda'. That looks like the same root cause, so the Queue does not get around the problem either.
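The Queue test was roughly the following (a simplified reconstruction rather than my exact script); passing the NPU tensor through the queue is what raises the AttributeError:

import os
import torch
from torch import multiprocessing
import torch_npu as npu

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:0")

def consumer(q):
    # process B: receive the tensor and do a trivial update
    a = q.get()
    a += 1
    print(a)

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)  # child gets a clean runtime
    q = multiprocessing.Queue()
    a = torch.tensor([.3, .4, 1.2]).to(device)
    p = multiprocessing.Process(target=consumer, args=(q,))
    p.start()
    q.put(a)   # serializing the NPU tensor for the queue triggers the AttributeError
    p.join()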
Versions: Python 3.7; PyTorch 1.11; torch_npu 1.11.0.post1
For example, the following code hangs, whether or not a.share_memory_() is set:
import torch
from torch import multiprocessing
import os
import torch_npu as npu

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:{}".format(str(0)))
# device = torch.device("cpu")

def test(a):
    # child process: trivial in-place update
    a += 1
    print(a)

if __name__ == "__main__":
    a = torch.tensor([.3, .4, 1.2]).to(device)
    # a.share_memory_()
    p = multiprocessing.Process(target=test, args=(a,))
    p.start()
    p.join()
    print(a)
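For reference, what I am trying to reproduce on the NPU is the ordinary CPU shared-memory pattern from torch.multiprocessing, where the child's in-place update is visible to the parent. A minimal CPU-only sketch:

import torch
from torch import multiprocessing

def test(a):
    a += 1          # in-place update on the shared storage
    print(a)        # tensor([1.3000, 1.4000, 2.2000])

if __name__ == "__main__":
    a = torch.tensor([.3, .4, 1.2])
    a.share_memory_()                # moves the storage into shared memory
    p = multiprocessing.Process(target=test, args=(a,))
    p.start()
    p.join()
    print(a)                         # parent sees the child's update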
I also tested multithreading:
import torch
# from torch import multiprocessing
# import multiprocessing
from threading import Thread
import os
import torch_npu as npu

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'
device = torch.device("npu:{}".format(str(0)))
# device = torch.device("cpu")
a = torch.tensor([.3, .4, 1.2]).to(device)

def test(a):
    # global a
    a += 1
    print(a)

if __name__ == "__main__":
    # p = multiprocessing.Process(target=test, args=())
    p = Thread(target=test, args=(a,))
    p.start()
    p.join()
    print(a)
which raises the following error:
RuntimeError: current_device:/usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-5.0.rc1/CODE/torch_npu/csrc/core/npu/NPUFunctions.h:38 NPU error, error code is 107002
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4239]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
and the device log reports:
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.807.999 [api_impl.cc:4239]1834293 GetDevErrMsg:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.012 [api_impl.cc:4239]1834293 GetDevErrMsg:ctx is NULL!
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.057 [api_impl.cc:4295]1834293 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7070001.
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.061 [logger.cc:1557]1834293 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.084 [api_c.cc:3843]1834293 rtGetDevMsg:ErrCode=107002, desc=[context pointer null], InnerCode=0x7070001
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.088 [error_message_manage.cc:48]1834293 FuncErrorReason:report error module_name=EE1001
[ERROR] RUNTIME(1832805,python):2023-11-02-09:54:49.808.096 [error_message_manage.cc:48]1834293 FuncErrorReason:rtGetDevMsg execute failed, reason=[context pointer null]
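My only guess so far is that the newly created thread has no NPU context bound to it. If torch_npu.npu.set_device behaves the way torch.cuda.set_device does (an assumption on my part, untested), the thread body would have to bind the device first:

def test(a):
    npu.npu.set_device(device)   # assumption: bind the NPU context in this thread before touching the tensor
    a += 1
    print(a)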
Is this some configuration mistake on my side, or is it currently not possible to share NPU variables directly between threads?
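If it is indeed not supported, my fallback plan would be to stage the data through CPU: process A copies the result to host memory and hands the CPU tensor to process B through a queue, at the cost of one device-to-host copy per hand-off. A rough, untested sketch (the saver function and file name are placeholders):

import os
import torch
from torch import multiprocessing
import torch_npu as npu

os.environ["ASCEND_RT_VISIBLE_DEVICES"] = '0,1,2,3'

def saver(q):
    # process B: post-process and save CPU tensors while A keeps computing
    while True:
        a_cpu = q.get()
        if a_cpu is None:        # sentinel: producer has finished
            break
        a_cpu += 1               # the "simple processing" step
        torch.save(a_cpu, "a.pt")

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)
    device = torch.device("npu:0")
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=saver, args=(q,))
    p.start()
    a = torch.tensor([.3, .4, 1.2]).to(device)
    q.put(a.cpu())               # device-to-host copy; CPU tensors pass through the queue fine
    # ... process A keeps computing on the NPU here ...
    q.put(None)
    p.join()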
This is not supported yet.