Ascend/pytorch

setDevice fails on an Elastic Cloud Server

DONE
Requirement
Created on 2023-03-10 23:53

Cloud server flavor: AI-accelerated | kai1s.4xlarge.2 | 16 vCPUs | 32 GiB
CANN6.0.RC1
AscendPyTorch 3.0.0

Running the following code raises an error:

import torch
import torch_npu  # makes the torch.npu namespace available
device = "npu:0"
torch.npu.set_device(device)

The error is as follows:

  File "/usr/local/whisper-main/whisper/__init__.py", line 94, in load_model
    torch.npu.set_device(device)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/npu/utils.py", line 149, in set_device
    torch_npu._C._npu_setDevice(torch.device(device).index)
RuntimeError: Initialize:/usr1/workspace/FPTA_Daily_Plugin_open_date/Plugin/torch_npu/csrc/core/npu/sys_ctrl/npu_sys_ctrl.cpp:92 NPU error, error code is 507033
E39999: Inner Error!
E39999  TsdOpen failed. devId=0, tdt error=34[FUNC:startAicpuExecutor][FILE:runtime.cc][LINE:1610]
        Start aicpu executor failed, retCode=0x7020009 devId=0[FUNC:DeviceRetain][FILE:runtime.cc][LINE:2056]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:1909]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:1936]
        Check param failed, context can not be null.[FUNC:NewDevice][FILE:api_impl.cc][LINE:1284]
        new device failed, retCode=0x7010006[FUNC:SetDevice][FILE:api_impl.cc][LINE:1305]
        rtSetDevice execute failed, reason=[device retain error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
        open device 0 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:2096]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:3429]
        rtGetDevMsg execute failed, reason=[context pointer null][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
        Solution: Please contact support engineer.

Comments (4)

苗泽远 created the requirement 2 years ago

Please add labels to this issue. The full list is at https://gitee.com/ascend/community/blob/master/labels.md.
To get the code reviewed sooner, label the issue. A labeled issue is pushed directly to the person responsible for review.
For example, if you are submitting model training code, you can comment:
//train/model
You can also mark the issue type, for example as a bug fix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands. Go ahead and add one in the comments below!

Logs
[EVENT] PROFILING(3591,python3.7):2023-03-10-23:27:55.907.135 [msprof_callback_impl.cpp:199] >>> (tid:3591) Started to register profiling ctrl callback.
[EVENT] PROFILING(3591,python3.7):2023-03-10-23:27:56.491.407 [msprof_callback_impl.cpp:78] >>> (tid:3591) MsprofCtrlCallback called, type: 255
[EVENT] PROFILING(3591,python3.7):2023-03-10-23:27:56.491.451 [prof_acl_mgr.cpp:1190] >>> (tid:3591) Init profiling for dynamic profiling
[ERROR] TDT(3591,python3.7):2023-03-10-23:27:56.492.233 [log.cpp:38]deviceId: 0 drvHdcSessionConnect failed, ret = 4,[hdc_client.cpp:193:CreateHdcSession]3591
[ERROR] TDT(3591,python3.7):2023-03-10-23:27:56.492.470 [log.cpp:38][TsdClient][deviceId=0]CreateSession for TSD failed in Open function,[process_mode_manager.cpp:318:InitTsdClient]3591
[ERROR] TDT(3591,python3.7):2023-03-10-23:27:56.492.556 [log.cpp:38]Send aicpu package to device failed.,[process_mode_manager.cpp:75:Open]3591
[ERROR] TDT(3591,python3.7):2023-03-10-23:27:56.492.629 [log.cpp:38]TsdOpen failed, deviceId[0].,[tsd_client.cpp:27:TsdOpen]3591
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.648 [runtime.cc:1610]3591 startAicpuExecutor:report error module_type=0, module_name=E39999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.652 [runtime.cc:1610]3591 startAicpuExecutor:TsdOpen failed. devId=0, tdt error=34
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.691 [runtime.cc:2056]3591 DeviceRetain:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.695 [runtime.cc:2056]3591 DeviceRetain:Start aicpu executor failed, retCode=0x7020009 devId=0
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.704 [runtime.cc:1909]3591 PrimaryContextRetain:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.706 [runtime.cc:1909]3591 PrimaryContextRetain:Check param failed, dev can not be NULL!
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.714 [runtime.cc:1936]3591 PrimaryContextRetain:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.717 [runtime.cc:1936]3591 PrimaryContextRetain:Check param failed, ctx can not be NULL!
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.723 [api_impl.cc:1284]3591 NewDevice:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.726 [api_impl.cc:1284]3591 NewDevice:Check param failed, context can not be null.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.734 [api_impl.cc:1305]3591 SetDevice:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.737 [api_impl.cc:1305]3591 SetDevice:new device failed, retCode=0x7010006
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.743 [logger.cc:607]3591 SetDevice:Set device failed, device_id=0, deviceMode=0.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.762 [api_c.cc:1331]3591 rtSetDevice:ErrCode=507033, desc=[device retain error], InnerCode=0x7010006
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.765 [error_message_manage.cc:49]3591 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.771 [error_message_manage.cc:49]3591 FuncErrorReason:rtSetDevice execute failed, reason=[device retain error]
[ERROR] ASCENDCL(3591,python3.7):2023-03-10-23:27:56.492.782 [device.cpp:68]3591 aclrtSetDevice: open device 0 failed, runtime result = 507033.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.797 [runtime.cc:2096]3591 DeviceRetain:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.800 [runtime.cc:2096]3591 DeviceRetain:Check param failed, dev can not be NULL!
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.809 [runtime.cc:1909]3591 PrimaryContextRetain:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.811 [runtime.cc:1909]3591 PrimaryContextRetain:Check param failed, dev can not be NULL!
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.817 [runtime.cc:1936]3591 PrimaryContextRetain:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.819 [runtime.cc:1936]3591 PrimaryContextRetain:Check param failed, ctx can not be NULL!
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.825 [api_impl.cc:1284]3591 NewDevice:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.827 [api_impl.cc:1284]3591 NewDevice:Check param failed, context can not be null.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.832 [api_impl.cc:1305]3591 SetDevice:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.835 [api_impl.cc:1305]3591 SetDevice:new device failed, retCode=0x7010006
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.840 [logger.cc:607]3591 SetDevice:Set device failed, device_id=0, deviceMode=0.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.846 [api_c.cc:1331]3591 rtSetDevice:ErrCode=507033, desc=[device retain error], InnerCode=0x7010006
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.848 [error_message_manage.cc:49]3591 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.492.852 [error_message_manage.cc:49]3591 FuncErrorReason:rtSetDevice execute failed, reason=[device retain error]
[ERROR] ASCENDCL(3591,python3.7):2023-03-10-23:27:56.492.860 [device.cpp:68]3591 aclrtSetDevice: open device 0 failed, runtime result = 507033.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.121 [api_impl.cc:3429]3591 GetDevErrMsg:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.125 [api_impl.cc:3429]3591 GetDevErrMsg:ctx is NULL!
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.133 [api_impl.cc:3485]3591 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7070001.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.136 [logger.cc:1335]3591 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.142 [api_c.cc:3441]3591 rtGetDevMsg:ErrCode=107002, desc=[context pointer null], InnerCode=0x7070001
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.145 [error_message_manage.cc:49]3591 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(3591,python3.7):2023-03-10-23:27:56.493.149 [error_message_manage.cc:49]3591 FuncErrorReason:rtGetDevMsg execute failed, reason=[context pointer null]
[EVENT] IDEDH(3591,python3.7):2023-03-10-23:27:56.607.687 [adx_server_manager.cpp:27][tid:3591]>>> start to deconstruct adx server manager
[EVENT] IDEDH(3591,python3.7):2023-03-10-23:27:56.611.704 [adx_server_manager.cpp:27][tid:3591]>>> start to deconstruct adx server manager
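
Most of the ERROR lines above are follow-on failures. The earliest ERROR entry (here the TDT `drvHdcSessionConnect failed` line) comes before all the RUNTIME errors, so it is usually the most useful place to start reading. The following is a minimal sketch for pulling that first entry out of a log, assuming the line layout seen in this issue (`[LEVEL] MODULE(pid,process):timestamp message`). The names `LOG_LINE`, `LogRecord`, `parse_line`, and `first_error` are hypothetical helpers for illustration and are not part of CANN or torch_npu.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Layout assumed from the log in this issue:
#   [LEVEL] MODULE(pid,process):timestamp message
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<module>\w+)"
    r"\((?P<pid>\d+),(?P<process>[^)]+)\):"
    r"(?P<timestamp>\S+)\s+(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    module: str
    timestamp: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None if it does not match the assumed layout."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogRecord(m["level"], m["module"], m["timestamp"], m["message"])


def first_error(lines: Iterable[str]) -> Optional[LogRecord]:
    """Return the first record whose level is ERROR, or None if there is none."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record.level == "ERROR":
            return record
    return None
```

Calling `first_error(open(path))` on a saved log file (where `path` is wherever the log was written) returns the earliest ERROR record, whose `module` and `message` fields show which component failed first.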

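Check whether the card is in a normal state. `ctx is NULL` is generally reported for one of two reasons:
1. The card is not in a normal state.
2. The card number was set again while the process was already running. Switching cards at runtime is currently not supported.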
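
The two causes above can be guarded against in application code. The sketch below rests on several assumptions: that the Ascend driver tool `npu-smi` is installed and on `PATH` (the reply does not name a specific tool), and that the device setter is `torch.npu.set_device` as in the original report. `card_status_report` and `DeviceBinder` are hypothetical helper names, not torch_npu APIs.

```python
import subprocess
from typing import Callable, Optional, Sequence


def card_status_report(cmd: Sequence[str] = ("npu-smi", "info")) -> Optional[str]:
    """Return the status tool's output, or None if it is missing or fails.

    Assumption: `npu-smi info` is available on the host. Reading the
    report for card health is left to the operator.
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


class DeviceBinder:
    """Bind the process to one device and refuse any later switch."""

    def __init__(self, set_device: Callable[[str], None]) -> None:
        self._set_device = set_device
        self._bound: Optional[str] = None

    def bind(self, device: str) -> None:
        if self._bound is None:
            # First call: actually select the device.
            self._set_device(device)
            self._bound = device
        elif self._bound != device:
            # Switching cards in a running process is reported as unsupported.
            raise RuntimeError(
                f"already bound to {self._bound}; cannot switch to {device}"
            )
        # Same device requested again: nothing to do.
```

In a real program the binder would be created once at startup, for example `binder = DeviceBinder(torch.npu.set_device)`, and every code path that needs the device would call `binder.bind("npu:0")` instead of calling `set_device` directly.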

郭夏 changed the task status from TODO to DONE 2 years ago

Participants (3): ascend-robot, 苗泽远 (miao_zi_yuan), 郭夏 (petissue)