
Ascend/pytorch


PyTorch baichuan2-13b training fails with [Error]: Failed to obtain the SOC version.

DONE
Training issue
Created on 2024-02-22 14:52

1. Problem description (error log context attached):
Running bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh fails with the following error:

/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/dynamo/__init__.py:18: UserWarning: Register eager implementation for the 'npu' backend of dynamo, as torch_npu was not compiled with torchair.
  warnings.warn(
[the same UserWarning is printed by each of the 8 worker processes; duplicate copies omitted]
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/contrib/transfer_to_npu.py:162: ImportWarning: 
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************
    
  warnings.warn(msg, ImportWarning)
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/contrib/transfer_to_npu.py:124: RuntimeWarning: torch.jit.script will be disabled by transfer_to_npu, which currently does not support it.
  warnings.warn(msg, RuntimeWarning)
[the same RuntimeWarning is printed by each of the 8 worker processes; duplicate copies omitted]
Traceback (most recent call last):
  File "pretrain_baichuan.py", line 25, in <module>
    import deepspeed_npu
  File "/root/baichuan2/deepspeed_npu/deepspeed_npu/__init__.py", line 6, in <module>
    FLAG_SUPPORT_INF_NAN = hasattr(torch_npu.npu.utils, 'is_support_inf_nan') and torch_npu.npu.utils.is_support_inf_nan()
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/npu/utils.py", line 313, in is_support_inf_nan
    torch_npu.npu._lazy_init()
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/npu/__init__.py", line 203, in _lazy_init
    torch_npu._C._npu_init()
RuntimeError: Initialize:torch_npu/csrc/core/npu/sys_ctrl/npu_sys_ctrl.cpp:120 NPU error, error code is 507008
[Error]: Failed to obtain the SOC version.
        Rectify the fault based on the error information in the ascend log.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        [Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4541]
        The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

[all 8 worker processes raise this identical traceback; the interleaved duplicate copies are omitted]

[2024-02-22 14:35:53,097] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 205527) of binary: /root/.local/conda/envs/baichuan2/bin/python
Traceback (most recent call last):
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_baichuan.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 205528)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 205529)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 205530)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 205531)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 205532)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 205533)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 205534)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-22_14:35:53
  host      : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 205527)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
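For readability, the statement that fails during `import deepspeed_npu` is the package-level probe below, reconstructed from the traceback above (not copied from the actual deepspeed_npu source file). It lazily initializes the NPU runtime at import time, which is where error code 507008 surfaces:

# Reconstructed from the traceback above; not the actual deepspeed_npu source.
import torch_npu

# Probing is_support_inf_nan() calls torch_npu.npu._lazy_init() -> torch_npu._C._npu_init(),
# so a runtime that cannot be initialized (error 507008) already fails here, at
# "import deepspeed_npu", before any training code runs.
FLAG_SUPPORT_INF_NAN = (
    hasattr(torch_npu.npu.utils, 'is_support_inf_nan')
    and torch_npu.npu.utils.is_support_inf_nan()
)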

2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): Ascend-cann-toolkit_7.0.0_linux-x86_64
-- TensorFlow/PyTorch/MindSpore version: torch 2.1.0, torch-npu 2.1.0
-- Python version (e.g., Python 3.7.5): 3.8.18
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): Ubuntu 18.04

3. Test steps:
The container originally had CANN 6.3RC1; CANN 7.0.0 was installed inside the container.
The environment was set up following the baichuan2-13b instructions in ModelZoo.
Running the training script then produces the error above.
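Before relaunching the full distributed job, a quick standalone check that torch_npu can initialize an NPU in this container (independent of DeepSpeed/Megatron) can help narrow the problem down. A minimal sketch, assuming the same conda env and CANN environment as the training script; device index 0 is only an example:

# Minimal NPU sanity check (illustrative only).
import torch
import torch_npu

print('torch:', torch.__version__, 'torch_npu:', torch_npu.__version__)
print('npu available:', torch_npu.npu.is_available())
print('npu count:', torch_npu.npu.device_count())

torch_npu.npu.set_device(0)      # goes through the same _lazy_init path as in the traceback
x = torch.ones(2, 2).npu()       # a simple tensor op to confirm the runtime actually works
print(x + x)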

Comments (3)

yinziqi created this training issue 1 year ago

Could you confirm whether this is a ModelZoo model?

Yes, it is the ModelZoo model.

Have you executed anything like torch.npu.set_device('npu:5')?
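For reference, the kind of explicit per-rank device selection this question refers to would look roughly like the sketch below, using the LOCAL_RANK environment variable that the launcher provides (as noted in the FutureWarning at the top of the log); whether pretrain_baichuan.py already does this is not shown in this issue:

import os
import torch_npu

# Select this worker's NPU from the launcher-provided LOCAL_RANK before anything
# (such as "import deepspeed_npu") triggers lazy NPU initialization.
local_rank = int(os.environ.get('LOCAL_RANK', '0'))
torch_npu.npu.set_device(f'npu:{local_rank}')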

huangyunlong changed the task status from TODO to ACCEPTED 1 year ago
huangyunlong changed the task status from ACCEPTED to Analysing 1 year ago
huangyunlong changed the task status from Analysing to DONE 9 months ago
