1. Problem (error log context attached):
An error is raised when running bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh. The launcher starts 8 local ranks and every rank fails with the same NPU initialization error; warnings and tracebacks that repeat verbatim per rank are shown only once below.
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/dynamo/__init__.py:18: UserWarning: Register eager implementation for the 'npu' backend of dynamo, as torch_npu was not compiled with torchair.
warnings.warn(
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/contrib/transfer_to_npu.py:162: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/contrib/transfer_to_npu.py:124: RuntimeWarning: torch.jit.script will be disabled by transfer_to_npu, which currently does not support it.
warnings.warn(msg, RuntimeWarning)
Traceback (most recent call last):
  File "pretrain_baichuan.py", line 25, in <module>
    import deepspeed_npu
  File "/root/baichuan2/deepspeed_npu/deepspeed_npu/__init__.py", line 6, in <module>
    FLAG_SUPPORT_INF_NAN = hasattr(torch_npu.npu.utils, 'is_support_inf_nan') and torch_npu.npu.utils.is_support_inf_nan()
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/npu/utils.py", line 313, in is_support_inf_nan
    torch_npu.npu._lazy_init()
  File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch_npu/npu/__init__.py", line 203, in _lazy_init
    torch_npu._C._npu_init()
RuntimeError: Initialize:torch_npu/csrc/core/npu/sys_ctrl/npu_sys_ctrl.cpp:120 NPU error, error code is 507008
[Error]: Failed to obtain the SOC version.
        Rectify the fault based on the error information in the ascend log.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        [Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4541]
        The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
[2024-02-22 14:35:53,097] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 205527) of binary: /root/.local/conda/envs/baichuan2/bin/python
Traceback (most recent call last):
File "/root/.local/conda/envs/baichuan2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/.local/conda/envs/baichuan2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/.local/conda/envs/baichuan2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretrain_baichuan.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 205528)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 205529)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 205530)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 205531)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 205532)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 205533)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 205534)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-22_14:35:53
host : dl-231116164921eba-pod-jupyter-b8f66cdd9-knmld
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 205527)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
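For reference, the failure can probably be reproduced without the training script by a minimal device check. The sketch below assumes nothing beyond torch_npu's standard CUDA-like device API:

    # minimal_npu_check.py -- isolate the NPU init failure from the training stack
    import torch
    import torch_npu  # registers the "npu" device; import after torch

    print("torch:", torch.__version__)
    print("npu available:", torch_npu.npu.is_available())
    print("npu device count:", torch_npu.npu.device_count())

    # Creating a tensor on the device forces the same runtime initialization
    # (_npu_init) that fails above with error code 507008.
    x = torch.ones(2, 2, device="npu:0")
    print("tensor created on", x.device)

If this small script raises the same RuntimeError, the problem lies in the CANN/driver environment rather than in pretrain_baichuan.py or deepspeed_npu.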
2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): Ascend-cann-toolkit_7.0.0_linux-x86_64
-- TensorFlow/PyTorch/MindSpore version: torch 2.1.0, torch-npu 2.1.0
-- Python version (e.g., Python 3.7.5): 3.8.18
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): Ubuntu 18.04
3. Test steps:
The container originally shipped with CANN 6.3.RC1; CANN 7.0.0 was then installed inside the container.
The environment was set up following the baichuan2-13b instructions in ModelZoo.
The error above is raised when the training script is executed.
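Because CANN was upgraded in place from 6.3.RC1 to 7.0.0, it may be worth confirming that the training process actually picks up the new toolkit rather than leftover 6.3.RC1 paths. The sketch below only prints the Ascend-related environment; the variable names are the ones the toolkit's set_env.sh usually exports and may differ between CANN releases:

    # cann_env_check.py -- show which toolkit paths the Python process sees
    import os

    for var in ("ASCEND_HOME_PATH", "ASCEND_TOOLKIT_HOME", "LD_LIBRARY_PATH", "PYTHONPATH"):
        value = os.environ.get(var, "<unset>")
        print(f"{var}:")
        for part in value.split(os.pathsep):
            # flag anything that still points at the old 6.3.RC1 install
            marker = "   <-- old toolkit?" if "6.3" in part else ""
            print(f"    {part}{marker}")

It may also be worth checking that the host NPU driver/firmware version is one that supports CANN 7.0.0; a driver left over from the 6.3.RC1 image can cause device-initialization failures like this one after an in-container toolkit upgrade.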