
Ascend/pytorch

Multi-card training EJ0001: HCCL error

DONE
Training issue
Created on 2024-05-22 15:30

1. Problem description (error log context attached):
Offline D910b4 machine (Ascend 910B), training inside Docker.
Single-card, single-process training runs normally:
ASCEND_RT_VISIBLE_DEVICES=0 RANK=0 LOCAL_RANK=0 WORLD_SIZE=1 python torchrun_main.py --opt configs/options/segment_mt_npu.yml
But as soon as multiple cards are used, it fails (a quick device-occupancy check is sketched after the error log below):
ASCEND_RT_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml
Error message:

Traceback (most recent call last):
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 53, in <module>
    main(project_path)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 41, in main
    build_train(opt)(None, project_path)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/core_code/core/segment_train.py", line 101, in train_segment
    model = build_model(opt)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/__init__.py", line 27, in build_model
    model = MODEL_REGISTRY.get(opt['model_type'])(opt)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/models/seg_model.py", line 107, in __init__
    self.net_g = self.model_to_device(self.net_g)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/base_model.py", line 153, in model_to_device
    net = DistributedDataParallel(
  File "/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 67, in decorated
    return fn(*args, **kwargs)
  File "/usr/local/python3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:81
[ERROR] 2000-02-20-12:13:33 (PID:2887, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:283]

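Since single-card training on device 0 works but the two-card run on devices 6 and 7 fails with EJ0001 ("Maybe the last training process is running"), a useful first check is whether devices 6 and 7 are still held by an earlier run before launching. A minimal sketch, assuming npu-smi from the mounted host driver is on PATH inside the container (see the Docker command in section 3):

# Show device health/utilization; recent driver versions also list the occupying processes
npu-smi info
# Look for trainer processes left over from a previous launch
ps -ef | grep -E 'torchrun_main|torch.distributed' | grep -v grep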

2. Software versions:
Host driver: Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run
Firmware: Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run

Version=23.0.3
ascendhal_version=7.35.19
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C15SPC005B220
compatible_version=[V100R001C30],[V100R001C13],[V100R001C15]
compatible_version_fw=[7.0.0,7.1.99]
package_version=23.0.3

-- CANN version: 7.2.T7.0.B121:8.0.T5
-- PyTorch version: 1.11.0
-- torch-npu: tried both 1.11.0.post11 and 1.11.0.post12, same issue
-- Python version: 3.9.18
-- MindStudio version: not applicable
-- OS version: eulerosv2r10 (Linux 26e28b339282 4.19.90-vhulk2209.2.0.h1327.eulerosv2r10.aarch64)

3. Test steps:
Docker start command:

docker run -itd --privileged --pid=host --cap-add=SYS_PTRACE \
  --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
  --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
  --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /sbin/dmidecode:/sbin/dmidecode -v /dev/mem:/dev/mem -v /usr/bin/hostname:/usr/bin/hostname \
  -v /usr/bin/hccn_tool:/usr/bin/hccn_tool -v /opt/mnt1/:/opt/mnt1/ \
  -p 3006:22 --name test_env npu_train:v0.1 /bin/bash
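Before re-running, it may also help to confirm from inside this container that the device nodes and the mounted driver are actually visible; a small sketch, with paths taken from the --device/-v mappings above:

# Run inside the container started above
ls -l /dev/davinci* /dev/davinci_manager /dev/devmm_svm /dev/hisi_hdc
ls /usr/local/Ascend/driver
# The driver version file, if present, should report Version=23.0.3 (see section 2)
cat /usr/local/Ascend/driver/version.info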

4. Log information:
See the attachment for details.

Comments (2)

sankarea created this training issue 1 year ago

Full error message

ASCEND_RT_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml


/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/usr/local/python3/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/python3/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/usr/local/python3/lib/python3.9/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
/usr/local/python3/lib/python3.9/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py:163: ImportWarning:
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************

  warnings.warn(msg, ImportWarning)
2000-02-20 12:13:18,907 INFO:
        666666

Version Information:
        PyTorch: 1.11.0
        TorchVision: 0.15.0

/usr/local/python3/lib/python3.9/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /root/pytorch/aten/src/ATen/native/TensorShape.cpp:2227.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/python3/lib/python3.9/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /root/pytorch/aten/src/ATen/native/TensorShape.cpp:2227.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2000-02-20 12:13:30,241 INFO: Network [SwinUNetOneOut] is created.
2000-02-20 12:13:30,241 INFO: Loading SwinUNetOneOut model from ../pretrained_models/seg_mt_SwinUNetOneOut_iter0_random_init.pth.
Traceback (most recent call last):
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 53, in <module>
    main(project_path)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 41, in main
    build_train(opt)(None, project_path)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/core_code/core/segment_train.py", line 101, in train_segment
    model = build_model(opt)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/__init__.py", line 27, in build_model
    model = MODEL_REGISTRY.get(opt['model_type'])(opt)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/models/seg_model.py", line 107, in __init__
    self.net_g = self.model_to_device(self.net_g)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/base_model.py", line 153, in model_to_device
    net = DistributedDataParallel(
  File "/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 67, in decorated
    return fn(*args, **kwargs)
  File "/usr/local/python3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:81
[ERROR] 2000-02-20-12:13:33 (PID:2887, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:283]

Traceback (most recent call last):
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 53, in <module>
    main(project_path)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 41, in main
    build_train(opt)(None, project_path)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/core_code/core/segment_train.py", line 101, in train_segment
    model = build_model(opt)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/__init__.py", line 27, in build_model
    model = MODEL_REGISTRY.get(opt['model_type'])(opt)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/models/seg_model.py", line 107, in __init__
    self.net_g = self.model_to_device(self.net_g)
  File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/base_model.py", line 153, in model_to_device
    net = DistributedDataParallel(
  File "/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 67, in decorated
    return fn(*args, **kwargs)
  File "/usr/local/python3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:81
[ERROR] 2000-02-20-12:13:33 (PID:2888, Device:1, RankID:1) ERR02200 DIST call hccl api failed.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:283]
        Fail to get sq reg virtual addr, deviceId=7, sqId=4.[FUNC:Setup][FILE:stream.cc][LINE:1102]
        stream setup failed, retCode=0x7020010.[FUNC:SyncGetDevMsg][FILE:api_impl.cc][LINE:4608]
        Sync get device msg failed, retCode=0x7020010.[FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4669]
        rtGetDevMsg execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2887 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 2888) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/python3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
torchrun_main.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2000-02-20_12:13:40
  host      : d16eae4371b2
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 2888)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 2888
======================================================
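As an aside, the FutureWarning at the top of this log says torch.distributed.launch is deprecated in favor of torchrun. Switching launchers is not expected to fix the HCCL error by itself, but the equivalent command would look roughly like this (torchrun sets --use_env by default, so the script has to read the local rank from os.environ['LOCAL_RANK'], as the warning notes):

ASCEND_RT_VISIBLE_DEVICES=6,7 torchrun --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml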
sankarea edited the description 1 year ago

Try killing all the processes cleanly, waiting for a while, and then running again.
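A minimal sketch of that cleanup, following the solution text in the EJ0001 message itself (kill any leftover trainer, wait more than 10 s, then relaunch); the process names are taken from the commands above:

# Kill any training processes left over from the previous run
pkill -9 -f torchrun_main.py
pkill -9 -f torch.distributed.launch
# EJ0001 suggests waiting at least 10 s after killing before relaunching
sleep 15
# Confirm the NPUs are idle again, then relaunch the two-card job
npu-smi info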

huangyunlong changed the task status from TODO to DONE 1 year ago
sankarea edited the description 9 months ago
