Ascend / pytorch
Multi-card training error
DONE
#I8NY3R
Training issue
liuhc33 opened this issue on 2023-12-13 10:36
Hi, my Python version is 3.8.10, my CANN version is 7.0RC1, and my torch version is 1.11.0. When training my own model, single-card training works fine, but running multi-card (4-card) training fails with the following:

```
Traceback (most recent call last):
  File "bin/train_ddp.py", line 122, in <module>
    main()
  File "bin/train_ddp.py", line 97, in main
    task.to_distributed(args.nodes, args.node_id, args.gpus, args.gpu)
  File "/home/liuhangchen/Workspace/eteh-v2-release-offline_export_transfDDP_202311/env_train/eteh/tools/interface/pytorch_backend/th_task.py", line 59, in to_distributed
    self.trainer.model = DDP(self.model, device_ids=[
  File "/usr/local/python3.8.10/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: /usr1/01/workspace/j_ctViaEnN/pytorch/torch_npu/csrc/distributed/HCCLUtils.hpp:80.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:270]
Traceback (most recent call last):
  File "bin/train_ddp.py", line 122, in <module>
    main()
  File "bin/train_ddp.py", line 97, in main
    task.to_distributed(args.nodes, args.node_id, args.gpus, args.gpu)
  File "/home/liuhangchen/Workspace/eteh-v2-release-offline_export_transfDDP_202311/env_train/eteh/tools/interface/pytorch_backend/th_task.py", line 59, in to_distributed
    self.trainer.model = DDP(self.model, device_ids=[
  File "/usr/local/python3.8.10/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: /usr1/01/workspace/j_ctViaEnN/pytorch/torch_npu/csrc/distributed/HCCLUtils.hpp:80.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:270]
Traceback (most recent call last):
  File "bin/train_ddp.py", line 122, in <module>
    main()
  File "bin/train_ddp.py", line 97, in main
    task.to_distributed(args.nodes, args.node_id, args.gpus, args.gpu)
  File "/home/liuhangchen/Workspace/eteh-v2-release-offline_export_transfDDP_202311/env_train/eteh/tools/interface/pytorch_backend/th_task.py", line 59, in to_distributed
    self.trainer.model = DDP(self.model, device_ids=[
  File "/usr/local/python3.8.10/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: /usr1/01/workspace/j_ctViaEnN/pytorch/torch_npu/csrc/distributed/HCCLUtils.hpp:80.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:270]
/usr/local/python3.8.10/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmplttd8atn'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/local/python3.8.10/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpiud_t2y7'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/local/python3.8.10/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmph9uzpg9i'>
  _warnings.warn(warn_message, ResourceWarning)
Traceback (most recent call last):
  File "bin/train_ddp.py", line 122, in <module>
    main()
  File "bin/train_ddp.py", line 97, in main
    task.to_distributed(args.nodes, args.node_id, args.gpus, args.gpu)
  File "/home/liuhangchen/Workspace/eteh-v2-release-offline_export_transfDDP_202311/env_train/eteh/tools/interface/pytorch_backend/th_task.py", line 59, in to_distributed
    self.trainer.model = DDP(self.model, device_ids=[
  File "/usr/local/python3.8.10/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: /usr1/01/workspace/j_ctViaEnN/pytorch/torch_npu/csrc/distributed/HCCLUtils.hpp:80.
EI0006: Getting socket times out. Reason: Remote Rank did not send the data in time. Please check the reason for the rank being stuck
        Solution: 1. Check the rank service processes with other errors or no errors in the cluster.
                  2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.
                  3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.) For details: https://www.hiascend.com/document
/usr/local/python3.8.10/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp77wv06bj'>
  _warnings.warn(warn_message, ResourceWarning)
./run.offline.transDDP.sh: line 73: 74074 Segmentation fault      python3 bin/train_ddp.py -train_config ${train_config} -data_config ${data_conf} -train_name HKUST -task_file bin.taskegs.pytorch_backend.task_ctc_att -task_name CtcAttTask -exp_dir ${exp_dir} -num_epochs ${epochs} -seed 100 --split --nodes $num_nodes --node_id $node_rank --gpus $num_gpus --gpu $gpu_id --dist_backend $dist_backend --init_method $init_method
```
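For context, below is a minimal sketch of the single-node multi-NPU DDP setup that the traceback is failing inside, together with the workaround the EI0006 message itself suggests (raising HCCL_CONNECT_TIMEOUT before the process group is created). This is an illustration under assumptions, not the repo's bin/train_ddp.py: the setup_ddp helper is hypothetical, and the RANK/LOCAL_RANK/WORLD_SIZE variables follow the usual PyTorch launcher conventions rather than the script's --nodes/--node_id/--gpus arguments.

```python
# Minimal sketch, assuming torch 1.11 + torch_npu on one node with 4 NPUs,
# launched so that RANK/LOCAL_RANK/WORLD_SIZE are set in the environment.
# setup_ddp is a hypothetical helper, not code from this repository.
import os

import torch
import torch_npu  # registers the NPU device type and the HCCL backend
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model: torch.nn.Module) -> DDP:
    # EI0006 hint: the default HCCL connect timeout is 120s; it can be raised
    # via this environment variable before init_process_group is called.
    os.environ.setdefault("HCCL_CONNECT_TIMEOUT", "1200")

    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Bind this process to a single NPU before any collective work.
    torch.npu.set_device(local_rank)

    # HCCL is the collective backend for Ascend NPUs (the analogue of NCCL).
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

    model = model.npu()
    # This is the call that raises in the traceback: DDP.__init__ verifies the
    # parameters across processes, so a failed HCCL init surfaces here (EJ0001).
    return DDP(model, device_ids=[local_rank])
```

As for EJ0001 itself, the error text points at a previous training process still holding the devices; the remediation it names is to kill any leftover train_ddp.py processes, wait about 10 seconds, and relaunch.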
Comments (9)