
Ascend/pytorch

dist.isend fails when sending data between pipeline-parallel stages (plog reports "hdc init failed")

DONE
Training issue
Created on
2024-07-18 01:16

1. Problem description (with error-log context):
I use torch.distributed.init_process_group to build multiple processes, then dist.isend to send data and dist.irecv to receive it. The code runs perfectly on GPU but fails on Ascend 910B. Training llama2-7b on the 910B works fine: HDC initializes successfully and communication is normal. But switching to mistral-7b or GPT2 triggers this HDC initialization error. With GPT2 it is even stranger: everything works with a pipeline-parallel size of 4, yet the error appears when I set it to 2. All of these models run fine on GPU, so I don't believe the problem is in my code. How does this error arise, and how can I fix it?
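For reference, the stage-to-stage exchange described above follows a pattern like this (a minimal sketch, not the poster's actual code; the helper names `send_forward`/`recv_forward` and the wraparound rank arithmetic are illustrative):

```python
def next_rank(rank: int, world_size: int) -> int:
    # Rank of the downstream pipeline stage, wrapping at the last stage.
    return (rank + 1) % world_size

def prev_rank(rank: int, world_size: int) -> int:
    # Rank of the upstream pipeline stage.
    return (rank - 1) % world_size

def send_forward(output_tensor):
    # Non-blocking send of this stage's activations to the next stage.
    # torch is imported lazily so the rank helpers above need no hardware.
    import torch.distributed as dist
    req = dist.isend(tensor=output_tensor,
                     dst=next_rank(dist.get_rank(), dist.get_world_size()))
    req.wait()

def recv_forward(shape, dtype):
    # Matching non-blocking receive from the previous stage.
    import torch
    import torch.distributed as dist
    buf = torch.empty(shape, dtype=dtype)
    req = dist.irecv(tensor=buf,
                     src=prev_rank(dist.get_rank(), dist.get_world_size()))
    req.wait()
    return buf
```

On Ascend the process group behind these calls is HCCL, which is why the failure surfaces in the HCCP/HCCL plog rather than in the Python layer.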

2. Software versions:
-- CANN version: CANN 8.0.0
-- Tensorflow/Pytorch/MindSpore version: Pytorch 2.1.0
-- Python version: Python 3.8
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version: Euler

3. Test steps:

4. Log information:
The plog output is as follows:

[INFO] TDT(288,python):2024-07-17-17:02:12.104.057 [process_mode_manager.cpp:170][OpenProcess][tid:288] [TsdClient][deviceId=0] [sessionId=1] start hccp and computer process success
[INFO] HCCP(288,python):2024-07-17-17:02:12.104.072 [ra_host.c:332]tid:288,ra_init(332) : Input parameters: phy_id[1], nic_position:[1] hdc_type:[6]
[INFO] HCCP(288,python):2024-07-17-17:02:12.104.092 [ra_hdc.c:1540]tid:288,ra_hdc_init(1540) : hdc init start! logic id is 1, phy id is 1, hdc_type is 6
[ERROR] HCCP(288,python):2024-07-17-17:02:12.104.159 [ra_hdc.c:1550]tid:288,ra_hdc_init(1550) : [init][ra_hdc]hdc session_connect failed ret(35) phy_id(1)
[ERROR] HCCP(288,python):2024-07-17-17:02:12.104.167 [ra_host.c:355]tid:288,ra_init(355) : [init][ra]ra hdc init failed, ret(-1)
[ERROR] HCCL(288,python):2024-07-17-17:02:12.104.178 [adapter_hccp.cc:398][288][Init][Ra]errNo[0x0000000005000013] ra init fail. return[128000]
[ERROR] HCCL(288,python):2024-07-17-17:02:12.104.326 [network_manager.cc:121][288][NetworkManager][Init]errNo[0x0000000005000013] ra init failed,return[19] devicePhyId_[1], nic_position[1]
[ERROR] HCCL(288,python):2024-07-17-17:02:12.104.334 [topoinfo_detect.cc:419][288][Start][Network]NetworkManager Init failed! deviceLogicID[0], deploy[1]
[ERROR] HCCL(288,python):2024-07-17-17:02:12.104.340 [topoinfo_detect.cc:198][288][Setup][Agent]topo detect agent start network failed! rank[0]
[ERROR] HCCL(288,python):2024-07-17-17:02:12.104.349 [op_base.cc:434][288][Init][CommRootInfo]errNo[0x0000000005000013] setup topo detect error
[INFO] HCCP(288,python):2024-07-17-17:02:12.104.359 [ra_host.c:1968]tid:288,ra_get_ifnum(1968) : Input parameters: phy_id[0], nic_position:[0]
[INFO] HCCL(288,python):2024-07-17-17:02:12.105.760 [adapter_hccp.cc:925][288][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[INFO] HCCP(288,python):2024-07-17-17:02:12.105.771 [ra_host.c:2016]tid:288,ra_get_ifaddrs(2016) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[INFO] HCCL(288,python):2024-07-17-17:02:12.106.336 [sal.cc:411][288]nic class[normal]: find nic[172.16.0.108%eth0] success.
[ERROR] HCCL(288,python):2024-07-17-17:02:12.106.348 [op_base.cc:481][288][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000013], rankNum[4], rank[0], rootInfo identifier[172.16.0.108%eth0_60001_1_1721235729854427], server[172.16.0.108%eth0], logicDevId[-1]
[INFO] HCCL(288,python):2024-07-17-17:02:12.106.688 [op_base.cc:1340][288]com is not global com
[ERROR] HCCL(288,python):2024-07-17-17:02:12.106.891 [op_base.cc:1352][288][HcclCommDestroy] comm is not exist, comm=0xaaab475968c0, group=172.16.0.108%eth0_60001_1_1721235729854427, deviceLogicId=0

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/mistral-7B-npu/train_mistral.py", line 7, in <module>
    pretrain()
  File "/home/ma-user/modelarts/user-job-dir/mistral-7B-npu/mistral_pp.py", line 148, in pretrain
    loss = forward_backward_pipelining_without_interleaving2(
  File "/home/ma-user/nkn/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/training2.py", line 155, in forward_backward_pipelining_without_interleaving2
    p2p_communication.send_forward(output_tensor)
  File "/home/ma-user/nkn/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/p2p_communication.py", line 22, in send_forward
    req = dist.isend(tensor=output_tensor, dst=next_rank)
  File "/home/ma-user/anaconda3/envs/gees/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1525, in isend
    return default_pg.send([tensor], dst, tag)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:64
[ERROR] 2024-07-17-17:02:12 (PID:288, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EI9999: Inner Error!
EI9999: 2024-07-17-17:02:12.104.200  ra init failed,return[19] devicePhyId_[1], nic_position[1][FUNC:Init][FILE:network_manager.cc][LINE:118]
        TraceBack (most recent call last):

[INFO] TDT(290,python):2024-07-17-17:02:12.120.652 [stub_process_mode_nowin.cpp:63][ProcessQueueForMdc][tid:290] [TsdClient] it is unnecessary of current mode[0] chiptype[5] to grant queue auth to aicpusd
[INFO] TDT(290,python):2024-07-17-17:02:12.120.671 [stub_process_mode_nowin.cpp:101][OpenInHost][tid:290] enter into OpenInHost deviceid[4]
[INFO] TDT(290,python):2024-07-17-17:02:12.120.714 [client_manager.cpp:368][IsHostEnvironment][tid:290] [TsdClient] logicDeviceId is [4], hostaicpunum[0]
[INFO] TDT(290,python):2024-07-17-17:02:12.120.722 [stub_process_mode_nowin.cpp:105][OpenInHost][tid:290] host cpu not support
[INFO] TDT(290,python):2024-07-17-17:02:12.120.727 [process_mode_manager.cpp:170][OpenProcess][tid:290] [TsdClient][deviceId=4] [sessionId=1] start hccp and computer process success
[INFO] HCCP(290,python):2024-07-17-17:02:12.120.739 [ra_host.c:332]tid:290,ra_init(332) : Input parameters: phy_id[4], nic_position:[1] hdc_type:[6]
[INFO] HCCP(290,python):2024-07-17-17:02:12.120.763 [ra_hdc.c:1540]tid:290,ra_hdc_init(1540) : hdc init start! logic id is 4, phy id is 4, hdc_type is 6
[INFO] HCCP(290,python):2024-07-17-17:02:12.121.104 [ra_hdc.c:1575]tid:290,ra_hdc_init(1575) : hdc init OK! phy_id[4]
[INFO] TDT(291,python):2024-07-17-17:02:12.123.110 [stub_process_mode_nowin.cpp:63][ProcessQueueForMdc][tid:291] [TsdClient] it is unnecessary of current mode[0] chiptype[5] to grant queue auth to aicpusd
[INFO] TDT(291,python):2024-07-17-17:02:12.123.134 [stub_process_mode_nowin.cpp:101][OpenInHost][tid:291] enter into OpenInHost deviceid[6]

Comments (2)

Nikanuo created this training issue 11 months ago

From the current error, the HDC interface fails while establishing the link because the device-side HDC server is gone (possibly it was never created at all); this looks related to how the job is launched.
To dig further, please either provide a reproduction script, or set the following variables:
export ASCEND_GLOBAL_LOG_LEVEL=0
export ASCEND_GLOBAL_EVENT_ENABLE=1
then collect the plog logs under /root/ascend/log (clearing out old logs before the run helps), and also run msnpureport -f to export the device-side logs.

It seems that several processes were all attached to device 0 — and indeed only device 0 had multiple processes on it — which broke communication initialization between them. After assigning each process a different default device, training ran successfully.
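For anyone hitting the same symptom: the fix described above amounts to pinning each process to its own NPU before the process group is created. A minimal sketch (assuming the job is launched with torchrun, which exports LOCAL_RANK; the Ascend imports are guarded so the snippet reads on machines without torch_npu):

```python
import os

def local_device_id() -> int:
    # torchrun / torch.distributed.launch set LOCAL_RANK per process;
    # fall back to 0 for a single-process run.
    return int(os.environ.get("LOCAL_RANK", "0"))

def bind_device_then_init():
    # Pin this process to its own NPU *before* init_process_group, so that
    # no two ranks land on device 0 and race each other's HDC session setup.
    device_id = local_device_id()
    try:
        import torch
        import torch_npu  # noqa: F401 -- registers the "npu" device type
        import torch.distributed as dist
        torch.npu.set_device(device_id)
        dist.init_process_group(backend="hccl")
    except ImportError:
        pass  # not on an Ascend machine; nothing to bind
```

With the device bound first, each rank's HCCL/HDC link is established on its own device instead of all ranks colliding on device 0.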

huangyunlong changed the task status from TODO to DONE 10 months ago


Participants (2): huangyunlong-huangyunlong2022, Nikanuo-nikanuo