
Ascend/pytorch

Single-node multi-card pipeline + model parallel training: HCCL communication error while transferring data (ERR02200 DIST call hccl api failed)

DONE
Training issue
Created 2024-07-08 17:12

Comments (9)

Nikanuo created this training issue 11 months ago
Nikanuo edited the description 11 months ago

Could you provide the detailed plog logs?

I rebuilt the environment and the error changed slightly, but it still looks similar. I've posted it below.

Here is the plog log. It seems to say that communication initialization failed. But right now running 4 pipeline stages on 8 cards succeeds, while running 2 pipeline stages hits this problem. Why would communication initialization fail?
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.007 [stub_process_mode_nowin.cpp:63][ProcessQueueForMdc][tid:36151] [TsdClient] it is unnecessary of current mode[0] chiptype[5] to grant queue auth to aicpusd
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.065 [stub_process_mode_nowin.cpp:101][OpenInHost][tid:36151] enter into OpenInHost deviceid[0]
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.095 [client_manager.cpp:368][IsHostEnvironment][tid:36151] [TsdClient] logicDeviceId is [0], hostaicpunum[0]
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.103 [stub_process_mode_nowin.cpp:105][OpenInHost][tid:36151] host cpu not support
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.109 [process_mode_manager.cpp:170][OpenProcess][tid:36151] [TsdClient][deviceId=0] [sessionId=1] start hccp and computer process success
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.122 [ra_host.c:332]tid:36151,ra_init(332) : Input parameters: phy_id[2], nic_position:[1] hdc_type:[6]
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.143 [ra_hdc.c:1540]tid:36151,ra_hdc_init(1540) : hdc init start! logic id is 2, phy id is 2, hdc_type is 6
[ERROR] HCCP(36151,python):2024-07-13-08:14:37.131.219 [ra_hdc.c:1550]tid:36151,ra_hdc_init(1550) : [init][ra_hdc]hdc session_connect failed ret(35) phy_id(2)
[ERROR] HCCP(36151,python):2024-07-13-08:14:37.131.229 [ra_host.c:355]tid:36151,ra_init(355) : [init][ra]ra hdc init failed, ret(-1)
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.241 [adapter_hccp.cc:398][36151][Init][Ra]errNo[0x0000000005000013] ra init fail. return[128000]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.394 [network_manager.cc:121][36151][NetworkManager][Init]errNo[0x0000000005000013] ra init failed,return[19] devicePhyId_[2], nic_position[1]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.405 [topoinfo_detect.cc:419][36151][Start][Network]NetworkManager Init failed! deviceLogicID[0], deploy[1]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.413 [topoinfo_detect.cc:198][36151][Setup][Agent]topo detect agent start network failed! rank[0]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.423 [op_base.cc:434][36151][Init][CommRootInfo]errNo[0x0000000005000013] setup topo detect error
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.436 [ra_host.c:1968]tid:36151,ra_get_ifnum(1968) : Input parameters: phy_id[0], nic_position:[0]
[INFO] HCCL(36151,python):2024-07-13-08:14:37.131.939 [adapter_hccp.cc:925][36151][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.952 [ra_host.c:2016]tid:36151,ra_get_ifaddrs(2016) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[INFO] HCCL(36151,python):2024-07-13-08:14:37.132.142 [sal.cc:411][36151]nic class[normal]: find nic[172.17.0.2%eth0] success.
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.132.158 [op_base.cc:481][36151][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000013], rankNum[2], rank[0], rootInfo identifier[172.17.0.2%eth0_60002_2_1720858473830131], server[172.17.0.2%eth0], logicDevId[-1]
[INFO] HCCL(36151,python):2024-07-13-08:14:37.132.197 [op_base.cc:1340][36151]com is not global com
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.132.227 [op_base.cc:1352][36151][HcclCommDestroy] comm is not exist, comm=0xaaaaee813a90, group=172.17.0.2%eth0_60002_2_1720858473830131, deviceLogicId=0

Epoch 1: 0%| | 0/230 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/ma-user/nkn/gees/GeeSibling/examples/pytorch/pipeline/gpt2lm/train_gpt.py", line 7, in <module>
    pretrain()
  File "/home/ma-user/nkn/gees/GeeSibling/examples/pytorch/pipeline/gpt2lm/gpt2_pp.py", line 193, in pretrain
    loss = forward_backward_pipelining_without_interleaving2(
  File "/home/ma-user/nkn/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/training2.py", line 155, in forward_backward_pipelining_without_interleaving2
    p2p_communication.send_forward(output_tensor)
  File "/home/ma-user/nkn/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/p2p_communication.py", line 22, in send_forward
    req = dist.isend(tensor=output_tensor, dst=next_rank)
  File "/home/ma-user/anaconda3/envs/gees/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1525, in isend
    return default_pg.send([tensor], dst, tag)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:64
[ERROR] 2024-07-13-08:14:37 (PID:36151, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EI9999: Inner Error!
EI9999: 2024-07-13-08:14:37.131.260 ra init failed,return[19] devicePhyId_[2], nic_position[1][FUNC:Init][FILE:network_manager.cc][LINE:118]
TraceBack (most recent call last):

I set up the environment again; let me provide the plog for this error instead. It looks like a similar error.
Likewise, the code runs fine on GPU, but on NPU it gets stuck at this communication step.

[INFO] TDT(18081,python):2024-07-11-10:04:55.128.266 [stub_process_mode_nowin.cpp:101][OpenInHost][tid:18081] enter into OpenInHost deviceid[1]
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.291 [client_manager.cpp:346][IsHostEnvironment][tid:18081] [TsdClient] logicDeviceId is [1], hostaicpunum[0]
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.303 [stub_process_mode_nowin.cpp:105][OpenInHost][tid:18081] host cpu not support
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.309 [process_mode_manager.cpp:156][OpenProcess][tid:18081] [TsdClient][deviceId=1] [sessionId=1] start hccp and computer process success
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.128.322 [ra_host.c:292]tid:18081,ra_init(292) : Input parameters: phy_id[1], nic_position:[1]
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.128.612 [ra_hdc.c:1452]tid:18081,ra_hdc_init(1452) : hdc init start! logic id is 1, phy id is 1
[EVENT] HCCP(18083,python):2024-07-11-10:04:55.128.901 [ra_host.c:453]tid:18083,ra_socket_init_v1(453) : socket init:mode=0 phy_id=3 family=2 ip=172.17.0.2
[EVENT] HCCL(18083,python):2024-07-11-10:04:55.128.945 [adapter_hccp.cc:984][18083][Get][DeviceIP]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.128.949 [ra_hdc.c:1487]tid:18081,ra_hdc_init(1487) : hdc init OK! phy_id[1]
[EVENT] HCCP(18083,python):2024-07-11-10:04:55.128.956 [ra_host.c:1929]tid:18083,ra_get_ifaddrs(1929) : Input parameters: phy_id[3], nic_position:[1], interface num[2]
[EVENT] HCCL(18083,python):2024-07-11-10:04:55.129.330 [topoinfo_detect.cc:471][18083]select AF_INET family as device socket family.
[EVENT] HCCL(18083,python):2024-07-11-10:04:55.129.344 [topoinfo_detect.cc:482][18083]no device ip: use 0 as device ip.
[EVENT] HCCP(18083,python):2024-07-11-10:04:55.129.393 [ra_host.c:824]tid:18083,ra_socket_batch_connect(824) : Input parameters: [0]th, phy_id[3], local_ip[172.17.0.2], remote_ip[172.17.0.2], tag:[topo_detect_default_tag_60000]
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.131.537 [ra_host.c:453]tid:18081,ra_socket_init_v1(453) : socket init:mode=0 phy_id=1 family=2 ip=172.17.0.2
[EVENT] HCCL(18081,python):2024-07-11-10:04:55.131.574 [adapter_hccp.cc:984][18081][Get][DeviceIP]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.131.583 [ra_host.c:1929]tid:18081,ra_get_ifaddrs(1929) : Input parameters: phy_id[1], nic_position:[1], interface num[2]
[EVENT] HCCL(18081,python):2024-07-11-10:04:55.132.079 [topoinfo_detect.cc:471][18081]select AF_INET family as device socket family.
[EVENT] HCCL(18081,python):2024-07-11-10:04:55.132.090 [topoinfo_detect.cc:482][18081]no device ip: use 0 as device ip.
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.132.138 [ra_host.c:824]tid:18081,ra_socket_batch_connect(824) : Input parameters: [0]th, phy_id[1], local_ip[172.17.0.2], remote_ip[172.17.0.2], tag:[topo_detect_default_tag_60000]
Data loader created with limited dataset.
len loader :459
------------------
Epoch 1:   0%|                                                                                                                                                     | 0/230 [00:00<?, ?it/s]7 get before 1f1b 
At rank 7: initializing empty tensor, shape: (8, 128, 768)
At rank 7: receiving data from rank 6
[EVENT] HCCL(18087,python):2024-07-11-10:04:58.927.912 [op_base.cc:405][18087]Entry-HcclCommInitRootInfo:ranks[8], rank[7], rootinfo: host ip[172.17.0.2] port[60000] nicDeploy[1] identifier[172.17.0.2%eth0_60000_0_1720692292043672], deviceLogicId[7]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.933.235 [ra_host.c:1881]tid:18087,ra_get_ifnum(1881) : Input parameters: phy_id[0], nic_position:[0]
[EVENT] HCCL(18087,python):2024-07-11-10:04:58.933.983 [adapter_hccp.cc:820][18087][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.934.002 [ra_host.c:1929]tid:18087,ra_get_ifaddrs(1929) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[EVENT] HCCL(18087,python):2024-07-11-10:04:58.934.229 [sal.cc:383][18087]nic class[normal]: find nic[172.17.0.2%eth0] success.
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.934.452 [ra_host.c:1721]tid:18087,ra_socket_set_white_list_status(1721) : Input parameters: enable[0]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.934.468 [ra_host.c:292]tid:18087,ra_init(292) : Input parameters: phy_id[7], nic_position:[0]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.097 [rs_ssl.c:1063]tid:18087,rs_ssl_init(1063) : TLS SWITCH (0)
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.459 [rs_epoll.c:470]tid:21609,rs_epoll_handle(470) : pthread[epoll_pthread] is alive!
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.575 [rs.c:402]tid:18087,rs_init(402) : rs init success, chip_id[7]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.593 [rs_epoll.c:595]tid:21610,rs_connect_handle(595) : pthread[connect_pthread] is alive!
[INFO] TDT(18087,python):2024-07-11-10:04:58.935.610 [process_mode_manager.cpp:109][OpenProcess][tid:18087] [ProcessModeManager] enter into open process deviceId[7] rankSize[2]
[INFO] TDT(18087,python):2024-07-11-10:04:58.935.961 [process_mode_manager.cpp:705][GetDeviceCheckCode][tid:18087] [ProcessModeManager][deviceId=7] aicpu package already exist in device
[INFO] TDT(18087,python):2024-07-11-10:04:58.936.009 [process_mode_manager.cpp:426][ConstructOpenMsg][tid:18087] [TsdClient] tsd get process sign successfully, procpid[134404] signSize[48]
[INFO] TDT(18087,python):2024-07-11-10:04:58.936.066 [process_mode_manager.cpp:126][OpenProcess][tid:18087] [ProcessModeManager] deviceId[7] sessionId[1] rankSize[2], wait sub process start respond
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.230 [process_mode_manager.cpp:595][DeviceMsgProcess][tid:18087] [TsdClient] DeviceMsgProc  errcode[EJ0001]
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.381 [process_mode_manager.cpp:274][WaitRsp][tid:18087] tsd client wait response fail, device response code[1]. unknown device error.
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.415 [process_mode_manager.cpp:129][OpenProcess][tid:18087] Wait open response from device failed.
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.426 [tsd_client.cpp:33][TsdOpen][tid:18087] TsdOpen failed, deviceId[7].
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.518 [adapter_tdt.cc:21][18087][Open][Tsd]Open TsdClient failed, tdt error code : 31, error deviceLogicId[7], rankSize[2]
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.530 [network_manager.cc:96][18087]call trace: hcclRet -> 7
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.538 [topoinfo_detect.cc:404][18087][Start][Network]NetworkManager Init failed! deviceLogicID[7], deploy[1]
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.546 [topoinfo_detect.cc:198][18087][Setup][Agent]topo detect agent start network failed! rank[7]
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.557 [op_base.cc:426][18087][Init][CommRootInfo]errNo[0x0000000005000007] setup topo detect error
[EVENT] HCCP(18087,python):2024-07-11-10:05:00.953.574 [ra_host.c:1881]tid:18087,ra_get_ifnum(1881) : Input parameters: phy_id[0], nic_position:[0]
[EVENT] HCCL(18087,python):2024-07-11-10:05:00.954.100 [adapter_hccp.cc:820][18087][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18087,python):2024-07-11-10:05:00.954.113 [ra_host.c:1929]tid:18087,ra_get_ifaddrs(1929) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[EVENT] HCCL(18087,python):2024-07-11-10:05:00.954.320 [sal.cc:383][18087]nic class[normal]: find nic[172.17.0.2%eth0] success.
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.954.336 [op_base.cc:473][18087][Init][CommRootInfo]HcclCommInitRootInfo failed, rankNum[8], rank[7], server[172.17.0.2%eth0], return[0x0000000005000007], rootInfo identifier[172.17.0.2%eth0_60000_0_1720692292043672]
[EVENT] HCCL(18087,python):2024-07-11-10:05:00.954.386 [op_base.cc:1312][18087]com is not global com
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.954.429 [op_base.cc:1324][18087][HcclCommDestroy] comm is not exist, comm=0xaaab0d62c170, group=172.17.0.2%eth0_60000_0_1720692292043672, deviceLogicId=7
Epoch 1:   0%|                                                                                                                                                     | 0/230 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/ma-user/work/gees1/GeeSibling/examples/pytorch/pipeline/gpt2lm/train_gpt.py", line 7, in <module>
    pretrain()
  File "/home/ma-user/work/gees1/GeeSibling/examples/pytorch/pipeline/gpt2lm/gpt2_pp.py", line 198, in pretrain
    loss = forward_backward_pipelining_without_interleaving2(
  File "/home/ma-user/work/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/training2.py", line 169, in forward_backward_pipelining_without_interleaving2
    input_tensor = p2p_communication.recv_forward(recv_fwd_tensor_shape,gpus_list[0])
  File "/home/ma-user/work/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/p2p_communication.py", line 35, in recv_forward
    req = dist.irecv(tensor=recv_tensor, src=prev_rank)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1565, in irecv
    return pg.recv([tensor], src, tag)
RuntimeError: [ERROR] HCCL error in: build/CMakeFiles/torch_npu.dir/compiler_depend.ts:63.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
        Solution: Wait for 10s after killing the last training process and try again.
        TraceBack (most recent call last):
        tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:275]

[INFO] TDT(18087,python):2024-07-11-10:05:00.964.953 [process_mode_manager.cpp:184][Close][tid:18087] [TsdClient] Close [deviceId=6][sessionId=1] hccp and computer enter
[INFO] TDT(18087,python):2024-07-11-10:05:00.964.977 [version_verify.cpp:112][SpecialFeatureCheck][tid:18087] VersionVerify: previous type[7], supported
[INFO] TDT(18087,python):2024-07-11-10:05:00.965.019 [process_mode_manager.cpp:192][Close][tid:18087] [TsdClient][deviceId=6] [sessionId=1] wait hccp and computer process close respond
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.885 [process_mode_manager.cpp:197][Close][tid:18087] [TsdClient][logicDeviceId_=6]has recv close hccp and computer process respond
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.926 [stub_process_mode_nowin.cpp:151][CloseInHost][tid:18087] enter into CloseInHost deviceid[6]
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.958 [client_manager.cpp:346][IsHostEnvironment][tid:18087] [TsdClient] logicDeviceId is [6], hostaicpunum[0]
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.967 [stub_process_mode_nowin.cpp:154][CloseInHost][tid:18087] host cpu not support
[INFO] TDT(18087,python):2024-07-11-10:05:04.023.017 [process_mode_manager.cpp:208][Close][tid:18087] [TsdClient][deviceId=6] [sessionId=1] close hccp and computer process success
[INFO] ATRACE(18087,python):2024-07-11-10:05:04.023.029 [atrace_api.c:73](tid:18087) AtraceDestroy start
[INFO] ATRACE(18087,python):2024-07-11-10:05:04.023.053 [atrace_api.c:75](tid:18087) AtraceDestroy end

Your error message says the previous training process has not exited yet, so hccp initialization failed.

But I'm certain the previous process exited; no such process remains.
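One way to double-check this before relaunching (a sketch only; the `pgrep` pattern `train_gpt.py` is taken from the traceback above and may need adjusting to match your actual launcher command):

```shell
# Check whether any process from the previous training run is still alive.
# EJ0001 suggests waiting ~10s after the last process dies before relaunching.
if pgrep -f train_gpt.py >/dev/null 2>&1; then
    echo "stale training process still running:"
    pgrep -af train_gpt.py
else
    echo "no stale process"
fi
```

If a stale process shows up, kill it, wait about 10 seconds as the EJ0001 message suggests, and retry.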

Then could you provide the plog and device logs from a reproduction done after the previous process has fully exited?

Does Ascend support running multiple processes on a single card? In my code, each process corresponds to a pipeline stage. On GPU, a single process manages multiple cards and each card ends up with one process; but when running on NPU, I find multiple processes on every card.

Ascend does support running multiple processes on a single card. As for why every card ends up with multiple processes, you need to check the implementation logic of your code.
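For reference, the usual one-process-per-card mapping can be sketched as below. This is an assumption-laden sketch, not the GeeSibling implementation: it assumes the torchrun-style `LOCAL_RANK` environment variable, a hypothetical `NPUS_PER_NODE` constant, and that `torch.npu.set_device` (from torch_npu) behaves like `torch.cuda.set_device`.

```python
import os

NPUS_PER_NODE = 8  # hypothetical constant; matches the 8-card node described above

def device_for_rank(local_rank: int, devices_per_node: int = NPUS_PER_NODE) -> int:
    """One process <-> one card: each local rank pins exactly one device.

    If every process calls set_device with this value before creating any
    communicator, no card should end up hosting two ranks' contexts.
    """
    return local_rank % devices_per_node

# In the training entry point, before dist.init_process_group("hccl"):
#   local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
#   torch.npu.set_device(device_for_rank(local_rank))   # torch_npu analogue of torch.cuda.set_device
```

If a process touches a device other than its own (e.g. creates tensors on device 0 by default before `set_device`), that shows up as extra processes on that card in `npu-smi`, which would match the symptom described.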

huangyunlong changed task status from TODO to Analysing 11 months ago
huangyunlong changed task status from Analysing to DONE 10 months ago
