Could you provide the detailed plog logs?
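I rebuilt the environment and the error changed slightly, but it still looks similar. I have posted it below.
This is the plog. It seems to say that communication initialization failed. At the moment, on 8 cards, running with 4 pipelines succeeds, but running with 2 pipelines hits this problem. Why would communication initialization fail?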
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.007 [stub_process_mode_nowin.cpp:63][ProcessQueueForMdc][tid:36151] [TsdClient] it is unnecessary of current mode[0] chiptype[5] to grant queue auth to aicpusd
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.065 [stub_process_mode_nowin.cpp:101][OpenInHost][tid:36151] enter into OpenInHost deviceid[0]
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.095 [client_manager.cpp:368][IsHostEnvironment][tid:36151] [TsdClient] logicDeviceId is [0], hostaicpunum[0]
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.103 [stub_process_mode_nowin.cpp:105][OpenInHost][tid:36151] host cpu not support
[INFO] TDT(36151,python):2024-07-13-08:14:37.131.109 [process_mode_manager.cpp:170][OpenProcess][tid:36151] [TsdClient][deviceId=0] [sessionId=1] start hccp and computer process success
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.122 [ra_host.c:332]tid:36151,ra_init(332) : Input parameters: phy_id[2], nic_position:[1] hdc_type:[6]
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.143 [ra_hdc.c:1540]tid:36151,ra_hdc_init(1540) : hdc init start! logic id is 2, phy id is 2, hdc_type is 6
[ERROR] HCCP(36151,python):2024-07-13-08:14:37.131.219 [ra_hdc.c:1550]tid:36151,ra_hdc_init(1550) : [init][ra_hdc]hdc session_connect failed ret(35) phy_id(2)
[ERROR] HCCP(36151,python):2024-07-13-08:14:37.131.229 [ra_host.c:355]tid:36151,ra_init(355) : [init][ra]ra hdc init failed, ret(-1)
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.241 [adapter_hccp.cc:398][36151][Init][Ra]errNo[0x0000000005000013] ra init fail. return[128000]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.394 [network_manager.cc:121][36151][NetworkManager][Init]errNo[0x0000000005000013] ra init failed,return[19] devicePhyId_[2], nic_position[1]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.405 [topoinfo_detect.cc:419][36151][Start][Network]NetworkManager Init failed! deviceLogicID[0], deploy[1]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.413 [topoinfo_detect.cc:198][36151][Setup][Agent]topo detect agent start network failed! rank[0]
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.423 [op_base.cc:434][36151][Init][CommRootInfo]errNo[0x0000000005000013] setup topo detect error
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.436 [ra_host.c:1968]tid:36151,ra_get_ifnum(1968) : Input parameters: phy_id[0], nic_position:[0]
[INFO] HCCL(36151,python):2024-07-13-08:14:37.131.939 [adapter_hccp.cc:925][36151][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.952 [ra_host.c:2016]tid:36151,ra_get_ifaddrs(2016) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[INFO] HCCL(36151,python):2024-07-13-08:14:37.132.142 [sal.cc:411][36151]nic class[normal]: find nic[172.17.0.2%eth0] success.
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.132.158 [op_base.cc:481][36151][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000013], rankNum[2], rank[0], rootInfo identifier[172.17.0.2%eth0_60002_2_1720858473830131], server[172.17.0.2%eth0], logicDevId[-1]
[INFO] HCCL(36151,python):2024-07-13-08:14:37.132.197 [op_base.cc:1340][36151]com is not global com
[ERROR] HCCL(36151,python):2024-07-13-08:14:37.132.227 [op_base.cc:1352][36151][HcclCommDestroy] comm is not exist, comm=0xaaaaee813a90, group=172.17.0.2%eth0_60002_2_1720858473830131, deviceLogicId=0
Epoch 1: 0%| | 0/230 [00:04<?, ?it/s]
Traceback (most recent call last):
File "/home/ma-user/nkn/gees/GeeSibling/examples/pytorch/pipeline/gpt2lm/train_gpt.py", line 7, in <module>
pretrain()
File "/home/ma-user/nkn/gees/GeeSibling/examples/pytorch/pipeline/gpt2lm/gpt2_pp.py", line 193, in pretrain
loss = forward_backward_pipelining_without_interleaving2(
File "/home/ma-user/nkn/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/training2.py", line 155, in forward_backward_pipelining_without_interleaving2
p2p_communication.send_forward(output_tensor)
File "/home/ma-user/nkn/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/p2p_communication.py", line 22, in send_forward
req = dist.isend(tensor=output_tensor, dst=next_rank)
File "/home/ma-user/anaconda3/envs/gees/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1525, in isend
return default_pg.send([tensor], dst, tag)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:64
[ERROR] 2024-07-13-08:14:37 (PID:36151, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EI9999: Inner Error!
EI9999: 2024-07-13-08:14:37.131.260 ra init failed,return[19] devicePhyId_[2], nic_position[1][FUNC:Init][FILE:network_manager.cc][LINE:118]
TraceBack (most recent call last):
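I rebuilt the environment again, so let me provide the plog for this error instead. It looks like a similar error.
As before, the code runs on GPU, but after switching to NPU it gets stuck at communication.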
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.266 [stub_process_mode_nowin.cpp:101][OpenInHost][tid:18081] enter into OpenInHost deviceid[1]
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.291 [client_manager.cpp:346][IsHostEnvironment][tid:18081] [TsdClient] logicDeviceId is [1], hostaicpunum[0]
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.303 [stub_process_mode_nowin.cpp:105][OpenInHost][tid:18081] host cpu not support
[INFO] TDT(18081,python):2024-07-11-10:04:55.128.309 [process_mode_manager.cpp:156][OpenProcess][tid:18081] [TsdClient][deviceId=1] [sessionId=1] start hccp and computer process success
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.128.322 [ra_host.c:292]tid:18081,ra_init(292) : Input parameters: phy_id[1], nic_position:[1]
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.128.612 [ra_hdc.c:1452]tid:18081,ra_hdc_init(1452) : hdc init start! logic id is 1, phy id is 1
[EVENT] HCCP(18083,python):2024-07-11-10:04:55.128.901 [ra_host.c:453]tid:18083,ra_socket_init_v1(453) : socket init:mode=0 phy_id=3 family=2 ip=172.17.0.2
[EVENT] HCCL(18083,python):2024-07-11-10:04:55.128.945 [adapter_hccp.cc:984][18083][Get][DeviceIP]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.128.949 [ra_hdc.c:1487]tid:18081,ra_hdc_init(1487) : hdc init OK! phy_id[1]
[EVENT] HCCP(18083,python):2024-07-11-10:04:55.128.956 [ra_host.c:1929]tid:18083,ra_get_ifaddrs(1929) : Input parameters: phy_id[3], nic_position:[1], interface num[2]
[EVENT] HCCL(18083,python):2024-07-11-10:04:55.129.330 [topoinfo_detect.cc:471][18083]select AF_INET family as device socket family.
[EVENT] HCCL(18083,python):2024-07-11-10:04:55.129.344 [topoinfo_detect.cc:482][18083]no device ip: use 0 as device ip.
[EVENT] HCCP(18083,python):2024-07-11-10:04:55.129.393 [ra_host.c:824]tid:18083,ra_socket_batch_connect(824) : Input parameters: [0]th, phy_id[3], local_ip[172.17.0.2], remote_ip[172.17.0.2], tag:[topo_detect_default_tag_60000]
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.131.537 [ra_host.c:453]tid:18081,ra_socket_init_v1(453) : socket init:mode=0 phy_id=1 family=2 ip=172.17.0.2
[EVENT] HCCL(18081,python):2024-07-11-10:04:55.131.574 [adapter_hccp.cc:984][18081][Get][DeviceIP]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.131.583 [ra_host.c:1929]tid:18081,ra_get_ifaddrs(1929) : Input parameters: phy_id[1], nic_position:[1], interface num[2]
[EVENT] HCCL(18081,python):2024-07-11-10:04:55.132.079 [topoinfo_detect.cc:471][18081]select AF_INET family as device socket family.
[EVENT] HCCL(18081,python):2024-07-11-10:04:55.132.090 [topoinfo_detect.cc:482][18081]no device ip: use 0 as device ip.
[EVENT] HCCP(18081,python):2024-07-11-10:04:55.132.138 [ra_host.c:824]tid:18081,ra_socket_batch_connect(824) : Input parameters: [0]th, phy_id[1], local_ip[172.17.0.2], remote_ip[172.17.0.2], tag:[topo_detect_default_tag_60000]
Data loader created with limited dataset.
len loader :459
------------------
Epoch 1: 0%| | 0/230 [00:00<?, ?it/s]7 get before 1f1b
On 7: initializing empty tensor, shape: (8, 128, 768)
On 7: receiving data from 6
[EVENT] HCCL(18087,python):2024-07-11-10:04:58.927.912 [op_base.cc:405][18087]Entry-HcclCommInitRootInfo:ranks[8], rank[7], rootinfo: host ip[172.17.0.2] port[60000] nicDeploy[1] identifier[172.17.0.2%eth0_60000_0_1720692292043672], deviceLogicId[7]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.933.235 [ra_host.c:1881]tid:18087,ra_get_ifnum(1881) : Input parameters: phy_id[0], nic_position:[0]
[EVENT] HCCL(18087,python):2024-07-11-10:04:58.933.983 [adapter_hccp.cc:820][18087][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.934.002 [ra_host.c:1929]tid:18087,ra_get_ifaddrs(1929) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[EVENT] HCCL(18087,python):2024-07-11-10:04:58.934.229 [sal.cc:383][18087]nic class[normal]: find nic[172.17.0.2%eth0] success.
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.934.452 [ra_host.c:1721]tid:18087,ra_socket_set_white_list_status(1721) : Input parameters: enable[0]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.934.468 [ra_host.c:292]tid:18087,ra_init(292) : Input parameters: phy_id[7], nic_position:[0]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.097 [rs_ssl.c:1063]tid:18087,rs_ssl_init(1063) : TLS SWITCH (0)
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.459 [rs_epoll.c:470]tid:21609,rs_epoll_handle(470) : pthread[epoll_pthread] is alive!
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.575 [rs.c:402]tid:18087,rs_init(402) : rs init success, chip_id[7]
[EVENT] HCCP(18087,python):2024-07-11-10:04:58.935.593 [rs_epoll.c:595]tid:21610,rs_connect_handle(595) : pthread[connect_pthread] is alive!
[INFO] TDT(18087,python):2024-07-11-10:04:58.935.610 [process_mode_manager.cpp:109][OpenProcess][tid:18087] [ProcessModeManager] enter into open process deviceId[7] rankSize[2]
[INFO] TDT(18087,python):2024-07-11-10:04:58.935.961 [process_mode_manager.cpp:705][GetDeviceCheckCode][tid:18087] [ProcessModeManager][deviceId=7] aicpu package already exist in device
[INFO] TDT(18087,python):2024-07-11-10:04:58.936.009 [process_mode_manager.cpp:426][ConstructOpenMsg][tid:18087] [TsdClient] tsd get process sign successfully, procpid[134404] signSize[48]
[INFO] TDT(18087,python):2024-07-11-10:04:58.936.066 [process_mode_manager.cpp:126][OpenProcess][tid:18087] [ProcessModeManager] deviceId[7] sessionId[1] rankSize[2], wait sub process start respond
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.230 [process_mode_manager.cpp:595][DeviceMsgProcess][tid:18087] [TsdClient] DeviceMsgProc errcode[EJ0001]
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.381 [process_mode_manager.cpp:274][WaitRsp][tid:18087] tsd client wait response fail, device response code[1]. unknown device error.
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.415 [process_mode_manager.cpp:129][OpenProcess][tid:18087] Wait open response from device failed.
[ERROR] TDT(18087,python):2024-07-11-10:05:00.953.426 [tsd_client.cpp:33][TsdOpen][tid:18087] TsdOpen failed, deviceId[7].
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.518 [adapter_tdt.cc:21][18087][Open][Tsd]Open TsdClient failed, tdt error code : 31, error deviceLogicId[7], rankSize[2]
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.530 [network_manager.cc:96][18087]call trace: hcclRet -> 7
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.538 [topoinfo_detect.cc:404][18087][Start][Network]NetworkManager Init failed! deviceLogicID[7], deploy[1]
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.546 [topoinfo_detect.cc:198][18087][Setup][Agent]topo detect agent start network failed! rank[7]
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.953.557 [op_base.cc:426][18087][Init][CommRootInfo]errNo[0x0000000005000007] setup topo detect error
[EVENT] HCCP(18087,python):2024-07-11-10:05:00.953.574 [ra_host.c:1881]tid:18087,ra_get_ifnum(1881) : Input parameters: phy_id[0], nic_position:[0]
[EVENT] HCCL(18087,python):2024-07-11-10:05:00.954.100 [adapter_hccp.cc:820][18087][Get][HostIf]hrtGetIfNum success. ifAddrNum[2].
[EVENT] HCCP(18087,python):2024-07-11-10:05:00.954.113 [ra_host.c:1929]tid:18087,ra_get_ifaddrs(1929) : Input parameters: phy_id[0], nic_position:[0], interface num[2]
[EVENT] HCCL(18087,python):2024-07-11-10:05:00.954.320 [sal.cc:383][18087]nic class[normal]: find nic[172.17.0.2%eth0] success.
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.954.336 [op_base.cc:473][18087][Init][CommRootInfo]HcclCommInitRootInfo failed, rankNum[8], rank[7], server[172.17.0.2%eth0], return[0x0000000005000007], rootInfo identifier[172.17.0.2%eth0_60000_0_1720692292043672]
[EVENT] HCCL(18087,python):2024-07-11-10:05:00.954.386 [op_base.cc:1312][18087]com is not global com
[ERROR] HCCL(18087,python):2024-07-11-10:05:00.954.429 [op_base.cc:1324][18087][HcclCommDestroy] comm is not exist, comm=0xaaab0d62c170, group=172.17.0.2%eth0_60000_0_1720692292043672, deviceLogicId=7
Epoch 1: 0%| | 0/230 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/ma-user/work/gees1/GeeSibling/examples/pytorch/pipeline/gpt2lm/train_gpt.py", line 7, in <module>
pretrain()
File "/home/ma-user/work/gees1/GeeSibling/examples/pytorch/pipeline/gpt2lm/gpt2_pp.py", line 198, in pretrain
loss = forward_backward_pipelining_without_interleaving2(
File "/home/ma-user/work/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/training2.py", line 169, in forward_backward_pipelining_without_interleaving2
input_tensor = p2p_communication.recv_forward(recv_fwd_tensor_shape,gpus_list[0])
File "/home/ma-user/work/gees/GeeSibling/python/geesibling/adapters/pytorch/pipeline/pipeline/p2p_communication.py", line 35, in recv_forward
req = dist.irecv(tensor=recv_tensor, src=prev_rank)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1565, in irecv
return pg.recv([tensor], src, tag)
RuntimeError: [ERROR] HCCL error in: build/CMakeFiles/torch_npu.dir/compiler_depend.ts:63.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
Solution: Wait for 10s after killing the last training process and try again.
TraceBack (most recent call last):
tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:275]
[INFO] TDT(18087,python):2024-07-11-10:05:00.964.953 [process_mode_manager.cpp:184][Close][tid:18087] [TsdClient] Close [deviceId=6][sessionId=1] hccp and computer enter
[INFO] TDT(18087,python):2024-07-11-10:05:00.964.977 [version_verify.cpp:112][SpecialFeatureCheck][tid:18087] VersionVerify: previous type[7], supported
[INFO] TDT(18087,python):2024-07-11-10:05:00.965.019 [process_mode_manager.cpp:192][Close][tid:18087] [TsdClient][deviceId=6] [sessionId=1] wait hccp and computer process close respond
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.885 [process_mode_manager.cpp:197][Close][tid:18087] [TsdClient][logicDeviceId_=6]has recv close hccp and computer process respond
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.926 [stub_process_mode_nowin.cpp:151][CloseInHost][tid:18087] enter into CloseInHost deviceid[6]
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.958 [client_manager.cpp:346][IsHostEnvironment][tid:18087] [TsdClient] logicDeviceId is [6], hostaicpunum[0]
[INFO] TDT(18087,python):2024-07-11-10:05:04.022.967 [stub_process_mode_nowin.cpp:154][CloseInHost][tid:18087] host cpu not support
[INFO] TDT(18087,python):2024-07-11-10:05:04.023.017 [process_mode_manager.cpp:208][Close][tid:18087] [TsdClient][deviceId=6] [sessionId=1] close hccp and computer process success
[INFO] ATRACE(18087,python):2024-07-11-10:05:04.023.029 [atrace_api.c:73](tid:18087) AtraceDestroy start
[INFO] ATRACE(18087,python):2024-07-11-10:05:04.023.053 [atrace_api.c:75](tid:18087) AtraceDestroy end
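Your error output shows that the previous process has not exited yet, which is why HCCP initialization failed.
But I am sure the previous process has exited; no such process exists.
Then could you reproduce the problem after the previous process has exited and provide the plog and device logs from that run?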
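When collecting plog output for a report like this, it can help to pull out the first ERROR record per run, since later errors are often consequences of the first one. The following is a minimal sketch that assumes the line layout seen in the logs posted in this thread (`[LEVEL] COMPONENT(pid,python):timestamp [file:line]message`). The names `parse_plog_line`, `first_error` and `error_counts` are hypothetical helpers, not part of any Ascend tool.

```python
import re
from collections import Counter

# Layout assumed from the logs in this thread, for example:
# [ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.241 [adapter_hccp.cc:398]...
PLOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<component>[A-Z]+)\((?P<pid>\d+),[^)]*\):"
    r"(?P<timestamp>\S+)\s+\[(?P<source>[^\]]+)\]\s*(?P<message>.*)$"
)


def parse_plog_line(line):
    """Return a dict of fields for a plog line, or None for other output."""
    match = PLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def first_error(lines):
    """Return the first record whose level is ERROR, or None if there is none."""
    for line in lines:
        record = parse_plog_line(line)
        if record and record["level"] == "ERROR":
            return record
    return None


def error_counts(lines):
    """Count ERROR records per component (HCCP, HCCL, TDT, ...)."""
    counts = Counter()
    for line in lines:
        record = parse_plog_line(line)
        if record and record["level"] == "ERROR":
            counts[record["component"]] += 1
    return counts


# Three lines copied from the first log above, plus one non-plog line.
SAMPLE = [
    "[INFO] HCCP(36151,python):2024-07-13-08:14:37.131.143 [ra_hdc.c:1540]"
    "tid:36151,ra_hdc_init(1540) : hdc init start! logic id is 2, phy id is 2, hdc_type is 6",
    "[ERROR] HCCP(36151,python):2024-07-13-08:14:37.131.219 [ra_hdc.c:1550]"
    "tid:36151,ra_hdc_init(1550) : [init][ra_hdc]hdc session_connect failed ret(35) phy_id(2)",
    "[ERROR] HCCL(36151,python):2024-07-13-08:14:37.131.241 [adapter_hccp.cc:398]"
    "[36151][Init][Ra]errNo[0x0000000005000013] ra init fail. return[128000]",
    "Epoch 1: 0%| | 0/230 [00:04<?, ?it/s]",
]

if __name__ == "__main__":
    record = first_error(SAMPLE)
    print(record["component"], record["source"], record["message"])
    print(dict(error_counts(SAMPLE)))
```

Applied to the first log in this thread, the earliest ERROR record comes from HCCP (`ra_hdc.c:1550`, `hdc session_connect failed`), and the HCCL errors follow it.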
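Does Ascend support running multiple processes on a single card? In my code, each process corresponds to one pipeline stage. On GPU, one process manages multiple cards and each card has one process, but when I run on NPU I find multiple processes on every card.
Ascend can run multiple processes on a single card. As for why every card ends up with multiple processes, that depends on the implementation logic of your code and needs to be checked there.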
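One thing worth checking in the code is whether each process selects exactly one device before any process group is created, because a process that touches other devices can show up on those cards as well. The sketch below is an assumption about a typical launcher-based setup, not the GeeSibling implementation. `local_device_index` and `resolve_device` are hypothetical helpers, and `LOCAL_RANK` and `RANK` are the environment variables that `torchrun` sets.

```python
import os


def local_device_index(rank, devices_per_node=8):
    """Map a global rank to a single local device index on its node."""
    if devices_per_node <= 0:
        raise ValueError("devices_per_node must be positive")
    return rank % devices_per_node


def resolve_device(env=None, devices_per_node=8):
    """Pick one device for this process, preferring LOCAL_RANK when it is set."""
    env = os.environ if env is None else env
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    return local_device_index(int(env.get("RANK", "0")), devices_per_node)


# In a training script the device would be bound once, before any
# communication group is created (assumes torch_npu is installed):
#
#   import torch
#   import torch_npu
#   torch.npu.set_device(resolve_device())
#   torch.distributed.init_process_group(backend="hccl")

if __name__ == "__main__":
    print(resolve_device({"LOCAL_RANK": "3", "RANK": "11"}))
    print(resolve_device({"RANK": "10"}, devices_per_node=8))
```

With this pattern each pipeline-stage process uses a single device index. If a stage is meant to span several cards from one process, the mapping has to be designed differently and this sketch does not apply.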