77 Star 597 Fork 1.2K

Ascend/pytorch

HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm_)), error code is 1

DONE
Requirement
Opened this issue  
2025-03-12 20:21

Description

I am running distributed reinforcement learning on a framework that uses Ray Core and multiple actors. Unfortunately it seems like Ray does not allow for communication between actors using the hccl backend (please correct me if I am wrong!) so I have been using torch.distributed in order to perform this communication. There is a torch.distributed.init_process_group for each of these remote actors. The framework I am using has a custom implementation that tries to share a PrefixStore between these actors. Unfortunately when I try a collective torch.distributed.broadcast function, I get the following error:

Error

2025-03-12 19:16:40,149 ERROR ray_test.py:314 -- Error during Ray-based distributed training: ray::DistributedWorker.run_broadcast_test() (pid=3943699, ip=0.0.0.0, actor_id=4af49119d131112dffd124a102000000, repr=<ray_test.DistributedWorker object at 0xffcfb9e32790>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/work/test-scripts/hcclcomminittest/ray_test.py", line 98, in run_broadcast_test
dist.broadcast(tensor, 0, group=self.process_group)
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
work = group.broadcast([tensor], opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:89 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm_)), error code is 1
[ERROR] 2025-03-12-19:16:40 (PID:3943699, Device:0, RankID:1) ERR02200 DIST call hccl api failed. job_id=02000000 worker_id=02000000ffffffffffffffffffffffffffffffffffffffffffffffff node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 timestamp_ns=1741778200149107141
2025-03-12 19:16:40,151 ERROR ray_test.py:316 -- Traceback (most recent call last):
File "/home/ma-user/work/test-scripts/hcclcomminittest/ray_test.py", line 302, in main
broadcast_results = ray.get([worker.run_broadcast_test.remote() for worker in workers])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/ray/private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedWorker.run_broadcast_test() (pid=3943699, ip=0.0.0.0, actor_id=4af49119d131112dffd124a102000000, repr=<ray_test.DistributedWorker object at 0xffcfb9e32790>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/work/test-scripts/hcclcomminittest/ray_test.py", line 98, in run_broadcast_test
dist.broadcast(tensor, 0, group=self.process_group)
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
work = group.broadcast([tensor], opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:89 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm
)), error code is 1
[ERROR] 2025-03-12-19:16:40 (PID:3943699, Device:0, RankID:1) ERR02200 DIST call hccl api failed.

中文我读得懂,使用中文回答这个问题也可以

Python/Torch

Python 3.11.11
Torch 2.4.0
Torch-npu 2.4.0.post2

ASCEND TOOLKIT

version: 1.0
runtime_running_version=[7.6.0.1.220:8.0.0]
compiler_running_version=[7.6.0.1.220:8.0.0]
hccl_running_version=[7.6.0.1.220:8.0.0]
opp_running_version=[7.6.0.1.220:8.0.0]
toolkit_running_version=[7.6.0.1.220:8.0.0]
aoe_running_version=[7.6.0.1.220:8.0.0]
ncs_running_version=[7.6.0.1.220:8.0.0]
opp_kernel_running_version=[7.6.0.1.220:8.0.0]
hccl_upgrade_version=[7.6.0.1.220:8.0.0]
ncs_upgrade_version=[7.6.0.1.220:8.0.0]
runtime_upgrade_version=[7.6.0.1.220:8.0.0]
compiler_upgrade_version=[7.6.0.1.220:8.0.0]
opp_upgrade_version=[7.6.0.1.220:8.0.0]
toolkit_upgrade_version=[7.6.0.1.220:8.0.0]
aoe_upgrade_version=[7.6.0.1.220:8.0.0]
opp_kernel_upgrade_version=[7.6.0.1.220:8.0.0]
hccl_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
ncs_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
runtime_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
compiler_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
opp_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
toolkit_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
aoe_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]
opp_kernel_installed_version=[7.2.0.1.235:8.0.RC1][7.6.0.1.220:8.0.0]

Firmware

Version=23.0.rc2.2
ascendhal_version=7.25.6
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C30SPC002B220
compatible_version=[V100R001C30]
compatible_version_fw=[6.4.0,6.4.99]
package_version=23.0.rc2.2

Ascend-cann-atb : 8.0.0

Ascend-cann-atb Version : 8.0.0.B100
Platform : aarch64
branch : br_release_cann_8.0.0_20250521

commit id : af0ec2e868267322b4fb7949da3ae7af769d9644

** Other useful information**

I have tried many many different approaches and have consistently gotten this HcclCommInitRootInfo error. Some things that I have tried to debug/solve

  1. Launching with torchrun and the deepspeed launcher (no Ray) – same error
  2. Running as a Python script (no Ray) and using torch.multiprocessing for process management – same error
  3. Other collective functions (e.g. allgather) – same error
  4. The port works. I can start a server on it and connect to the server from multiple clients
  5. The torch NPU version of a process group - torch.distributed.ProcessGroupHCCL https://www.hiascend.com/document/detail/zh/Pytorch/600/apiref/apilist/ptaoplist_000350.html – I cannot import from either torch.distributed or torch_npu
  6. I have tried this somewhat related script from the torch_npu API and still get the same HcclCommInitRootInfo on hcomm_info = default_pg._get_backend(torch.device("npu")).get_hccl_comm_name(rank)

import torch
import torch_npu
import torch.multiprocessing as mp
import os
from torch.distributed.distributed_c10d import _get_default_group
import torch.distributed as dist
def example(rank, world_size):
torch.npu.set_device("npu:" + str(rank))
dist.init_process_group("hccl", rank=rank, world_size=world_size)
default_pg = _get_default_group()
if torch.version > '2.0':
hcomm_info = default_pg._get_backend(torch.device("npu")).get_hccl_comm_name(rank)
else:
hcomm_info = default_pg.get_hccl_comm_name(rank)
print(hcomm_info)
def main():
world_size = 2
mp.spawn(example, args=(world_size, ),
nprocs=world_size,
join=True)
if name == "main":
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29505"
main()

  1. I have used custom buffer space as is mentioned in the API – same error
  2. I have tried not initializing deepspeed.init_distributed first. If I use the default torch.distributed.init_process_group, I get the same HcclCommInitRootInfo error. If I use my custom init_process_group, I get an error because there is no default distributed process group
  3. I have tried unique and distinct group_names for the different process groups
  4. If there is only one process_group initialized, there is no problem for torch.distributed.broadcast()
    

full pip list

(test-broadcast) [ma-user test-scripts]$pip list
Package Version


absl-py 2.1.0
attrs 25.1.0
auto_tune 0.1.0
cloudpickle 3.1.1
dataflow 0.0.1
decorator 5.2.1
filelock 3.17.0
fsspec 2025.3.0
hccl 0.1.0
hccl_parser 0.1
Jinja2 3.1.6
llm_datadist 0.0.1
MarkupSafe 3.0.2
ml_dtypes 0.5.1
mpmath 1.3.0
msobjdump 0.1.0
networkx 3.4.2
npu_bridge 1.15.0
npu_device 0.1
numpy 1.26.4
op_compile_tool 0.1.0
op_gen 0.1
op_test_frame 0.1
opc_tool 0.1.0
pip 25.0.1
psutil 7.0.0
PyYAML 6.0.2
schedule_search 0.0.1
scipy 1.15.2
setuptools 65.5.0
show_kernel_debug_data 0.1.0
sympy 1.13.3
te 0.4.0
torch 2.4.0
torch-npu 2.4.0.post2
tornado 6.4.2
typing_extensions 4.12.2

Comments (12)

Ethan Kallett created需求 3 months ago

Full output with logging:

** Full output with log information **
Job submission server address: http://127.0.0.1:8265


Job 'raysubmit_gSAXCTNnF6JEn3Zd' submitted successfully

2025-03-12 19:16:31,260 - main - INFO - PrefixStore created: <class 'torch.distributed.distributed_c10d.PrefixStore'>
2025-03-12 19:16:31,260 INFO ray_test.py:196 -- PrefixStore created: <class 'torch.distributed.distributed_c10d.PrefixStore'> job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191260851999
2025-03-12 19:16:31,261 - main - INFO - Determining appropriate parameter name for PyTorch version 2.5.1
2025-03-12 19:16:31,261 INFO ray_test.py:205 -- Determining appropriate parameter name for PyTorch version 2.5.1 job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191261038752
2025-03-12 19:16:31,261 - main - INFO - Using parameter name: pg_options
2025-03-12 19:16:31,261 INFO ray_test.py:207 -- Using parameter name: pg_options job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191261225624
2025-03-12 19:16:31,261 - main - INFO - Importing _new_process_group_helper from torch.distributed.distributed_c10d
2025-03-12 19:16:31,261 INFO ray_test.py:210 -- Importing _new_process_group_helper from torch.distributed.distributed_c10d job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191261398087
2025-03-12 19:16:31,261 - main - INFO - Successfully imported _new_process_group_helper
2025-03-12 19:16:31,261 INFO ray_test.py:213 -- Successfully imported _new_process_group_helper job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191261575860
2025-03-12 19:16:31,261 - main - INFO - Creating process group with world_size=2, rank=1, backend=hccl
2025-03-12 19:16:31,261 INFO ray_test.py:219 -- Creating process group with world_size=2, rank=1, backend=hccl job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191261745842
2025-03-12 19:16:31,261 - main - INFO - Calling _new_process_group_helper with options: {'pg_options': None}
2025-03-12 19:16:31,261 INFO ray_test.py:222 -- Calling _new_process_group_helper with options: {'pg_options': None} job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191261917785
2025-03-12 19:16:31,262 - main - INFO - Process group created successfully: <torch.distributed.distributed_c10d.ProcessGroup object at 0xffd034448770>
2025-03-12 19:16:31,262 INFO ray_test.py:233 -- Process group created successfully: <torch.distributed.distributed_c10d.ProcessGroup object at 0xffd034448770> job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191262605005
2025-03-12 19:16:31,262 - main - INFO - Setting up group ranks mapping
2025-03-12 19:16:31,262 INFO ray_test.py:241 -- Setting up group ranks mapping job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191262862519
2025-03-12 19:16:31,263 - main - INFO - Successfully imported _world from torch.distributed.distributed_c10d
2025-03-12 19:16:31,263 INFO ray_test.py:244 -- Successfully imported world from torch.distributed.distributed_c10d job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191263055432
2025-03-12 19:16:31,263 - main - INFO - Setting up pg_group_ranks for 2 ranks
2025-03-12 19:16:31,263 INFO ray_test.py:245 -- Setting up pg_group_ranks for 2 ranks job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191263231555
2025-03-12 19:16:31,263 - main - INFO - Group ranks mapping set up: {0: 0, 1: 1}
2025-03-12 19:16:31,263 INFO ray_test.py:247 -- Group ranks mapping set up: {0: 0, 1: 1} job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191263414077
2025-03-12 19:16:31,263 - main - INFO - Process group details: type=<class 'torch.distributed.distributed_c10d.ProcessGroup'>, id=281269695186800
2025-03-12 19:16:31,263 INFO ray_test.py:255 -- Process group details: type=<class 'torch.distributed.distributed_c10d.ProcessGroup'>, id=281269695186800 job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191263614940
2025-03-12 19:16:31,263 - main - INFO - Worker 1 custom process group created successfully
2025-03-12 19:16:31,263 INFO ray_test.py:76 -- Worker 1 custom process group created successfully job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=f81ec6ff838b16db4af49119d131112dffd124a102000000 task_name=DistributedWorker.create_process_group task_func_name=ray_test.DistributedWorker.create_process_group actor_name= timestamp_ns=1741778191263798113
2025-03-12 19:16:31,265 - main - INFO - Worker 1 running broadcast test
2025-03-12 19:16:31,265 INFO ray_test.py:87 -- Worker 1 running broadcast test job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=80b655a2d9b04d404af49119d131112dffd124a102000000 task_name=DistributedWorker.run_broadcast_test task_func_name=ray_test.DistributedWorker.run_broadcast_test actor_name= timestamp_ns=1741778191265669131
[rank0]:[W312 19:16:37.791408743 compiler_depend.ts:37] Warning: A common user is using the files of the root user. (function operator())
2025-03-12 19:16:37,972 - main - INFO - Worker 0 tensor before broadcast: tensor([[0., 0., 0.],
[0., 0., 0.]], device='npu:0')
2025-03-12 19:16:37,972 INFO ray_test.py:95 -- Worker 0 tensor before broadcast: tensor([[0., 0., 0.],
[0., 0., 0.]], device='npu:0') job_id=02000000 worker_id=13a80b25a791939119d729695a9d26d00b7b79a77b593fa7ca143f46 node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=cb51f3e20fca42956554742402000000 task_id=32b0eec39cfa87accb51f3e20fca42956554742402000000 task_name=DistributedWorker.run_broadcast_test task_func_name=ray_test.DistributedWorker.run_broadcast_test actor_name= timestamp_ns=1741778197972416343
2025-03-12 19:16:38,057 - main - INFO - Worker 1 tensor before broadcast: tensor([[1., 1., 1.],
[1., 1., 1.]], device='npu:0')
2025-03-12 19:16:38,057 INFO ray_test.py:95 -- Worker 1 tensor before broadcast: tensor([[1., 1., 1.],
[1., 1., 1.]], device='npu:0') job_id=02000000 worker_id=0ea6a391d729fd06cc9ddd35588e04ba20d4953667b23190b2ad351a node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897 actor_id=4af49119d131112dffd124a102000000 task_id=80b655a2d9b04d404af49119d131112dffd124a102000000 task_name=DistributedWorker.run_broadcast_test task_func_name=ray_test.DistributedWorker.run_broadcast_test actor_name= timestamp_ns=1741778198057905953
2025-03-12 19:16:40,149 ERROR ray_test.py:314 -- Error during Ray-based distributed training: ray::DistributedWorker.run_broadcast_test() (pid=3943699, ip=0.0.0.0, actor_id=4af49119d131112dffd124a102000000, repr=<ray_test.DistributedWorker object at 0xffcfb9e32790>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/work/test-scripts/hcclcomminittest/ray_test.py", line 98, in run_broadcast_test
dist.broadcast(tensor, 0, group=self.process_group)
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ma-user/venv-vllm-v3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
work = group.broadcast([tensor], opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:89 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm
)), error code is 1
[ERROR] 2025-03-12-19:16:40 (PID:3943699, Device:0, RankID:1) ERR02200 DIST call hccl api failed. job_id=02000000 worker_id=02000000ffffffffffffffffffffffffffffffffffffffffffffffff node_id=129b56433d50290241412abea74d6abf2b5d38363458bbf181e3a897

Ethan Kallett changed description 3 months ago
Ethan Kallett changed description 3 months ago

Customized implementation of torch.distributed.init_process_group

This implementation uses a TCPStore-based PrefixCache to coordinate these process groups. It is taken from https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/utils/distributed_util.py and leverages very heavily the original init_process_group code from https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L365

def init_process_group(
backend: Union[str, Backend] = None,
init_method: Optional[str] = None,
timeout: Optional[timedelta] = None,
world_size: int = -1,
rank: int = -1,
store: Optional[Store] = None,
group_name: str = None,
pg_options: Optional[Any] = None,
):
assert (store is None) or (init_method is None), "Cannot specify both init_method and store."

if store is not None:
    assert world_size > 0, "world_size must be positive if using store"
    assert rank >= 0, "rank must be non-negative if using store"
elif init_method is None:
    init_method = "env://"

if backend:
    backend = Backend(backend)
else:
    backend = Backend("undefined")

if timeout is None:
    timeout = default_pg_timeout

# backward compatible API
if store is None:
    rendezvous_iterator = rendezvous(init_method, rank, world_size, timeout=timeout)
    store, rank, world_size = next(rendezvous_iterator)
    store.set_timeout(timeout)

    # Use a PrefixStore to avoid accidental overrides of keys used by
    # different systems (e.g. RPC) in case the store is multi-tenant.
    store = PrefixStore(group_name, store)

# NOTE: The pg_options parameter was renamed into backend_options in PyTorch 2.6.0
# https://github.com/pytorch/pytorch/commit/a0c7029a75628cd5fa8df83c0de0ea98ee7fd844
# We need to determine the appropriate parameter name based on PyTorch version
pg_options_param_name = "backend_options" if str(torch.__version__) >= "2.6" else "pg_options"
pg, _ = _new_process_group_helper(
    world_size,
    rank,
    [],
    backend,
    store,
    group_name=group_name,
    **{pg_options_param_name: pg_options},
    timeout=timeout,
)

_world.pg_group_ranks[pg] = {i: i for i in range(world_size)}

return pg
Ethan Kallett changed description 3 months ago

麻烦提供下详细plog日志

[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.474 [hccl_ip_address.cc:76][2346109]if name is null.
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.479 [adapter_hccp.cc:1551][2346109]call trace: hcclRet -> 1
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.485 [topoinfo_detect.cc:561][2346109]call trace: hcclRet -> 1
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.489 [topoinfo_detect.cc:259][2346109][Setup][Agent]topo detect generate local rank info failed! rank[0]
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.493 [op_base.cc:841][2346109][Init][CommRootInfo]errNo[0x0000000005000001] setup topo detect error
[ERROR] HCCP(2346109,):2025-03-13-17:34:23.457.508 [rs_socket.c:2744]tid:2346109,rs_peer_get_ifnum(2744) : param error, g_rs_cb is NULL
[ERROR] HCCP(2346109,):2025-03-13-17:34:23.457.511 [ra_peer.c:964]tid:2346109,ra_peer_get_ifnum(964) : [get][ra_peer_ifnum]rs_peer_get_ifnum failed(-22)
[ERROR] HCCP(2346109,):2025-03-13-17:34:23.457.528 [ra_host.c:1930]tid:2346109,ra_get_ifnum(1930) : [get][ra_ifnum]ra_peer_get_ifnum failed, ret(-22)
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.533 [adapter_hccp.cc:1492][2346109][Get][IfNum]errNo[0x000000000500000b] ra get if num fail. ret[128303], num[0]
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.536 [adapter_hccp.cc:1355][2346109]call trace: hcclRet -> 11
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.540 [sal.cc:346][2346109]call trace: hcclRet -> 11
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.545 [op_base.cc:922][2346109][InitCommRootInfo]Init failed, return[0x0000000005000001], rankNum[2], rank[0], rootInfo identifier[172.16.10.219%eth0_60004_4_1741858461689369], server[0.0.0.0], logicDevId[-1]
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.575 [hccl_comm.cc:992][2346109]errNo[0x0000000005000002] ptr [communicator_] is nullptr, return HCCL_E_PTR
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.598 [op_base.cc:1816][2346109][HcclCommDestroy] comm is not exist, comm=0x555e2d80, group=172.16.10.219%eth0_60004_4_1741858461689369, deviceLogicId=0
[ERROR] HCCL(2346109,):2025-03-13-17:34:23.457.620 [op_base.cc:974][2346109][Init][CommRootInfoConfig]errNo[0x0000000005000001]HcclCommInitRootInfo failed.

这些是一些Info-level PLOG的日志片段, 所有Error的日志在上面。按照日志,找不到RANK_TABLE_FILE,因为这次因为环境变量未设置。可是,RANK_TABLE_FILE这个环境变量我用过,遇到一样的报错

[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.907.958 [stream.cc:2899] 2433625 StarsWaitForTask: No.0 stream_id=2, taskId=26, currentId=26
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.907.962 [api_error.cc:752] 2433625 StreamSynchronize: stream synchronize exit, stream_id=2, result=0x0
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:24.907.965 [stream.cpp:123]2433625 aclrtSynchronizeStream: Synchronize stream success
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:24.907.967 [memory.cpp:289]2433625 aclrtMemcpy: start to execute aclrtMemcpy, destMaxSize = 12, srcSize = 12, kind = 2
[INFO] APP(2433625,):2025-03-13-18:01:24.909.715 [log_inner.cpp:82]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:createHCCLCommEx:1215: "[PTA]:"The rank_table_file is not available, switch to original interface.""
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.079 [runtime.cc:5559] 2433625 isNeedChangeDeviceId: userDevId=1, isSetVisibleDev=1 isDeviceSetResetOp=0 curCtx=0x38106840 ContextMode=0.
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.086 [runtime.cc:5601] 2433625 GetUserDevIdByDeviceId: userDevId=0, deviceId=1
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.102 [adapter_rts.cc:197][2433625][hrtGetDeviceRefresh]deviceLogicId[0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.124 [externalinput.cc:2061][2433625]HCCL_LOG_CONFIG set by default to [0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.147 [externalinput.cc:622][2433625]environmental variable PROFILING_MODE and GE profiling option is not set, default: false
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.177 [externalinput.cc:728][2433625]HCCL_DEVICE_NIC_DISABLE set by default to [0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.181 [externalinput.cc:758][2433625]HCCL_HOST_RDMA_ENABLE set by default to [0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.194 [externalinput.cc:1672][2433625][CannVersion][Verification]environmental variable LD_LIBRARY_PATH is too long, actual len[1575], maxlen[1024]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.199 [externalinput.cc:462][2433625]HCCL_HIGH_PERF_ENABLE set by default to [0]
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.910.295 [dlhal_function.cc:112][2433625]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[WARNING] RUNTIME(2433625,):2025-03-13-18:01:24.910.319 [config.cc:834] 2433625 ReadHeterogenousModeFromConfigIni: read ASCEND_LATEST_INSTALL_PATH failed! isHeterogenous=0.
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.324 [api_c.cc:3034] 2433625 rtGetSocVersion: soc version is Ascend910B4
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.346 [externalinput.cc:1808][2433625]HCCL_OP_BASE_FFTS_MODE_ENABLE set by default to [true]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.350 [externalinput.cc:515][2433625]HCCL_DETERMINISTIC set by default to [false]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.362 [externalinput.cc:1944][2433625]HCCL_INTER_HCCS_DISABLE is not set, default value is 0.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.372 [externalinput.cc:1964][2433625]HCCL_INTER_VNIC_DISABLE is not set, default value is 0.
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.910.395 [dlhal_function.cc:112][2433625]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.410 [externalinput.cc:2032][2433625]environmental variable ASCEND_MC2_DEBUG_MODE is not set
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.416 [externalinput.cc:2205][2433625]HCCL_CONCURRENT_ENABLE is not set, default value is [0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.425 [externalinput.cc:889][2433625]HCCL_SEND_CQ_DEPTH set by default to [8192]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.428 [externalinput.cc:914][2433625]HCCL_DFT is not set, default value is 1.
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.910.454 [externalinput.cc:672][2433625][Parse][MultiQpSrcPortConfigPath]environmental variable HCCL_RDMA_QP_PORT_CONFIG_PATH is empty,but HCCL_RDMA_QP_PORT_CONFIG_PATH is not set
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.475 [op_base.cc:82][2433625][HcclGetDeviceId] deviceLogicId[0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.479 [op_base.cc:97][2433625][GetHcclOpInfoCtx] Set device, use g_hcclDeviceId[0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.910.502 [rank_consistent.cc:29][2433625][GetInstance] get deviceLogicId[0]
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.523 [runtime.cc:5559] 2433625 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=0 curCtx=0x38106840 ContextMode=0.
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.527 [runtime.cc:5586] 2433625 ChgUserDevIdToDeviceId: devId=0, realDevId=1
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.530 [api_impl.cc:2575] 2433625 GetDevicePhyIdByIndex: get PhyId by Index=1.
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.534 [api_impl.cc:2586] 2433625 GetDevicePhyIdByIndex: GetDevicePhyIdByIndex, device_mode=0, device_id=1
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.910.537 [api_impl.cc:2605] 2433625 GetDevicePhyIdByIndex: cloud v2, device_mode=0, device_id=1
[INFO] HCCL(2433625,):2025-03-13-18:01:24.915.860 [network_manager.cc:65][2433625]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433625,):2025-03-13-18:01:24.915.885 [network_manager.cc:108][2433625]NetworkManager: devicePhyId[5], devicePhyId_[5] deviceLogicId_[0].
[INFO] HCCL(2433625,):2025-03-13-18:01:24.915.915 [adapter_hccp.cc:1428][2433625]set host socket whitelist status[0] success.
[INFO] ROCE(2433625,):2025-03-13-18:01:24.916.145 [dl_hal_function.c:233]dl_hal_init(233) : dl_hal_init success!
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.152 [ra_peer.c:876]tid:2433625,ra_peer_init(876) : [init][ra_peer]ra_peer_init phy_id[5] start
[INFO] ROCE(2433625,):2025-03-13-18:01:24.916.164 [dl_hal_function.c:218]dl_hal_init(218) : dl_hal_init success, no need to dlopen libascend_hal.so!
[INFO] ROCE(2433625,):2025-03-13-18:01:24.916.295 [dl_hal_function.c:233]dl_hal_init(233) : dl_hal_init success!
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.309 [rs.c:326]tid:2433625,rs_pthread_mutex_init(326) : mutex init ok
[WARNING] ROCE(2433625,):2025-03-13-18:01:24.916.441 [network_comm.c:149]get_saved_tls_config_file_path(149) : conf_path[/home/ma-user/cert] is invalid, reason[-2].
[WARNING] ROCE(2433625,):2025-03-13-18:01:24.916.459 [network_comm.c:160]get_saved_tls_config_file_path(160) : conf_path[/home/HwDmUser/cert] is invalid, reason[-2].
[WARNING] HCCP(2433625,):2025-03-13-18:01:24.916.464 [rs_ssl.c:1508]tid:2433625,rs_ssl_init(1508) : can not get manage data, ret(-2). maybe not set tls
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.497 [rs_epoll.c:667]tid:2433625,rs_epoll_connect_handle_init(667) : rs_create_epoll ok
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.647 [rs_epoll.c:462]tid:2434788,rs_epoll_handle(462) : thread begin! thread_id:281265738138080, pid:2433625, ppid:2421928
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.676 [rs_epoll.c:688]tid:2433625,rs_epoll_connect_handle_init(688) : RS INIT OK!
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.689 [rs_epoll.c:399]tid:2434788,bind_data_cpu(399) : host not need bind cpu, info[1]
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.682 [rs.c:365]tid:2433625,rs_init_rscb_cfg(365) : rs_epoll_connect_handle_init ok cost [0.209000] ms
[INFO] HCCP(2433625,):2025-03-13-18:01:24.916.708 [rs_epoll.c:596]tid:2434789,rs_connect_handle(596) : thread begin! thread_id:281265729745376, pid:2433625, ppid:2421928
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.474 [ra_peer.c:913]tid:2433625,ra_peer_init(913) : [init][ra_peer]ra_peer_init phy_id[5] succ
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.480 [adapter_hccp.cc:583][2433625]init ra success.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.485 [network_manager.cc:170][2433625]NetworkManager nicDeploy[0] deviceLogicId_[0] devicePhyId_[5] init ra OK, nicSocketMap size[0], isHostUseDevNic[0]
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.506 [ra_peer.c:959]tid:2433625,ra_peer_get_ifnum(959) : [get][ra_peer_ifnum]ra_peer_get_ifnum phy_id[5] start
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.521 [ra_peer.c:967]tid:2433625,ra_peer_get_ifnum(967) : [get][ra_peer_ifnum]ra_peer_get_ifnum phy_id[5] succ
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.536 [ra_peer.c:974]tid:2433625,ra_peer_get_ifaddrs(974) : [get][ra_peer_ifaddrs] ra_peer_get_ifaddrs phy_id[5] start
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.543 [rs_socket.c:2558]tid:2433625,rs_peer_fill_ifaddr_infos(2558) : ifname[lo] addr[0x0100007f]
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.547 [rs_socket.c:2558]tid:2433625,rs_peer_fill_ifaddr_infos(2558) : ifname[eth0] addr[0xdb0a10ac]
[INFO] HCCP(2433625,):2025-03-13-18:01:24.917.551 [ra_peer.c:982]tid:2433625,ra_peer_get_ifaddrs(982) : [get][ra_peer_ifaddrs] ra_peer_get_ifaddrs phy_id[5] succ
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.563 [adapter_hccp.cc:1385][2433625][Get][HostIf]hrtGetIfAddress: idx[0] ifname[lo] ip[127.0.0.1%lo]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.568 [adapter_hccp.cc:1385][2433625][Get][HostIf]hrtGetIfAddress: idx[1] ifname[eth0] ip[172.16.10.219%eth0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.589 [network_manager.cc:65][2433625]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.618 [adapter_hccp.cc:809][2433625]socket init v1 success, socketHandle[0x4a7077a0]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.621 [network_manager.cc:1109][2433625]ip[172.16.10.219%eth0] socket init OK
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.625 [network_manager.cc:1148][2433625]ip[172.16.10.219%eth0] socket start success socketHandle[0x4a7077a0]
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.917.740 [dlhal_function.cc:112][2433625]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.917.767 [dlhal_function.cc:112][2433625]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.775 [adapter_hal.cc:289][2433625]Entry-hrtDrvGetPlatformInfo
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.917.797 [dlhal_function.cc:112][2433625]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.810 [sal.cc:593][2433625]IsHostUseDevNic[1]
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.813 [network_manager.cc:65][2433625]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433625,):2025-03-13-18:01:24.917.817 [network_manager.cc:108][2433625]NetworkManager: devicePhyId[4294967295], devicePhyId_[5] deviceLogicId_[0].
[INFO] TDT(2433625,):2025-03-13-18:01:24.917.852 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2433625] [TsdClient] change userDevId [0] to logicDevId [1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.917.858 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2433625] [TsdClient] change userDevId [0] to logicDevId [1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.917.866 [process_mode_manager.cpp:903][CheckNeedToOpen][tid:2433625] [TsdClient] get open para startCp[1] startHccp[1] (0:false 1:true)
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.079 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[..], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.152 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[scene.info], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.197 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[ascend_install.info], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.240 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[version.info], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.279 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[.], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.316 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[Ascend-aicpu_syskernels.tar.gz], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.392 [client_manager.cpp:263][CheckPackageExists][tid:2433625] [TsdClient] find aicpu package[Ascend-aicpu_syskernels.tar.gz] in path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.397 [client_manager.cpp:260][CheckPackageExists][tid:2433625] [TsdClient] check file[Ascend-aicpu_extend_syskernels.tar.gz], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.492 [client_manager.cpp:267][CheckPackageExists][tid:2433625] [TsdClient] find extend package[Ascend-aicpu_extend_syskernels.tar.gz] in path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], deviceId[1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.511 [client_manager.cpp:282][CheckPackageExists][tid:2433625] [TsdClient][deviceId=1] set extend package path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], name[Ascend-aicpu_extend_syskernels.tar.gz]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.515 [client_manager.cpp:285][CheckPackageExists][tid:2433625] [TsdClient][deviceId=1] set aicpu package path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], name[Ascend-aicpu_syskernels.tar.gz]
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.560 [process_mode_manager.cpp:528][InitTsdClient][tid:2433625] [TsdClient] tsd client has been inited before
[INFO] TDT(2433625,):2025-03-13-18:01:24.918.623 [hdc_client.cpp:154][TsdRecvData][tid:2433625] recv timeout is:150000 ms
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.951.453 [runtime.cc:5559] 2434289 isNeedChangeDeviceId: userDevId=1, isSetVisibleDev=1 isDeviceSetResetOp=1 curCtx=0x0 ContextMode=0.
[INFO] RUNTIME(2433625,):2025-03-13-18:01:24.951.468 [runtime.cc:5601] 2434289 GetUserDevIdByDeviceId: userDevId=0, deviceId=1
[INFO] TDT(2433625,):2025-03-13-18:01:24.951.474 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2434289] [TsdClient] change userDevId [0] to logicDevId [1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.951.480 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2434289] [TsdClient] change userDevId [0] to logicDevId [1]
[INFO] TDT(2433625,):2025-03-13-18:01:24.951.484 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2434289] [TsdClient] change userDevId [0] to logicDevId [1]

[INFO] HCCL(2433625,):2025-03-13-18:01:24.081.979 [plugin_manager.cc:25][2433625]hcom ops plugin init start.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.081.993 [plugin_manager.cc:38][2433625]input BUILD_MODE is not set, result autoTune is false.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.017 [hcom_plugin.cc:50][2433625]Initialize start.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.021 [hcom_plugin.cc:52][2433625]initializeCount_ : 1
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.053 [hcom_plugin.cc:60][2433625]hvdOpsKernelInfoStoreInfoPtr_ is initialized
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.066 [hcom_plugin.cc:63][2433625]hvdGraphOptimizerPtr_ is initialized
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.071 [hcom_plugin.cc:329][2433625]ParserHcclAlgoDesc: there is no key named "HCCL_algorithm" in the options.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.076 [hcom_plugin.cc:281][2433625]ParserHcclExecTimeOut: there is no key named "HCCL_execTimeOut" in the options.
[WARNING] HCCL(2433625,):2025-03-13-18:01:24.082.089 [hcom_plugin.cc:217][2433625]option profiling mode is not found.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.094 [hcom_plugin.cc:164][2433625][masterInfo]get Master Info:ip or port or deviceId not set.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.098 [hcom_plugin.cc:180][2433625][masterInfo]get Rank Info: worker size not set.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.102 [hcom_plugin.cc:244][2433625]InitializePlugin: Init without ranktable and masterInfo, please check your mode.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.208 [hcom_plugin.cc:82][2433625]hccl ops plugin init success.
[INFO] HCCL(2433625,):2025-03-13-18:01:24.082.213 [plugin_manager.cc:48][2433625]hcom ops plugin init end. ret[0]
[INFO] GE(2433625,):2025-03-13-18:01:24.082.221 [dnnengine_manager.cc:41]2433625 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [Initialize] in [libhcom_graph_adaptor.so] is [246] micro second.
[INFO] AICPU(2433625,):2025-03-13-18:01:24.082.353 [util.cc:84][GetSoPath][tid:2433625]:"So file path=/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/lib64/plugin/opskernel/libhost_cpu_engine.so"

[INFO] ASCENDCL(2433625,):2025-03-13-18:01:16.097.674 [device.cpp:345]2433625 aclrtGetDeviceCount: successfully execute aclrtGetDeviceCount, get device count is 1.
[WARNING] APP(2433625,):2025-03-13-18:01:16.107.463 [log_inner.cpp:79]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:get_custom_lib_path:87: "[PTA]:"ASCEND_CUSTOM_OPP_PATH is not exists""
[INFO] APP(2433625,):2025-03-13-18:01:16.109.204 [log_inner.cpp:82]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:GetDispatchTimeout:484: "[PTA]:"set dispatchTimeout_ 600 s.""
[INFO] APP(2433625,):2025-03-13-18:01:17.742.015 [log_inner.cpp:82]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ProcessGroupHCCL:675: "[PTA]:"Get env HCCL_EXEC_TIMEOUT value 0, and set op wait timeout to 1868.""
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.742.742 [event.cpp:248]2433625 aclrtSetOpWaitTimeout: start to execute aclrtSetOpWaitTimeout, timeout = 1868s
[INFO] RUNTIME(2433625,):2025-03-13-18:01:17.742.781 [api_impl.cc:5183] 2433625 SetOpWaitTimeOut: set OP wait timeout:1868.
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.743.844 [event.cpp:250]2433625 aclrtSetOpWaitTimeout: successfully execute aclrtSetOpWaitTimeout, timeout = 1868s
[INFO] APP(2433625,):2025-03-13-18:01:17.743.984 [log_inner.cpp:82]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ProcessGroupHCCL:742: "[PTA]:"process group created, group id is 0.""
[INFO] APP(2433625,):2025-03-13-18:01:17.753.782 [log_inner.cpp:82]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ProcessGroupHCCL:675: "[PTA]:"Get env HCCL_EXEC_TIMEOUT value 0, and set op wait timeout to 1868.""
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.753.789 [event.cpp:248]2433625 aclrtSetOpWaitTimeout: start to execute aclrtSetOpWaitTimeout, timeout = 1868s
[INFO] RUNTIME(2433625,):2025-03-13-18:01:17.753.795 [api_impl.cc:5183] 2433625 SetOpWaitTimeOut: set OP wait timeout:1868.
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.753.800 [event.cpp:250]2433625 aclrtSetOpWaitTimeout: successfully execute aclrtSetOpWaitTimeout, timeout = 1868s
[INFO] APP(2433625,):2025-03-13-18:01:17.753.881 [log_inner.cpp:82]2433625 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:ProcessGroupHCCL:742: "[PTA]:"process group created, group id is ray_test_group.""
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.789.431 [acl.cpp:257]2433625 aclInit: start to execute aclInit
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.789.450 [acl.cpp:266]2433625 aclInit: call ErrorManager.Initialize
[INFO] GE(2433625,):2025-03-13-18:01:17.798.586 [error_manager.cc:636]2433625 ParseJsonFile:Parse json file:/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/lib64/../conf/error_manager/error_code.json success
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.799.265 [acl.cpp:283]2433625 aclInit: set DumpConfig in aclInit
[INFO] ASCENDCL(2433625,):2025-03-13-18:01:17.799.272 [dump.cpp:807]2433625 HandleDumpConfig: start to execute HandleDumpConfig.

[INFO] ASCENDCL(2433619,):2025-03-13-18:01:24.844.203 [memory.cpp:289]2433619 aclrtMemcpy: start to execute aclrtMemcpy, destMaxSize = 12, srcSize = 12, kind = 2
[INFO] APP(2433619,):2025-03-13-18:01:24.846.457 [log_inner.cpp:82]2433619 build/CMakeFiles/torch_npu.dir/compiler_depend.ts:createHCCLCommEx:1215: "[PTA]:"The rank_table_file is not available, switch to original interface.""
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.846.569 [runtime.cc:5559] 2433619 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=0 curCtx=0x4bb730f0 ContextMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.846.575 [runtime.cc:5601] 2433619 GetUserDevIdByDeviceId: userDevId=0, deviceId=0
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.591 [adapter_rts.cc:197][2433619][hrtGetDeviceRefresh]deviceLogicId[0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.621 [externalinput.cc:2061][2433619]HCCL_LOG_CONFIG set by default to [0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.661 [externalinput.cc:622][2433619]environmental variable PROFILING_MODE and GE profiling option is not set, default: false
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.700 [externalinput.cc:728][2433619]HCCL_DEVICE_NIC_DISABLE set by default to [0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.705 [externalinput.cc:758][2433619]HCCL_HOST_RDMA_ENABLE set by default to [0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.721 [externalinput.cc:1672][2433619][CannVersion][Verification]environmental variable LD_LIBRARY_PATH is too long, actual len[1575], maxlen[1024]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.726 [externalinput.cc:462][2433619]HCCL_HIGH_PERF_ENABLE set by default to [0]
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.846.861 [dlhal_function.cc:112][2433619]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[WARNING] RUNTIME(2433619,):2025-03-13-18:01:24.846.889 [config.cc:834] 2433619 ReadHeterogenousModeFromConfigIni: read ASCEND_LATEST_INSTALL_PATH failed! isHeterogenous=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.846.895 [api_c.cc:3034] 2433619 rtGetSocVersion: soc version is Ascend910B4
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.922 [externalinput.cc:1808][2433619]HCCL_OP_BASE_FFTS_MODE_ENABLE set by default to [true]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.927 [externalinput.cc:515][2433619]HCCL_DETERMINISTIC set by default to [false]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.942 [externalinput.cc:1944][2433619]HCCL_INTER_HCCS_DISABLE is not set, default value is 0.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.846.967 [externalinput.cc:1964][2433619]HCCL_INTER_VNIC_DISABLE is not set, default value is 0.
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.846.991 [dlhal_function.cc:112][2433619]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433619,):2025-03-13-18:01:24.847.007 [externalinput.cc:2032][2433619]environmental variable ASCEND_MC2_DEBUG_MODE is not set
[INFO] HCCL(2433619,):2025-03-13-18:01:24.847.015 [externalinput.cc:2205][2433619]HCCL_CONCURRENT_ENABLE is not set, default value is [0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.847.025 [externalinput.cc:889][2433619]HCCL_SEND_CQ_DEPTH set by default to [8192]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.847.030 [externalinput.cc:914][2433619]HCCL_DFT is not set, default value is 1.
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.847.060 [externalinput.cc:672][2433619][Parse][MultiQpSrcPortConfigPath]environmental variable HCCL_RDMA_QP_PORT_CONFIG_PATH is empty,but HCCL_RDMA_QP_PORT_CONFIG_PATH is not set
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.847.079 [runtime.cc:5559] 2433619 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=0 curCtx=0x4bb730f0 ContextMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.847.083 [runtime.cc:5586] 2433619 ChgUserDevIdToDeviceId: devId=0, realDevId=0
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.847.086 [api_impl.cc:2575] 2433619 GetDevicePhyIdByIndex: get PhyId by Index=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.847.090 [api_impl.cc:2586] 2433619 GetDevicePhyIdByIndex: GetDevicePhyIdByIndex, device_mode=0, device_id=0
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.847.093 [api_impl.cc:2605] 2433619 GetDevicePhyIdByIndex: cloud v2, device_mode=0, device_id=0
[INFO] HCCL(2433619,):2025-03-13-18:01:24.847.107 [topoinfo_detect.cc:123][2433619][Setup][hcclIfBasePort]deviceLogicID_[0], devicePhysicID_[4]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.857.718 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.857.768 [network_manager.cc:108][2433619]NetworkManager: devicePhyId[4], devicePhyId_[4] deviceLogicId_[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.858.151 [adapter_hccp.cc:1428][2433619]set host socket whitelist status[0] success.
[INFO] ROCE(2433619,):2025-03-13-18:01:24.858.421 [dl_hal_function.c:233]dl_hal_init(233) : dl_hal_init success!
[INFO] HCCP(2433619,):2025-03-13-18:01:24.858.426 [ra_peer.c:876]tid:2433619,ra_peer_init(876) : [init][ra_peer]ra_peer_init phy_id[4] start
[INFO] ROCE(2433619,):2025-03-13-18:01:24.858.452 [dl_hal_function.c:218]dl_hal_init(218) : dl_hal_init success, no need to dlopen libascend_hal.so!
[INFO] ROCE(2433619,):2025-03-13-18:01:24.858.895 [dl_hal_function.c:233]dl_hal_init(233) : dl_hal_init success!
[INFO] HCCP(2433619,):2025-03-13-18:01:24.858.913 [rs.c:326]tid:2433619,rs_pthread_mutex_init(326) : mutex init ok
[WARNING] ROCE(2433619,):2025-03-13-18:01:24.859.361 [network_comm.c:149]get_saved_tls_config_file_path(149) : conf_path[/home/ma-user/cert] is invalid, reason[-2].
[WARNING] ROCE(2433619,):2025-03-13-18:01:24.862.772 [network_comm.c:160]get_saved_tls_config_file_path(160) : conf_path[/home/HwDmUser/cert] is invalid, reason[-2].
[WARNING] HCCP(2433619,):2025-03-13-18:01:24.862.783 [rs_ssl.c:1508]tid:2433619,rs_ssl_init(1508) : can not get manage data, ret(-2). maybe not set tls
[INFO] HCCP(2433619,):2025-03-13-18:01:24.862.829 [rs_epoll.c:667]tid:2433619,rs_epoll_connect_handle_init(667) : rs_create_epoll ok
[INFO] HCCP(2433619,):2025-03-13-18:01:24.862.979 [rs_epoll.c:462]tid:2434776,rs_epoll_handle(462) : thread begin! thread_id:281265388868064, pid:2433619, ppid:2421928
[INFO] HCCP(2433619,):2025-03-13-18:01:24.862.991 [rs_epoll.c:688]tid:2433619,rs_epoll_connect_handle_init(688) : RS INIT OK!
[INFO] HCCP(2433619,):2025-03-13-18:01:24.862.997 [rs.c:365]tid:2433619,rs_init_rscb_cfg(365) : rs_epoll_connect_handle_init ok cost [0.205000] ms
[INFO] HCCP(2433619,):2025-03-13-18:01:24.863.029 [rs_epoll.c:596]tid:2434777,rs_connect_handle(596) : thread begin! thread_id:281265380475360, pid:2433619, ppid:2421928
[INFO] HCCP(2433619,):2025-03-13-18:01:24.863.106 [rs_epoll.c:399]tid:2434776,bind_data_cpu(399) : host not need bind cpu, info[1]
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.003 [ra_peer.c:913]tid:2433619,ra_peer_init(913) : [init][ra_peer]ra_peer_init phy_id[4] succ
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.015 [adapter_hccp.cc:583][2433619]init ra success.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.021 [network_manager.cc:170][2433619]NetworkManager nicDeploy[0] deviceLogicId_[0] devicePhyId_[4] init ra OK, nicSocketMap size[0], isHostUseDevNic[0]
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.083 [ra_peer.c:959]tid:2433619,ra_peer_get_ifnum(959) : [get][ra_peer_ifnum]ra_peer_get_ifnum phy_id[4] start
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.107 [ra_peer.c:967]tid:2433619,ra_peer_get_ifnum(967) : [get][ra_peer_ifnum]ra_peer_get_ifnum phy_id[4] succ
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.137 [ra_peer.c:974]tid:2433619,ra_peer_get_ifaddrs(974) : [get][ra_peer_ifaddrs] ra_peer_get_ifaddrs phy_id[4] start
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.160 [rs_socket.c:2558]tid:2433619,rs_peer_fill_ifaddr_infos(2558) : ifname[lo] addr[0x0100007f]
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.164 [rs_socket.c:2558]tid:2433619,rs_peer_fill_ifaddr_infos(2558) : ifname[eth0] addr[0xdb0a10ac]
[INFO] HCCP(2433619,):2025-03-13-18:01:24.864.168 [ra_peer.c:982]tid:2433619,ra_peer_get_ifaddrs(982) : [get][ra_peer_ifaddrs] ra_peer_get_ifaddrs phy_id[4] succ
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.191 [adapter_hccp.cc:1385][2433619][Get][HostIf]hrtGetIfAddress: idx[0] ifname[lo] ip[127.0.0.1%lo]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.197 [adapter_hccp.cc:1385][2433619][Get][HostIf]hrtGetIfAddress: idx[1] ifname[eth0] ip[172.16.10.219%eth0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.239 [topoinfo_detect.cc:180][2433619]rootInfo: ip[172.16.10.219] port[60004] identifier[172.16.10.219%eth0_60004_4_1741860084864228]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.248 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.973 [adapter_hccp.cc:809][2433619]socket init v1 success, socketHandle[0x5deabc20]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.977 [network_manager.cc:1109][2433619]ip[172.16.10.219%eth0] socket init OK
[INFO] HCCL(2433619,):2025-03-13-18:01:24.864.981 [network_manager.cc:1148][2433619]ip[172.16.10.219%eth0] socket start success socketHandle[0x5deabc20]
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.865.046 [dlhal_function.cc:112][2433619]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.059 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.100 [rs_socket.c:823]tid:2433619,rs_socket_init_listen(823) : listen [0] IP 0xffffdb0a10ac, ret_vnic -1
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.106 [rs_socket.c:94]tid:2433619,rs_find_listen_node(94) : listen node for IP(172.16.10.219) not listen!
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.110 [rs_socket.c:94]tid:2433619,rs_find_listen_node(94) : listen node for IP(172.16.10.219) not listen!
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.113 [rs_socket.c:145]tid:2433619,rs_listen_node_alloc(145) : create listen node for IP(172.16.10.219)!
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.134 [rs_socket.c:762]tid:2433619,rs_socket_listen_bind_listen(762) : listen state:1, then bind for IP(172.16.10.219) port 60004
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.141 [rs_socket.c:771]tid:2433619,rs_socket_listen_bind_listen(771) : socket bind: family 2, port 25834, addr 0xdb0a10ac
[INFO] HCCP(2433619,):2025-03-13-18:01:24.865.153 [rs_socket.c:797]tid:2433619,rs_socket_listen_bind_listen(797) : IP(172.16.10.219) begin listen, fd:232 !
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.173 [network_manager.cc:640][2433619]HostNet, ip[172.16.10.219%eth0] socket init OK
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.177 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.186 [hccl_socket.cc:454][2433619][Get][NicHandleInfo]phyId[0], nicSocketMap[1] localIp[[172.16.10.219%eth0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.194 [topoinfo_detect.cc:446][2433619]topo info exchange server start with host ip[172.16.10.219%eth0] and port[60004]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.273 [topoinfo_detect.cc:151][2433619]setup topo exchange server complete, identifier[172.16.10.219%eth0_60004_4_1741860084864228]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.282 [op_base.cc:82][2433619][HcclGetDeviceId] deviceLogicId[0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.287 [op_base.cc:97][2433619][GetHcclOpInfoCtx] Set device, use g_hcclDeviceId[0]
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.337 [runtime.cc:5559] 2434778 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=1 curCtx=0x0 ContextMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.348 [runtime.cc:5586] 2434778 ChgUserDevIdToDeviceId: devId=0, realDevId=0
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.363 [api_impl.cc:2473] 2434778 SetDevice: device_id=0, deviceMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.369 [device.cc:306] 2434778 Init: device mode has been already set, mode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.375 [runtime.cc:3774] 2434778 PrimaryContextRetain: Context inc ref, ts_id=0, count=0x3
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.383 [runtime.cc:3943] 2434778 MsProfNotifyWhenSetDevice: MsprofNotifySetDevice success, devId=0
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.389 [api_impl.cc:2461] 2434778 NewDevice: SetCurRef = 0x4bb730f0,userDeviceId=0,deviceId=0
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.398 [api_impl.cc:2466] 2434778 NewDevice: New device success, user_device_id=0, die_num=1.
[INFO] ATRACE(2433619,):2025-03-13-18:01:24.865.412 awatchdog_core.c:175 create watchdog for id : 65543, tid : 2434778, timeout : 300s
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.418 [api_impl.cc:2499] 2434778 SetDevice: SetDevice success, curCtx=0x4bb730f0, user_device_id=0.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.454 [hccl_socket.cc:286][2434778][Accept]localIp[172.16.10.219%eth0], remoteIp[0.0.0.0], socket_handle[0x5deabc20], tag[topo_detect_default_tag_172.16.10.219%eth0_60004_4_1741860084864228_60004]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.460 [network_manager.cc:65][2434778]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.470 [hccl_socket.cc:454][2434778][Get][NicHandleInfo]phyId[0], nicSocketMap[1] localIp[[172.16.10.219%eth0]
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.498 [runtime.cc:5559] 2433619 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=0 curCtx=0x4bb730f0 ContextMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:24.865.504 [runtime.cc:5601] 2433619 GetUserDevIdByDeviceId: userDevId=0, deviceId=0
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.510 [adapter_rts.cc:197][2433619][hrtGetDeviceRefresh]deviceLogicId[0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.522 [op_base.cc:82][2433619][HcclGetDeviceId] deviceLogicId[0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.545 [rank_consistent.cc:29][2433619][GetInstance] get deviceLogicId[0]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.653 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.659 [network_manager.cc:96][2433619]NetworkManager: init nic, nicPosition[0] ref[2], skip.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.667 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.672 [network_manager.cc:1148][2433619]ip[172.16.10.219%eth0] socket start success socketHandle[0x5deabc20]
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.865.764 [dlhal_function.cc:112][2433619]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.865.790 [dlhal_function.cc:112][2433619]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.799 [adapter_hal.cc:289][2433619]Entry-hrtDrvGetPlatformInfo
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.865.823 [dlhal_function.cc:112][2433619]dlHalBindCgroup is nullptr, can not use dlHalBindCgroup
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.834 [sal.cc:593][2433619]IsHostUseDevNic[1]
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.837 [network_manager.cc:65][2433619]NetworkManager::GetInstance deviceLogicID[0].
[INFO] HCCL(2433619,):2025-03-13-18:01:24.865.842 [network_manager.cc:108][2433619]NetworkManager: devicePhyId[4294967295], devicePhyId_[4] deviceLogicId_[0].
[INFO] TDT(2433619,):2025-03-13-18:01:24.865.919 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2433619] [TsdClient] change userDevId [0] to logicDevId [0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.865.926 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2433619] [TsdClient] change userDevId [0] to logicDevId [0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.865.935 [process_mode_manager.cpp:903][CheckNeedToOpen][tid:2433619] [TsdClient] get open para startCp[1] startHccp[1] (0:false 1:true)
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.199 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[..], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.289 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[scene.info], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.335 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[ascend_install.info], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.377 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[version.info], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.416 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[.], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.453 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[Ascend-aicpu_syskernels.tar.gz], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.545 [client_manager.cpp:263][CheckPackageExists][tid:2433619] [TsdClient] find aicpu package[Ascend-aicpu_syskernels.tar.gz] in path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.551 [client_manager.cpp:260][CheckPackageExists][tid:2433619] [TsdClient] check file[Ascend-aicpu_extend_syskernels.tar.gz], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.655 [client_manager.cpp:267][CheckPackageExists][tid:2433619] [TsdClient] find extend package[Ascend-aicpu_extend_syskernels.tar.gz] in path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], deviceId[0]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.672 [client_manager.cpp:282][CheckPackageExists][tid:2433619] [TsdClient][deviceId=0] set extend package path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], name[Ascend-aicpu_extend_syskernels.tar.gz]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.676 [client_manager.cpp:285][CheckPackageExists][tid:2433619] [TsdClient][deviceId=0] set aicpu package path[/usr/local/Ascend/ascend-toolkit/latest/opp/Ascend/aicpu/], name[Ascend-aicpu_syskernels.tar.gz]
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.735 [process_mode_manager.cpp:528][InitTsdClient][tid:2433619] [TsdClient] tsd client has been inited before
[INFO] TDT(2433619,):2025-03-13-18:01:24.866.821 [hdc_client.cpp:154][TsdRecvData][tid:2433619] recv timeout is:150000 ms
[INFO] RUNTIME(2433619,):2025-03-13-18:01:25.006.731 [runtime.cc:5559] 2434300 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=1 curCtx=0x0 ContextMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:25.006.755 [runtime.cc:5601] 2434300 GetUserDevIdByDeviceId: userDevId=0, deviceId=0
[INFO] TDT(2433619,):2025-03-13-18:01:25.006.763 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2434300] [TsdClient] change userDevId [0] to logicDevId [0]
[INFO] TDT(2433619,):2025-03-13-18:01:25.006.771 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2434300] [TsdClient] change userDevId [0] to logicDevId [0]
[INFO] TDT(2433619,):2025-03-13-18:01:25.006.775 [client_manager.cpp:639][ChangeUserDeviceIdToLogicDeviceId][tid:2434300] [TsdClient] change userDevId [0] to logicDevId [0]
[INFO] RUNTIME(2433619,):2025-03-13-18:01:27.103.489 [runtime.cc:3469] 2434300 GetHdcConctStatus: tsdGetHdcConctStatus success, userDeviceId=0, devId=0, tdt status=0
[INFO] TDT(2433619,):2025-03-13-18:01:27.103.487 [process_mode_manager.cpp:726][ServerToClientMsgProc][tid:2433619] [ServerToClientMsgProc] [sessionID=1]
[INFO] TDT(2433619,):2025-03-13-18:01:27.103.553 [process_mode_manager.cpp:804][DeviceMsgProcess][tid:2433619] [TsdClient] DeviceMsgProc recvMsg realDeviceId[0] msgType[8] localDevId[0] rspCode[0]heterogeneousSubPid[0], tsdSupportLevel_[0]
[INFO] TDT(2433619,):2025-03-13-18:01:27.103.559 [hdc_client.cpp:172][TsdRecvData][tid:2433619] [deviceId=0] the Tsd recv data already
[INFO] DRV(2433619,):2025-03-13-18:01:27.103.570 [ascend][curpid: 2433619, 2433619][drv][devmm][devmm_setup_device 77]DrvMemDeviceOpen. (devid=0)
[INFO] DRV(2433619,):2025-03-13-18:01:27.103.606 [ascend][curpid: 2433619, 2433619][drv][devmm][devmm_set_device_info 67]Device info. (devid=0; dvpp_size=17179869184; support_bar_mem=1; support_dev_read_only=1; support_dev_mem_map_host=1)
[INFO] HCCL(2433619,):2025-03-13-18:01:27.103.693 [adapter_tdt.cc:23][2433619]Open TsdClient success. deviceLogicId[0], rankSize[2]
[INFO] HCCL(2433619,):2025-03-13-18:01:27.103.700 [network_manager.cc:133][2433619][Init]NetworkManager open tsd success, devicePhyId[4], deviceLogicId_[0]
[INFO] ROCE(2433619,):2025-03-13-18:01:27.103.851 [dl_hal_function.c:218]dl_hal_init(218) : dl_hal_init success, no need to dlopen libascend_hal.so!
[INFO] ROCE(2433619,):2025-03-13-18:01:27.103.862 [dl_hal_function.c:218]dl_hal_init(218) : dl_hal_init success, no need to dlopen libascend_hal.so!
[INFO] DRV(2433619,):2025-03-13-18:01:27.104.054 [ascend][curpid: 2433619, 2433619][drv][hdc][drvHdcSetSessionReference 1809]Get pid number. (session=6214; pid=2433619)
[INFO] HCCL(2433619,):2025-03-13-18:01:27.107.970 [adapter_hccp.cc:583][2433619]init ra success.
[INFO] HCCL(2433619,):2025-03-13-18:01:27.107.978 [network_manager.cc:170][2433619]NetworkManager nicDeploy[1] deviceLogicId_[0] devicePhyId_[4] init ra OK, nicSocketMap size[1], isHostUseDevNic[1]
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.000 [topoinfo_detect.cc:490][2433619]NetworkManager start host net success! ip[172.16.10.219%eth0]
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.015 [adapter_hccp.cc:1637][2433619]hrtRaGetInterfaceVersion phyId[4], opCode[33], version[0]
[WARNING] HCCL(2433619,):2025-03-13-18:01:27.108.019 [adapter_hccp.cc:1485][2433619]this package does not support hrtGetIfNum for device, please change new package
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.447 [hccl_ip_address.cc:76][2433619]if name is null.
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.452 [adapter_hccp.cc:1551][2433619]call trace: hcclRet -> 1
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.458 [topoinfo_detect.cc:561][2433619]call trace: hcclRet -> 1
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.463 [topoinfo_detect.cc:259][2433619][Setup][Agent]topo detect generate local rank info failed! rank[0]
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.469 [op_base.cc:841][2433619][Init][CommRootInfo]errNo[0x0000000005000001] setup topo detect error
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.477 [topoinfo_detect.cc:195][2433619]TopoInfoDetect TeardownServer ok, identifier[p��]].
[INFO] HCCP(2433619,):2025-03-13-18:01:27.108.490 [ra_peer.c:959]tid:2433619,ra_peer_get_ifnum(959) : [get][ra_peer_ifnum]ra_peer_get_ifnum phy_id[0] start
[ERROR] HCCP(2433619,):2025-03-13-18:01:27.108.495 [rs_socket.c:2744]tid:2433619,rs_peer_get_ifnum(2744) : param error, g_rs_cb is NULL
[ERROR] HCCP(2433619,):2025-03-13-18:01:27.108.499 [ra_peer.c:964]tid:2433619,ra_peer_get_ifnum(964) : [get][ra_peer_ifnum]rs_peer_get_ifnum failed(-22)
[INFO] HCCP(2433619,):2025-03-13-18:01:27.108.502 [ra_peer.c:967]tid:2433619,ra_peer_get_ifnum(967) : [get][ra_peer_ifnum]ra_peer_get_ifnum phy_id[0] succ
[ERROR] HCCP(2433619,):2025-03-13-18:01:27.108.505 [ra_host.c:1930]tid:2433619,ra_get_ifnum(1930) : [get][ra_ifnum]ra_peer_get_ifnum failed, ret(-22)
[INFO] HCCP(2433619,):2025-03-13-18:01:27.108.508 [ra_host.c:2016]tid:2433619,conver_return_code(2016) : conver_return_code: orig_errcode[-22] curr_errcode[128303]
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.513 [adapter_hccp.cc:1492][2433619][Get][IfNum]errNo[0x000000000500000b] ra get if num fail. ret[128303], num[0]
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.517 [adapter_hccp.cc:1355][2433619]call trace: hcclRet -> 11
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.521 [sal.cc:346][2433619]call trace: hcclRet -> 11
[WARNING] HCCL(2433619,):2025-03-13-18:01:27.108.525 [sal.cc:507][2433619][Get][ServerId]GetLocalHostIP Failed, Use INVALID value
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.530 [op_base.cc:922][2433619][InitCommRootInfo]Init failed, return[0x0000000005000001], rankNum[2], rank[0], rootInfo identifier[172.16.10.219%eth0_60004_4_1741860084864228], server[0.0.0.0], logicDevId[-1]
[INFO] RUNTIME(2433619,):2025-03-13-18:01:27.108.547 [runtime.cc:5559] 2433619 isNeedChangeDeviceId: userDevId=0, isSetVisibleDev=1 isDeviceSetResetOp=0 curCtx=0x4bb730f0 ContextMode=0.
[INFO] RUNTIME(2433619,):2025-03-13-18:01:27.108.552 [runtime.cc:5601] 2433619 GetUserDevIdByDeviceId: userDevId=0, deviceId=0
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.557 [adapter_rts.cc:197][2433619][hrtGetDeviceRefresh]deviceLogicId[0]
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.568 [hccl_comm.cc:992][2433619]errNo[0x0000000005000002] ptr [communicator_] is nullptr, return HCCL_E_PTR
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.574 [op_base.cc:82][2433619][HcclGetDeviceId] deviceLogicId[0]
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.589 [hcom_plugin.cc:381][2433619][Unset][GroupTopoInfo]group[172.16.10.219%eth0_60004_4_1741860084864228].
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.611 [op_base.cc:1816][2433619][HcclCommDestroy] comm is not exist, comm=0x5dead820, group=172.16.10.219%eth0_60004_4_1741860084864228, deviceLogicId=0
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.648 [hccl_comm.cc:1203][2433619]UnRegistTaskAbortHandler begin
[INFO] HCCL(2433619,):2025-03-13-18:01:27.108.660 [task_abort_handler.cc:140][2433619]TaskAbortHandler::DeInit commVector size is [0], ref_ count is [0]
[ERROR] HCCL(2433619,):2025-03-13-18:01:27.108.666 [op_base.cc:974][2433619][Init][CommRootInfoConfig]errNo[0x0000000005000001]HcclCommInitRootInfo failed.
[INFO] ASCENDCL(2433619,):2025-03-13-18:01:27.108.692 [context.cpp:194]2433619 aclrtPeekAtLastError: start to execute aclrtPeekAtLastError, level is 0
[INFO] ASCENDCL(2433619,):2025-03-13-18:01:27.108.702 [context.cpp:194]2433619 aclrtPeekAtLastError: start to execute aclrtPeekAtLastError, level is 0

[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.767 [plugin_manager.cc:25][2433619]hcom ops plugin init start.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.781 [plugin_manager.cc:38][2433619]input BUILD_MODE is not set, result autoTune is false.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.819 [hcom_plugin.cc:50][2433619]Initialize start.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.823 [hcom_plugin.cc:52][2433619]initializeCount_ : 1
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.870 [hcom_plugin.cc:60][2433619]hvdOpsKernelInfoStoreInfoPtr_ is initialized
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.891 [hcom_plugin.cc:63][2433619]hvdGraphOptimizerPtr_ is initialized
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.896 [hcom_plugin.cc:329][2433619]ParserHcclAlgoDesc: there is no key named "HCCL_algorithm" in the options.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.903 [hcom_plugin.cc:281][2433619]ParserHcclExecTimeOut: there is no key named "HCCL_execTimeOut" in the options.
[WARNING] HCCL(2433619,):2025-03-13-18:01:24.140.918 [hcom_plugin.cc:217][2433619]option profiling mode is not found.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.925 [hcom_plugin.cc:164][2433619][masterInfo]get Master Info:ip or port or deviceId not set.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.930 [hcom_plugin.cc:180][2433619][masterInfo]get Rank Info: worker size not set.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.140.934 [hcom_plugin.cc:244][2433619]InitializePlugin: Init without ranktable and masterInfo, please check your mode.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.141.088 [hcom_plugin.cc:82][2433619]hccl ops plugin init success.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.141.095 [plugin_manager.cc:48][2433619]hcom ops plugin init end. ret[0]
[INFO] GE(2433619,):2025-03-13-18:01:24.141.102 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [Initialize] in [libhcom_graph_adaptor.so] is [341] micro second.
[INFO] AICPU(2433619,):2025-03-13-18:01:24.141.341 [util.cc:84][GetSoPath][tid:2433619]:"So file path=/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/lib64/plugin/opskernel/libhost_cpu_engine.so"
[INFO] AICPU(2433619,):2025-03-13-18:01:24.141.437 [util.cc:100][GetSoPath][tid:2433619]:"Real config File path is /usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/lib64/plugin/opskernel/"
[INFO] AICPU(2433619,):2025-03-13-18:01:24.141.443 [base_engine.cc:30][LoadConfigFile][tid:2433619]:"Configuration file path is /usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/lib64/plugin/opskernel/config/init.conf."
[INFO] GE(2433619,):2025-03-13-18:01:24.141.457 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [Initialize] in [libhost_cpu_engine.so] is [346] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.541 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [Initialize] in [librts_engine.so] is [75] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.605 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [GetOpsKernelInfoStores] in [libaicpu_ascend_engine.so] is [48] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.638 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [GetOpsKernelInfoStores] in [libaicpu_tf_engine.so] is [24] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.669 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [GetOpsKernelInfoStores] in [libdvpp_engine.so] is [23] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.714 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [GetOpsKernelInfoStores] in [libfe.so] is [36] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.724 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [GetOpsKernelInfoStores] in [libffts.so] is [4] micro second.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.743 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of InvokeAll [GetOpsKernelInfoStores] in [libge_local_engine.so] is [12] micro second.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.141.754 [hcom_plugin.cc:126][2433619]get hccl kernel info store start.
[INFO] HCCL(2433619,):2025-03-13-18:01:24.141.776 [hcom_plugin.cc:132][2433619]get hccl kernel info store finished.
[INFO] GE(2433619,):2025-03-13-18:01:24.141.784 [dnnengine_manager.cc:41]2433619 ~InvokeFuncPerfRecorder:[GEPERFTRACE] The time cost of Invo

该问题一般是cann包和hdk包不兼容导致的,请升级hdk到和cann兼容的版本
https://www.hiascend.com/document/detail/zh/canncommercial/800/quickstart/quickstart/releasenote_0000.html

huangyunlong changed issue state from TODO to WIP 3 months ago

我有HDK 23.0.X和CANN 8.0.0,似乎两个版本是兼容的。你知道还有什么可能导致这个错误吗?

ASCEND TOOLKIT
version: 1.0
runtime_running_version=[7.6.0.1.220:8.0.0]
compiler_running_version=[7.6.0.1.220:8.0.0]
hccl_running_version=[7.6.0.1.220:8.0.0]
opp_running_version=[7.6.0.1.220:8.0.0]
toolkit_running_version=[7.6.0.1.220:8.0.0]
aoe_running_version=[7.6.0.1.220:8.0.0]
ncs_running_version=[7.6.0.1.220:8.0.0]
opp_kernel_running_version=[7.6.0.1.220:8.0.0]
hccl_upgrade_version=[7.6.0.1.220:8.0.0]
ncs_upgrade_version=[7.6.0.1.220:8.0.0]
runtime_upgrade_version=[7.6.0.1.220:8.0.0]
compiler_upgrade_version=[7.6.0.1.220:8.0.0]
opp_upgrade_version=[7.6.0.1.220:8.0.0]
toolkit_upgrade_version=[7.6.0.1.220:8.0.0]
aoe_upgrade_version=[7.6.0.1.220:8.0.0]
opp_kernel_upgrade_version=[7.6.0.1.220:8.0.0]

Firmware
Version=23.0.rc2.2
ascendhal_version=7.25.6
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C30SPC002B220
compatible_version=[V100R001C30]
compatible_version_fw=[6.4.0,6.4.99]
package_version=23.0.rc2.2

这里看到版本是23.0.rc2.2,不属于兼容性中的23.0.x,具体的请咨询兼容性提供方
这里建议升级到兼容版本

我在文档看到这个测试。https://www.hiascend.com/document/detail/zh/canncommercial/800/developmentguide/hccl/hcclug/hcclug_000051.html
这个结果是什么意思?

for i in {0..7}; do hccn_tool -i $i -process -g ; done
Device ID is invalid, device [0] is not in the device list!
Command execute failed!
Device ID is invalid, device [1] is not in the device list!
Command execute failed!
Device ID is invalid, device [2] is not in the device list!
Command execute failed!
Device ID is invalid, device [3] is not in the device list!
Command execute failed!
device hccp process: not exist
device hccp process: not exist
device hccp process: not exist
device hccp process: not exist

看了下面的Npu-smi信息 这边0~3卡是否存在掉卡的的情况? 4~7卡是正常结果 卡上现在没有HCCP的进程 查询结果是正常的

huangyunlong changed issue state from WIP to DONE 3 months ago

Sign in to comment

Status
Assignees
Projects
Milestones
Pull Requests
Successfully merging a pull request will close this issue.
Branches
Priority
Duration (hours)
Planed to start   -   Planed to end
-
Top level
参与者(3)
Ethan Kallett-ethan-kallett huangyunlong-huangyunlong2022 adaixxl-adaixxl
Python
1
https://gitee.com/ascend/pytorch.git
git@gitee.com:ascend/pytorch.git
ascend
pytorch
pytorch

Search