Ascend / cann-hccl
[Question]: mindie fails to start in a vNPU environment with HcclGetRootInfo fail, error:19, rank:0
DONE
#ICGSLC
phona
Created on 2025-06-21 11:10
### Problem description

mindie version information:

- Ascend-mindie : 2.0.RC1
- Ascend-mindie-service Version : 2.0.RC1
- Platform : aarch64

I could not find the hccl build version inside the mindie image; judging from the official documentation, it is probably v1.1-8.1.RC1.alpha002.

Following the official documentation [Using vNPU with Ascend Docker](https://www.hiascend.com/document/detail/zh/computepoweralloca/300/cpaug/cpaug/cpaug_00013.html), I selected the device with the ASCEND_VISIBLE_DEVICES environment variable and then specified a template with the ASCEND_VNPU_SPECS environment variable to partition the compute resources.

The vNPU was partitioned successfully, but mindie reports an error after startup:

```
2025-06-21 10:16:17.123 8895 LLM log default format: [yyyy-mm-dd hh:mm:ss.uuuuuu] [processid] [threadid] [llm] [loglevel] [file:line] [status code] msg
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [post_processing_manager.cpp:161] Get post processing manager
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281460375875936
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450137973088
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450129518944
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450121064800
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450112610656
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450104156512
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450095702368
[2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450012209504
[2025-06-21 10:16:17,126] [8688] [281461549560160] [llm] [INFO] [generator.py-546] : warmup params: max_prefill_tokens=10000, max_seq_len=9900, max_input_len=8400, max_iter_times=512
[2025-06-21 10:16:17,130] [8688] [281461549560160] [llm] [INFO] [generator.py-422] : `Prefill blocks` during warmup needs npu memory(GB): 0.0068359375
[2025-06-21 10:16:17.545038] [info] [8895] [share_memory.cpp:32] ATB_SHARE_MEMORY_NAME_SUFFIX is validate, value:
[2025-06-21 10:16:17.545414] [info] [8895] [share_memory.cpp:40] create share memory begin, fullName:hcclShareMem
[2025-06-21 10:16:17.545561] [info] [8895] [share_memory.cpp:84] key: 1859736976 shmid: 196608
[2025-06-21 10:16:17.545620] [info] [8895] [share_memory.cpp:44] create share memory success
[2025-06-21 10:16:17.545640] [info] [8895] [comm.cpp:125] create share memory success, rank:0
[2025-06-21 10:16:17.545660] [info] [8895] [comm.cpp:127] rankRoot:0
[2025-06-21 10:16:17.549221] [error] [8895] [comm.cpp:130] HcclGetRootInfo fail, error:19, rank:0
[2025-06-21 10:16:17.021+0800] [8688] [281470681739616] [batchscheduler] [ERROR] [model.py:59] : [Model] >>> Exception:External Comm Manager: Create the hccl communication group failed. export ASDOPS_LOG_LEVEL=ERROR, export ASDOPS_LOG_TO_STDOUT=1 to see more details. Default log path is $HOME/atb/log.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/model_wrapper/model.py", line 57, in initialize
    return self.python_model.initialize(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/model_wrapper/standard_model.py", line 133, in initialize
    self.generator = Generator(
                     ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 233, in __init__
    self.cache_manager = self.warm_up(
                         ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 436, in warm_up
    raise e
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 429, in warm_up
    self._generate_inputs_warm_up_backend(cache_manager, input_metadata, inference_mode, dummy=True)
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 532, in _generate_inputs_warm_up_backend
    self.generator_backend._warm_up(model_inputs, inference_mode=inference_mode)
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 519, in _warm_up
    super()._warm_up(model_inputs)
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_backend.py", line 219, in _warm_up
    _ = self.forward(model_inputs, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/time_decorator.py", line 69, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 209, in forward
    logits = self._forward(model_inputs, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 539, in _forward
    logits = self.model_wrapper.forward(model_inputs, self.cache_pool.npu_cache, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/modeling/model_wrapper/atb/atb_model_wrapper.py", line 164, in forward
    result = self.forward_tensor(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/mindie_llm/modeling/model_wrapper/atb/atb_model_wrapper.py", line 204, in forward_tensor
    result = self.model_runner.forward(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 297, in forward
    res = self.model.forward(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/models/base/flash_causal_lm.py", line 493, in forward
    self.init_ascend_weight()
  File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2/flash_causal_qwen2.py", line 282, in init_ascend_weight
    self.acl_encoder_operation.set_param(json.dumps({**encoder_param}))
RuntimeError: External Comm Manager: Create the hccl communication group failed. export ASDOPS_LOG_LEVEL=ERROR, export ASDOPS_LOG_TO_STDOUT=1 to see more details. Default log path is $HOME/atb/log.
[2025-06-21 10:16:17.021+0800] [8688] [281470681739616] [batchscheduler] [ERROR] [model.py:62] : [MIE04E13030A] [Model] >>> return initialize error result: {'status': 'error', 'npuBlockNum': '0', 'cpuBlockNum': '0', 'memPoolId': '-1'}
[2025-06-21 10:16:17.624+08:00] [8484] [8486] [server] [WARN] [llm_daemon.cpp:74] : [MIE04W01011A] [daemon] Received exit signal[17]
[2025-06-21 10:16:17.625+08:00] [8484] [8486] [server] [ERROR] [llm_daemon.cpp:90] : [MIE04E010109] [daemon] Process 8688 was terminated by signal 9 (Killed)
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

Logs printed by hccl:

```
/root/ascend/log/run/plog/plog-5994_20250621095856036.log:[INFO] HCCL(5994,mindie_llm_backend_connector):2025-06-21-09:59:00.337.886 [adapter_rts.cc:2720] [6002][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/run/plog/plog-5994_20250621095856036.log:[INFO] GE(5994,mindie_llm_backend_connector):2025-06-21-09:59:04.625.045 [dnnengine_manager.cc:128]6201 Initialize:[GEPERFTRACE] The time cost of DNNEngineManager::Initialize[DNN_HCCL] is [0] micro seconds.
/root/ascend/log/run/plog/plog-5994_20250621095856036.log:[INFO] HCCL(5994,mindie_llm_backend_connector):2025-06-21-09:59:09.007.330 [plugin_manager.cc:42] [6201]hcom running normal mode.
/root/ascend/log/run/plog/plog-8484_20250621101522944.log:[INFO] HCCL(8484,mindieservice_daemon):2025-06-21-10:15:23.011.983 [adapter_rts.cc:2720] [8486][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/run/plog/plog-5790_20250621095854861.log:[INFO] HCCL(5790,mindieservice_daemon):2025-06-21-09:58:54.924.989 [adapter_rts.cc:2720] [5792][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/run/plog/plog-8688_20250621101524105.log:[INFO] HCCL(8688,mindie_llm_backend_connector):2025-06-21-10:15:28.355.674 [adapter_rts.cc:2720] [8696][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/run/plog/plog-8688_20250621101524105.log:[INFO] GE(8688,mindie_llm_backend_connector):2025-06-21-10:15:32.746.522 [dnnengine_manager.cc:128]8895 Initialize:[GEPERFTRACE] The time cost of DNNEngineManager::Initialize[DNN_HCCL] is [0] micro seconds.
/root/ascend/log/run/plog/plog-8688_20250621101524105.log:[INFO] HCCL(8688,mindie_llm_backend_connector):2025-06-21-10:15:37.029.276 [plugin_manager.cc:42] [8895]hcom running normal mode.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:01.747.552 [adapter_rts.cc:2720] [7350][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] GE(7342,mindie_llm_backend_connector):2025-06-21-10:12:06.045.454 [dnnengine_manager.cc:128]7549 Initialize:[GEPERFTRACE] The time cost of DNNEngineManager::Initialize[DNN_HCCL] is [0] micro seconds.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:10.415.547 [plugin_manager.cc:42] [7549]hcom running normal mode.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.694.946 [op_base.cc:803] [7549]Entry-HcclGetRootInfo:rootInfo[0xfffc7a8397f8], deviceLogicId[0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.007 [externalinput.cc:493] [7549]HCCL_CONNECT_TIMEOUT set by default to [120]s
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.012 [externalinput.cc:430] [7549]HCCL_EXEC_TIMEOUT set by default to [1836]s
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.018 [externalinput.cc:544] [7549]HCCL_INTRA_PCIE_ENABLE set by default to [1], HCCL_INTRA_ROCE_ENABLE set by default to [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.025 [externalinput.cc:718] [7549]HCCL_WHITELIST_DISABLE set by default to [1]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.038 [externalinput.cc:820] [7549]HCCL_IF_IP is set to [127.0.0.1], ip[127.0.0.1].
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.043 [externalinput.cc:878] [7549]HCCL_SOCKET_IFNAME set by default to [EmptyString]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.047 [externalinput.cc:845] [7549]HCCL_SOCKET_FAMILY is not set and is used by default [AF_INET]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.053 [externalinput.cc:807] [7549]HCCL_IF_BASE_PORT set by default to [60000]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.056 [externalinput.cc:1509] [7549]HCCL_ALGO is not set
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.066 [externalinput.cc:1520] [7549]HCCL_RDMA_TC set by default to [132]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.070 [externalinput.cc:1555] [7549]HCCL_RDMA_SL set by default to [4]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.234 [externalinput.cc:1600] [7549]HCCL_RDMA_TIMEOUT set by default to [20]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.247 [externalinput.cc:1634] [7549]HCCL_RDMA_RETRY_CNT set by default to [7]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.253 [externalinput.cc:1800] [7549]HCCL_BUFFSIZE set by environment to [120]M
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.258 [externalinput.cc:1842] [7549]HCCL_DIAGNOSE_ENABLE set by default to [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.263 [externalinput.cc:1942] [7549]HCCL_ENTRY_LOG_ENABLE set by environment to [1]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.277 [externalinput.cc:2006] [7549]HCCL_OP_EXPANSION_MODE is not set, aicpuUnfold is [0], aivMode is [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.287 [externalinput.cc:2179] [7549][ParseRetryEnable] HCCL_OP_RETRY_ENABLE is not set. The retryEnable of all levels is set to false.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.293 [externalinput.cc:2241] [7549]HCCL_OP_COUNTER_ENABLE set by default to [1]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.296 [externalinput.cc:2309] [7549]HCCL_OP_RETRY_PARAMS is not set, default value MaxCnt is 1, HoldTime is 5000, IntervalTime is 1000
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.300 [externalinput.cc:2340] [7549]HCCL_LOGIC_SUPERPOD_ID is not set, default value[]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.304 [externalinput.cc:2297] [7549]HCCL_STUCK_DETECT_TIME is set default [612]s.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.308 [externalinput.cc:408] [7549]HCCL_RDMA_PCIE_DIRECT_POST_NOSTRICT set by default to [EmptyString], rdmaFastPost is [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[WARNING] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.376 [env_config.cc:127] [7549]HCCL_HOST_SOCKET_PORT_RANGE is not set. Multi-process will not be supported!
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[WARNING] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.381 [env_config.cc:131] [7549]HCCL_NPU_SOCKET_PORT_RANGE is not set. Multi-process will not be supported!
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.386 [env_config.cc:321] [7549]HCCL_RDMA_TC set by default to [132]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.390 [env_config.cc:321] [7549]HCCL_RDMA_SL set by default to [4]
/root/ascend/log/run/plog/plog-7138_20250621101156430.log:[INFO] HCCL(7138,mindieservice_daemon):2025-06-21-10:11:56.496.386 [adapter_rts.cc:2720] [7140][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.493 [adapter_hccp.cc:579] [7549][Init][Ra]errNo[0x0000000005000013] ra init fail.ret[128003] phy_id[212] nic_position[0] hdc_type[6]
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.583 [network_manager.cc:218] [7549][NetworkManager][Init]errNo[0x0000000005000013] ra init failed,return[19] devicePhyId_[212], nic_position[0]
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.588 [hccl_network.cc:66] [7549][HcclNetInit]call trace: hcclRet -> 19
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.592 [topoinfo_detect.cc:145] [7549][SetupServer]call trace: hcclRet -> 19
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.597 [op_base.cc:812] [7549][HcclGetRootInfo]call trace: hcclRet -> 19
```
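Since the hccl output above mixes dozens of INFO lines with the few ERROR lines that matter, a small filter can help locate where the failure first appears. The following is a minimal sketch, not an official tool: the function names `parse_plog_line` and `first_hccl_error` are hypothetical, and the regular expressions are inferred only from the grep-style lines quoted in this issue, so other CANN versions may format plog lines differently. It also extracts the `errNo[...]` value; in this log the low byte of `0x0000000005000013` is 0x13 = 19, which happens to match the `return[19]` / `hcclRet -> 19` values, but treating that as a general rule is an assumption.

```python
import re

# Layout assumed from the lines in this issue:
#   <path>:[LEVEL] MODULE(pid,process):timestamp [file:line] message
LINE_RE = re.compile(
    r"\[(?P<level>[A-Z]+)\]\s+(?P<module>\w+)\((?P<pid>\d+),(?P<proc>[^)]+)\):"
    r"(?P<ts>\S+)\s+\[(?P<src>[^\]]+)\]\s*(?P<msg>.*)$"
)
ERRNO_RE = re.compile(r"errNo\[(0x[0-9a-fA-F]+)\]")
RET_RE = re.compile(r"return\[(\d+)\]|hcclRet -> (\d+)")


def parse_plog_line(line):
    """Parse one plog line into a dict, or return None if it does not match."""
    m = LINE_RE.search(line.rstrip("\n"))
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    err = ERRNO_RE.search(rec["msg"])
    rec["errno"] = int(err.group(1), 16) if err else None
    ret = RET_RE.search(rec["msg"])
    rec["ret"] = int(ret.group(1) or ret.group(2)) if ret else None
    return rec


def first_hccl_error(lines):
    """Return the first ERROR record emitted by the HCCL module, or None."""
    for line in lines:
        rec = parse_plog_line(line)
        if rec and rec["level"] == "ERROR" and rec["module"] == "HCCL":
            return rec
    return None
```

Applied to the log in this issue, the first HCCL error is the `ra init fail` line from `adapter_hccp.cc:579`, which is reported before `HcclNetInit`, `SetupServer`, and `HcclGetRootInfo` propagate return value 19 upward.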
### 问题描述 mindie版本信息 Ascend-mindie : 2.0.RC1 Ascend-mindie-service Version : 2.0.RC1 Platform : aarch64 mindie镜像里没找到hccl编译的版本信息,通过官方文档猜测可能版本是这个v1.1-8.1.RC1.alpha002 我根据官方文档[Ascend Docker使用vNPU](https://www.hiascend.com/document/detail/zh/computepoweralloca/300/cpaug/cpaug/cpaug_00013.html),使用ASCEND_VISIBLE_DEVICES环境变量指定设备后,通过ASCEND_VNPU_SPECS环境变量指定模板进行算力切分。  vnpu划分成功,但是mindie启动后报错 ``` 2025-06-21 10:16:17.123 8895 LLM log default format: [yyyy-mm-dd hh:mm:ss.uuuuuu] [processid] [threadid] [llm] [loglevel] [file:line] [status code] msg [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [post_processing_manager.cpp:161] Get post processing manager [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281460375875936 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450137973088 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450129518944 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450121064800 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450112610656 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450104156512 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450095702368 [2025-06-21 10:16:17.123] [8688] [281461549560160] [llm] [INFO] [thread_pool.cpp:51] Create pthread id: 281450012209504 [2025-06-21 10:16:17,126] [8688] [281461549560160] [llm] [INFO] [generator.py-546] : warmup params: max_prefill_tokens=10000, max_seq_len=9900, max_input_len=8400, max_iter_times=512 [2025-06-21 10:16:17,130] [8688] [281461549560160] [llm] [INFO] [generator.py-422] : `Prefill blocks` during warmup needs npu memory(GB): 0.0068359375 [2025-06-21 10:16:17.545038] [info] 
[8895] [share_memory.cpp:32] ATB_SHARE_MEMORY_NAME_SUFFIX is validate, value: [2025-06-21 10:16:17.545414] [info] [8895] [share_memory.cpp:40] create share memory begin, fullName:hcclShareMem [2025-06-21 10:16:17.545561] [info] [8895] [share_memory.cpp:84] key: 1859736976 shmid: 196608 [2025-06-21 10:16:17.545620] [info] [8895] [share_memory.cpp:44] create share memory success [2025-06-21 10:16:17.545640] [info] [8895] [comm.cpp:125] create share memory success, rank:0 [2025-06-21 10:16:17.545660] [info] [8895] [comm.cpp:127] rankRoot:0 [2025-06-21 10:16:17.549221] [error] [8895] [comm.cpp:130] HcclGetRootInfo fail, error:19, rank:0 [2025-06-21 10:16:17.021+0800] [8688] [281470681739616] [batchscheduler] [ERROR] [model.py:59] : [Model] >>> Exception:External Comm Manager: Create the hccl communication group failed. export ASDOPS_LOG_LEVEL=ERROR, export ASDOPS_LOG_TO_STDOUT=1 to see more details. Default log path is $HOME/atb/log. Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/model_wrapper/model.py", line 57, in initialize return self.python_model.initialize(config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/model_wrapper/standard_model.py", line 133, in initialize self.generator = Generator( ^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 233, in __init__ self.cache_manager = self.warm_up( ^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 436, in warm_up raise e File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 429, in warm_up self._generate_inputs_warm_up_backend(cache_manager, input_metadata, inference_mode, dummy=True) File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/generator.py", line 532, in _generate_inputs_warm_up_backend self.generator_backend._warm_up(model_inputs, inference_mode=inference_mode) File 
"/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 519, in _warm_up super()._warm_up(model_inputs) File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_backend.py", line 219, in _warm_up _ = self.forward(model_inputs, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/utils/decorators/time_decorator.py", line 69, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 209, in forward logits = self._forward(model_inputs, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/text_generator/adapter/generator_torch.py", line 539, in _forward logits = self.model_wrapper.forward(model_inputs, self.cache_pool.npu_cache, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/modeling/model_wrapper/atb/atb_model_wrapper.py", line 164, in forward result = self.forward_tensor( ^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/mindie_llm/modeling/model_wrapper/atb/atb_model_wrapper.py", line 204, in forward_tensor result = self.model_runner.forward( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 297, in forward res = self.model.forward(**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/Ascend/atb-models/atb_llm/models/base/flash_causal_lm.py", line 493, in forward self.init_ascend_weight() File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2/flash_causal_qwen2.py", line 282, in init_ascend_weight self.acl_encoder_operation.set_param(json.dumps({**encoder_param})) RuntimeError: External Comm Manager: Create the hccl communication group failed. export ASDOPS_LOG_LEVEL=ERROR, export ASDOPS_LOG_TO_STDOUT=1 to see more details. 
Default log path is $HOME/atb/log. [2025-06-21 10:16:17.021+0800] [8688] [281470681739616] [batchscheduler] [ERROR] [model.py:62] : [MIE04E13030A] [Model] >>> return initialize error result: {'status': 'error', 'npuBlockNum': '0', 'cpuBlockNum': '0', 'memPoolId': '-1'} [2025-06-21 10:16:17.624+08:00] [8484] [8486] [server] [WARN] [llm_daemon.cpp:74] : [MIE04W01011A] [daemon] Received exit signal[17] [2025-06-21 10:16:17.625+08:00] [8484] [8486] [server] [ERROR] [llm_daemon.cpp:90] : [MIE04E010109] [daemon] Process 8688 was terminated by signal 9 (Killed) [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared! /usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' ``` hccl打印的日志 ``` /root/ascend/log/run/plog/plog-5994_20250621095856036.log:[INFO] HCCL(5994,mindie_llm_backend_connector):2025-06-21-09:59:00.337.886 [adapter_rts.cc:2720] [6002][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1] /root/ascend/log/run/plog/plog-5994_20250621095856036.log:[INFO] GE(5994,mindie_llm_backend_connector):2025-06-21-09:59:04.625.045 [dnnengine_manager.cc:128]6201 Initialize:[GEPERFTRACE] The time cost of DNNEngineManager::Initialize[DNN_HCCL] is [0] micro seconds. 
/root/ascend/log/run/plog/plog-5994_20250621095856036.log:[INFO] HCCL(5994,mindie_llm_backend_connector):2025-06-21-09:59:09.007.330 [plugin_manager.cc:42] [6201]hcom running normal mode. /root/ascend/log/run/plog/plog-8484_20250621101522944.log:[INFO] HCCL(8484,mindieservice_daemon):2025-06-21-10:15:23.011.983 [adapter_rts.cc:2720] [8486][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1] /root/ascend/log/run/plog/plog-5790_20250621095854861.log:[INFO] HCCL(5790,mindieservice_daemon):2025-06-21-09:58:54.924.989 [adapter_rts.cc:2720] [5792][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1] /root/ascend/log/run/plog/plog-8688_20250621101524105.log:[INFO] HCCL(8688,mindie_llm_backend_connector):2025-06-21-10:15:28.355.674 [adapter_rts.cc:2720] [8696][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1] /root/ascend/log/run/plog/plog-8688_20250621101524105.log:[INFO] GE(8688,mindie_llm_backend_connector):2025-06-21-10:15:32.746.522 [dnnengine_manager.cc:128]8895 Initialize:[GEPERFTRACE] The time cost of DNNEngineManager::Initialize[DNN_HCCL] is [0] micro seconds. /root/ascend/log/run/plog/plog-8688_20250621101524105.log:[INFO] HCCL(8688,mindie_llm_backend_connector):2025-06-21-10:15:37.029.276 [plugin_manager.cc:42] [8895]hcom running normal mode. /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:01.747.552 [adapter_rts.cc:2720] [7350][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1] /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] GE(7342,mindie_llm_backend_connector):2025-06-21-10:12:06.045.454 [dnnengine_manager.cc:128]7549 Initialize:[GEPERFTRACE] The time cost of DNNEngineManager::Initialize[DNN_HCCL] is [0] micro seconds. 
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:10.415.547 [plugin_manager.cc:42] [7549]hcom running normal mode. /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.694.946 [op_base.cc:803] [7549]Entry-HcclGetRootInfo:rootInfo[0xfffc7a8397f8], deviceLogicId[0] /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.007 [externalinput.cc:493] [7549]HCCL_CONNECT_TIMEOUT set by default to [120]s /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.012 [externalinput.cc:430] [7549]HCCL_EXEC_TIMEOUT set by default to [1836]s /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.018 [externalinput.cc:544] [7549]HCCL_INTRA_PCIE_ENABLE set by default to [1], HCCL_INTRA_ROCE_ENABLE set by default to [0] /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.025 [externalinput.cc:718] [7549]HCCL_WHITELIST_DISABLE set by default to [1] /root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.038 [externalinput.cc:820] [7549]HCCL_IF_IP is set to [127.0.0.1], ip[127.0.0.1]. 
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.043 [externalinput.cc:878] [7549]HCCL_SOCKET_IFNAME set by default to [EmptyString]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.047 [externalinput.cc:845] [7549]HCCL_SOCKET_FAMILY is not set and is used by default [AF_INET]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.053 [externalinput.cc:807] [7549]HCCL_IF_BASE_PORT set by default to [60000]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.056 [externalinput.cc:1509] [7549]HCCL_ALGO is not set
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.066 [externalinput.cc:1520] [7549]HCCL_RDMA_TC set by default to [132]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.070 [externalinput.cc:1555] [7549]HCCL_RDMA_SL set by default to [4]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.234 [externalinput.cc:1600] [7549]HCCL_RDMA_TIMEOUT set by default to [20]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.247 [externalinput.cc:1634] [7549]HCCL_RDMA_RETRY_CNT set by default to [7]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.253 [externalinput.cc:1800] [7549]HCCL_BUFFSIZE set by environment to [120]M
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.258 [externalinput.cc:1842] [7549]HCCL_DIAGNOSE_ENABLE set by default to [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.263 [externalinput.cc:1942] [7549]HCCL_ENTRY_LOG_ENABLE set by environment to [1]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.277 [externalinput.cc:2006] [7549]HCCL_OP_EXPANSION_MODE is not set, aicpuUnfold is [0], aivMode is [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.287 [externalinput.cc:2179] [7549][ParseRetryEnable] HCCL_OP_RETRY_ENABLE is not set. The retryEnable of all levels is set to false.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.293 [externalinput.cc:2241] [7549]HCCL_OP_COUNTER_ENABLE set by default to [1]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.296 [externalinput.cc:2309] [7549]HCCL_OP_RETRY_PARAMS is not set, default value MaxCnt is 1, HoldTime is 5000, IntervalTime is 1000
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.300 [externalinput.cc:2340] [7549]HCCL_LOGIC_SUPERPOD_ID is not set, default value[]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.304 [externalinput.cc:2297] [7549]HCCL_STUCK_DETECT_TIME is set default [612]s.
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.308 [externalinput.cc:408] [7549]HCCL_RDMA_PCIE_DIRECT_POST_NOSTRICT set by default to [EmptyString], rdmaFastPost is [0]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[WARNING] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.376 [env_config.cc:127] [7549]HCCL_HOST_SOCKET_PORT_RANGE is not set. Multi-process will not be supported!
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[WARNING] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.381 [env_config.cc:131] [7549]HCCL_NPU_SOCKET_PORT_RANGE is not set. Multi-process will not be supported!
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.386 [env_config.cc:321] [7549]HCCL_RDMA_TC set by default to [132]
/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[INFO] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.390 [env_config.cc:321] [7549]HCCL_RDMA_SL set by default to [4]
/root/ascend/log/run/plog/plog-7138_20250621101156430.log:[INFO] HCCL(7138,mindieservice_daemon):2025-06-21-10:11:56.496.386 [adapter_rts.cc:2720] [7140][adapter_rts.cc][CallBackInitRts] g_deviceType [6] g_deviceLogicId [-1] g_devicePhyId [-1]
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.493 [adapter_hccp.cc:579] [7549][Init][Ra]errNo[0x0000000005000013] ra init fail.ret[128003] phy_id[212] nic_position[0] hdc_type[6]
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.583 [network_manager.cc:218] [7549][NetworkManager][Init]errNo[0x0000000005000013] ra init failed,return[19] devicePhyId_[212], nic_position[0]
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.588 [hccl_network.cc:66] [7549][HcclNetInit]call trace: hcclRet -> 19
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.592 [topoinfo_detect.cc:145] [7549][SetupServer]call trace: hcclRet -> 19
/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.597 [op_base.cc:812] [7549][HcclGetRootInfo]call trace: hcclRet -> 19
```
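Most of the dump above consists of INFO lines reporting default environment settings, so it can help to isolate the first ERROR entry when triaging. The following is a minimal sketch of a parser for these grep-style plog lines. The field layout is inferred only from the lines shown in this issue, not from any official plog specification, and the names `PLOG_LINE`, `parse_plog_line`, and `first_error` are hypothetical helpers introduced for illustration.

```python
import re

# Layout inferred from the grep output in this issue (an assumption, not a spec):
#   <path>:[LEVEL] MODULE(pid,process):timestamp [source.cc:line] [tid]message
PLOG_LINE = re.compile(
    r"^(?P<path>[^:]+):"
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"(?P<module>\w+)\((?P<pid>\d+),(?P<process>[^)]+)\):"
    r"(?P<timestamp>\S+)\s+"
    r"\[(?P<source>[^\]:]+):(?P<line>\d+)\]\s+"
    r"\[(?P<tid>\d+)\]"
    r"(?P<message>.*)$"
)


def parse_plog_line(line):
    """Return a dict of fields for one plog line, or None if it does not match."""
    match = PLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def first_error(lines):
    """Return the first parsed entry whose level is ERROR, or None."""
    for line in lines:
        entry = parse_plog_line(line)
        if entry and entry["level"] == "ERROR":
            return entry
    return None


# Three lines copied verbatim from the dump above.
sample = [
    "/root/ascend/log/run/plog/plog-7342_20250621101157613.log:[WARNING] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.695.376 [env_config.cc:127] [7549]HCCL_HOST_SOCKET_PORT_RANGE is not set. Multi-process will not be supported!",
    "/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.493 [adapter_hccp.cc:579] [7549][Init][Ra]errNo[0x0000000005000013] ra init fail.ret[128003] phy_id[212] nic_position[0] hdc_type[6]",
    "/root/ascend/log/debug/plog/plog-7342_20250621101248705.log:[ERROR] HCCL(7342,mindie_llm_backend_connector):2025-06-21-10:12:48.698.597 [op_base.cc:812] [7549][HcclGetRootInfo]call trace: hcclRet -> 19",
]

entry = first_error(sample)
print(entry["source"], entry["line"], entry["message"])
```

On the lines in this issue, the earliest ERROR is the `ra init fail` entry from `adapter_hccp.cc`. The later `call trace: hcclRet -> 19` lines appear to carry the same return code up through `HcclNetInit`, `SetupServer`, and `HcclGetRootInfo`.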