Pure-model inference of Qwen3_moe on 910A fails with RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. Does anyone know how to fix this?

1. Problem description (with error log context):
Running pure-model inference of Qwen3_moe on a 910A fails with the following error: RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 68719476736 bytes. Error code 12 (Cannot allocate memory). Does anyone know how to fix this?

2. Software versions:
-- CANN version: 8.1.T18
-- PyTorch version: 2.1.0
-- Python version (e.g., Python 3.7.5): 3.11.6
-- OS version (e.g., Ubuntu 18.04): openEuler 24.03

3. Steps to reproduce:
Using the T18 image, run the following command:
torchrun --nproc_per_node 8 \
--master_port 20037 \
-m examples.run_pa \
--model_path /data/Qwen3-30B-A3B \
--trust_remote_code \
--max_output_length 256

4. The complete error log is as follows:

[root@bms-41ba-0001 ascend-toolkit]# torchrun --nproc_per_node 8 --master_port 20037 -m examples.run_pa --model_path /data/Qwen3-30B-A3B --trust_remote_code --max_output_length 256
[2025-05-27 11:13:16,345] torch.distributed.run: [WARNING] 
[2025-05-27 11:13:16,345] torch.distributed.run: [WARNING] *****************************************
[2025-05-27 11:13:16,345] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2025-05-27 11:13:16,345] torch.distributed.run: [WARNING] *****************************************
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_moe to instantiate a model of type qwen2. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 547, in <module>
    pa_runner.warm_up()
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 273, in warm_up
    generate_req(req_list, self.model, self.max_batch_size, self.max_prefill_tokens, self.cache_manager)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 1143, in generate_req
    generate_token_with_clocking(model, cache_manager, batch)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 810, in generate_token_with_clocking
    res = generate_token(model, cache_manager, input_batch_in)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 587, in generate_token
    logits = model.forward(
             ^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 297, in forward
    res = self.model.forward(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/models/base/flash_causal_lm.py", line 497, in forward
    acl_inputs, acl_param = self.prepare_inputs_for_ascend(input_ids, position_ids, is_prefill, kv_cache,
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2_moe/flash_causal_qwen2_moe.py", line 217, in prepare_inputs_for_ascend
    attention_mask = self.attn_mask.get_attn_mask(pad_maxs, kv_cache[0][0].dtype, kv_cache[0][0].device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/utils/layers/attention/attention_mask.py", line 64, in get_attn_mask
    self.update_attn_cache(dtype, device, max_s, mini_type)
  File "/usr/local/Ascend/atb-models/atb_llm/utils/layers/attention/attention_mask.py", line 56, in update_attn_cache
    mask_atten_cache = torch.masked_fill(torch.zeros(size=(seqlen, seqlen)), bias_cache, mask_value)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 68719476736 bytes. Error code 12 (Cannot allocate memory)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 547, in <module>
    pa_runner.warm_up()
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 273, in warm_up
    generate_req(req_list, self.model, self.max_batch_size, self.max_prefill_tokens, self.cache_manager)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 1143, in generate_req
    generate_token_with_clocking(model, cache_manager, batch)
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 810, in generate_token_with_clocking
    res = generate_token(model, cache_manager, input_batch_in)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/examples/server/generate.py", line 587, in generate_token
    logits = model.forward(
             ^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 297, in forward
    res = self.model.forward(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/models/base/flash_causal_lm.py", line 497, in forward
    acl_inputs, acl_param = self.prepare_inputs_for_ascend(input_ids, position_ids, is_prefill, kv_cache,
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2_moe/flash_causal_qwen2_moe.py", line 217, in prepare_inputs_for_ascend
    attention_mask = self.attn_mask.get_attn_mask(pad_maxs, kv_cache[0][0].dtype, kv_cache[0][0].device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/utils/layers/attention/attention_mask.py", line 64, in get_attn_mask
    self.update_attn_cache(dtype, device, max_s, mini_type)
  File "/usr/local/Ascend/atb-models/atb_llm/utils/layers/attention/attention_mask.py", line 56, in update_attn_cache
    mask_atten_cache = torch.masked_fill(torch.zeros(size=(seqlen, seqlen)), bias_cache, mask_value)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 68719476736 bytes. Error code 12 (Cannot allocate memory)
[ERROR] 2025-05-27-11:17:48 (PID:304, Device:5, RankID:-1) ERR99999 UNKNOWN application exception
[ERROR] 2025-05-27-11:17:49 (PID:301, Device:2, RankID:-1) ERR99999 UNKNOWN application exception
[2025-05-27 11:17:56,376] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 299 closing signal SIGTERM
[2025-05-27 11:17:56,376] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 300 closing signal SIGTERM
[2025-05-27 11:17:56,376] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 301 closing signal SIGTERM
[2025-05-27 11:17:56,377] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 302 closing signal SIGTERM
[2025-05-27 11:17:56,377] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 303 closing signal SIGTERM
[2025-05-27 11:17:56,377] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 305 closing signal SIGTERM
[2025-05-27 11:17:56,377] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 306 closing signal SIGTERM
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[2025-05-27 11:18:14,933] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 5 (pid: 304) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples.run_pa FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-27_11:17:56
  host      : bms-41ba-0001
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 304)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[root@bms-41ba-0001 ascend-toolkit]# /usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
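
A rough sanity check on the number in the error message, offered as a sketch rather than a confirmed diagnosis: the traceback ends at `torch.zeros(size=(seqlen, seqlen))` in `attention_mask.py`, which allocates a square tensor on the CPU. Assuming PyTorch's default dtype of float32 (4 bytes per element) has not been changed, the side length can be recovered from the requested 68719476736 bytes as below. The helper name `square_mask_side` is made up for illustration and is not part of atb-models.

```python
import math


def square_mask_side(num_bytes: int, bytes_per_element: int = 4) -> int:
    """Return the side length of a square tensor that occupies num_bytes.

    bytes_per_element=4 assumes float32, the default dtype of torch.zeros
    (an assumption; the log does not state the dtype).
    """
    elements = num_bytes // bytes_per_element
    side = math.isqrt(elements)
    # Reject byte counts that do not correspond to an exact square tensor.
    if side * side * bytes_per_element != num_bytes:
        raise ValueError("byte count does not match a square tensor of this dtype")
    return side


requested = 68719476736  # byte count taken from the RuntimeError message
print(square_mask_side(requested))  # 131072
print(requested / 2**30)            # 64.0 (GiB)
```

Under that assumption the mask is 131072 x 131072, i.e. 64 GiB of host memory in each process that builds it, and the command launches 8 processes. So it may be worth checking which sequence-length setting ends up as `seqlen` here (in the traceback it appears to be passed in via `pad_maxs`), and how much free host RAM the machine has.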
