1. Problem (with error log context):
Running Qwen-72B inference on a single 910B machine with eight cards. During the model preparation stage, NPU 0 goes OOM while the other cards carry no load.
/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/torch_npu/dynamo/init.py:18: UserWarning: Register eager implementation for the 'npu' backend of dynamo, as torch_npu was not compiled with torchair.
warnings.warn(
/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:77: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:77: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/7.0.0/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards:  22%|████████████████████████▌ | 18/82 [00:19<01:08, 1.07s/it]
Traceback (most recent call last):
File "/home/ljl/Qwen-main/predict.py", line 19, in <module>
model = AutoModelForCausalLM.from_pretrained("/home/ljl/qwen-72b",device_map="auto",max_memory=max_memory, trust_remote_code=True,bf16=True).eval()
File "/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
) = cls._load_pretrained_model(
File "/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/transformers/modeling_utils.py", line 807, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ljl/miniconda3/envs/Qwen/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 369, in set_module_tensor_to_device
new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
RuntimeError: NPU out of memory. Tried to allocate 386.00 MiB (NPU 0; 32.00 GiB total capacity; 31.57 GiB already allocated; 31.57 GiB current active; 265.07 MiB free; 31.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
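The traceback shows `device_map="auto"` filling NPU 0 completely before the load finishes. One common mitigation (a sketch, not a confirmed fix for this exact report) is to pass a `max_memory` dict that caps every card below its 32 GiB capacity, so accelerate's auto device map spreads the checkpoint shards across all eight NPUs. The `"27GiB"` per-card budget and the `"120GiB"` CPU offload budget below are illustrative assumptions, not measured values:

```python
# Build a max_memory map that leaves headroom on every NPU so that
# device_map="auto" distributes the 72B weights across all 8 cards
# instead of packing card 0 first. Per-card budget is an assumption.
def build_max_memory(num_npus=8, per_card="27GiB", cpu="120GiB"):
    mem = {i: per_card for i in range(num_npus)}  # integer keys = device indices
    mem["cpu"] = cpu  # optional CPU offload budget
    return mem

max_memory = build_max_memory()
# Then, as in the failing script:
# AutoModelForCausalLM.from_pretrained("/home/ljl/qwen-72b",
#     device_map="auto", max_memory=max_memory,
#     trust_remote_code=True, bf16=True).eval()
```

Leaving a few GiB of headroom per card also absorbs allocator fragmentation, which the error message itself hints at (`max_split_size_mb`).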
2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): 7.0.0
-- TensorFlow/PyTorch/MindSpore version: PyTorch 2.1.0, transformers 4.37.1, accelerate 0.26.1
-- Python version (e.g., Python 3.7.5): 3.9
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): CentOS 7
3. Test steps:
Ran the official Qwen inference script directly.
4. Log information:
xxxx
Please collect logs as appropriate for your environment using the methods below. For issues involving operator development, please also provide the UT/ST test logs and single-operator integration test logs.
How to provide logs:
Package the logs and upload them as an attachment. If the logs exceed the attachment size limit, upload them to an external drive and share the link.
See the wiki for how to collect them:
https://gitee.com/ascend/modelzoo/wikis/如何获取日志和计算图?sort_id=4097825
Hello, are you using a ModelZoo model? If it is not a ModelZoo model, please consult your FAE or PAE first.
Hi, I'd also like to try this model. Has the problem been solved?
Multi-card inference doesn't work; the cause seems to be that the process cannot switch between devices. Is there a solution?
Since CANN 6.0.rc1, torch_npu 2.1.0 and later already support single-process multi-card.
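With single-process multi-card support in place, an explicit `device_map` can also be built by hand if the automatic map still piles weights onto card 0. The sketch below assumes the original Qwen architecture's module names (`transformer.wte`, `transformer.h.<i>`, `transformer.ln_f`, `lm_head`) and an 80-block Qwen-72B; both the names and the layer count are assumptions, not verified against this exact checkpoint:

```python
# Hand-built device_map: split 80 transformer blocks evenly over 8 cards.
# Module names follow the original Qwen architecture (assumed).
def build_device_map(num_layers=80, num_devices=8):
    per_dev = (num_layers + num_devices - 1) // num_devices  # ceil split
    dmap = {"transformer.wte": 0, "transformer.rotary_emb": 0}
    for i in range(num_layers):
        dmap[f"transformer.h.{i}"] = min(i // per_dev, num_devices - 1)
    # Keep the final norm and lm_head with the last block to avoid an
    # extra cross-device transfer at the end of the forward pass.
    dmap["transformer.ln_f"] = num_devices - 1
    dmap["lm_head"] = num_devices - 1
    return dmap

device_map = build_device_map()
# AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
```

Each dict value is a device index, so card i holds blocks i*10 through i*10+9; transformers/accelerate place every named submodule on the card its entry points at.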