Bus error during hf2mcore conversion

While converting DeepSeek-R1-Distill-Llama-70B to mcore format for training (fine-tuning), the conversion crashes with a bus error.

# Command:
```bash
python convert_ckpt.py \
    --use-mcore-models \
    --model-type-hf llama2 \
    --model-type GPT \
    --load-model-type hf \
    --save-model-type mg \
    --params-dtype bf16 \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --load-dir /deepseek-ai/DeepSeek-R1-Distill-Llama-70B/ \
    --save-dir /deepseek-ai/DeepSeek-R1-Distill-Llama-70B-mcore/ \
    --tokenizer-model /deepseek-ai/DeepSeek-R1-Distill-Llama-70B/tokenizer.json
```

# Symptoms:
The CPU is not fully utilized. Memory usage is only about 500 GB, with roughly 0.8 to 1 TB of memory still free.
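
The figures above describe main memory only. A bus error in a multi-process PyTorch job is often associated with a full shared-memory filesystem (`/dev/shm`), because tensors passed between processes are typically placed there. Whether that applies to this conversion is an assumption; the log does not confirm it. The sketch below is a hypothetical helper (`report_free_space` is not part of MindSpeed-LLM) that prints how much space is left on `/dev/shm`, which can be compared against the size of the layers being sent from the loader to the saver.

```python
import os
import shutil


def report_free_space(path="/dev/shm"):
    """Return (total_gib, used_gib, free_gib) for the filesystem holding `path`.

    Hypothetical diagnostic helper, not part of MindSpeed-LLM.
    """
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


if __name__ == "__main__":
    # /dev/shm is the usual shared-memory mount on Linux; it may be absent
    # or very small inside a container, so check that it exists first.
    for path in ("/dev/shm", "/tmp"):
        if os.path.exists(path):
            total, used, free = report_free_space(path)
            print(f"{path}: total={total:.1f} GiB "
                  f"used={used:.1f} GiB free={free:.1f} GiB")
```

Running it once before the conversion and again while the layers are being sent shows whether `/dev/shm` fills up shortly before the crash.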

# Full terminal output:
```
[root@host-5jl98c MindSpeed-LLM]# python convert_ckpt.py --use-mcore-models --model-type-hf llama2 --model-type GPT --load-model-type hf --save-model-type mg --params-dtype bf16 --target-tensor-parallel-size 1 --target-pipeline-parallel-size 1 --load-dir /deepseek-ai/DeepSeek-R1-Distill-Llama-70B/ --save-dir /deepseek-ai/DeepSeek-R1-Distill-Llama-70B-mcore/ --tokenizer-model /deepseek-ai/DeepSeek-R1-Distill-Llama-70B/tokenizer.json
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:294: ImportWarning: 
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************
    
  warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:249: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
  warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  from pkg_resources import packaging  # type: ignore[attr-defined]
/usr/lib/python3.11/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
/usr/lib/python3.11/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('zope')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
Using /root/.cache/torch_extensions/py311_cpu as PyTorch extensions root...
INFO:root:Loaded mindspeed_llm.tasks.checkpoint.loader_hf as the loader.
INFO:root:Loaded mindspeed_llm.tasks.checkpoint.saver as the saver.
INFO:root:Starting saver...
INFO:root:Starting loader...
using world size: 1, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: Please specify --split when using --data-path. Using legacy default value of "969, 30, 1"
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
When context_parallel is not activated, kv_head_repeat_before_uly_alltoall would be set to False for reducing memory usage.
[INFO] Setting args.create_attention_mask_in_dataloader to False since reset_data=False or alibi_without_flash_attn=False or args.tokenizer_padding_side=right
------------------------ MindSpeed-LLM Arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
...
...
...
  wgrad_deferral_limit ............................ 0
  world_size ...................................... 1
  yaml_cfg ........................................ None
-------------------- end of MindSpeed-LLM Arguments ---------------------
 > padded vocab (size: 128256) with 0 dummy tokens (new size: 128256)
building GPT model ...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:40<00:00,  2.39s/it]
building GPT model ...
set layer states: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:28<00:00,  2.76it/s]
INFO:root:sending embeddings
INFO:root:sending transformer layer 0
INFO:root:sending transformer layer 1
INFO:root:sending transformer layer 2
INFO:root:sending transformer layer 3
INFO:root:sending transformer layer 4
Bus error (core dumped)
```
