MindSpore / mindspore

[ST][MS][MF][2.2.12.B001][llama2/qwen/baichuan2/internlm][910B 8p] Network training fails across multiple test cases, RuntimeError: Failure:operator RmsNorm init failed

Status: DONE
Type: Bug-Report
Created: 2024-03-14 14:41
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

Training of the llama2/qwen/baichuan2 networks fails across multiple test cases in the 910B environment.
Model repository: https://gitee.com/mindspore/mindformers/blob/r1.0/docs/model_cards/llama2.md

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

cann:Milan_C15/20240302
MF:r1.0_20240122200644_7f9b584852
MS:r2.2.12_20240313072958_20aaac50a2

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

Test case repository path: MindFormers_Test/cases/llama2/7b/train/
Test cases:
test_mf_llama2_7b_lora_train_alpaca_8p_0001
test_mf_qwen_7b_train_infer_alpaca_8p_0001
test_mf_baichuan2_7b_lora_train_belle_8p_0001
test_mf_qwen_14b_lora_train_infer_alpaca_8p_0001
test_mf_qwen_14b_train_infer_alpaca_8p_0001
test_mf_internlm_7b_train_alpaca_8p_0001
test_mf_baichuan2_13b_train_belle_8p_0001
test_mf_llama2_13b_train_alpaca_8p_0001

Steps to reproduce the issue (Mandatory)

  1. Get the code from mindformers
  2. cd mindformers/scripts
  3. Change the dataset and weight paths in the configuration file to local paths
  4. bash run_distribute.sh /home/workspace/config/hccl_8p.json ./configs/llama2/run_llama2_7b_lora_910b.yaml [0,8] finetune
  5. Verify whether the network trains successfully
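
For anyone scripting the reproduction across several configs, the launch command in step 4 can be assembled programmatically. The following is only a minimal sketch: the helper name `build_finetune_command` and its parameter names are hypothetical, and the argument order simply mirrors the command line quoted in step 4 rather than any documented interface of `run_distribute.sh`.

```python
import shlex


def build_finetune_command(rank_table, config, device_range, run_mode="finetune"):
    """Return the argv list for the run_distribute.sh call shown in step 4.

    The positional order (rank table, config, device range, run mode) is an
    assumption copied from the command in this issue.
    """
    start, end = device_range
    return [
        "bash",
        "run_distribute.sh",
        rank_table,
        config,
        f"[{start},{end}]",
        run_mode,
    ]


cmd = build_finetune_command(
    "/home/workspace/config/hccl_8p.json",
    "./configs/llama2/run_llama2_7b_lora_910b.yaml",
    (0, 8),
)

# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed to `subprocess.run(cmd, cwd=...)` from the `mindformers/scripts` directory without further quoting.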

Describe the expected behavior (Mandatory)

The network trains successfully.

Related log / screenshot (Mandatory)

[ERROR] 2024-03-14 11:01:19,682 [mindformers/tools/cloud_adapter/cloud_monitor.py:43] wrapper: Traceback (most recent call last):
 File "/data/jenkins_workspace/TDT_deployment/MindFormers_Test/cases/llama2/7b/train/test_mf_llama2_7b_lora_train_alpaca_8p_0001/scripts/mf_parallel6/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "run_mindformer.py", line 143, in main
   create_task_trainer(config)
 File "run_mindformer.py", line 85, in create_task_trainer
   trainer.train(config, is_full_config=True)
 File "/data/jenkins_workspace/TDT_deployment/MindFormers_Test/cases/llama2/7b/train/test_mf_llama2_7b_lora_train_alpaca_8p_0001/scripts/mf_parallel6/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
   **kwargs)
 File "/data/jenkins_workspace/TDT_deployment/MindFormers_Test/cases/llama2/7b/train/test_mf_llama2_7b_lora_train_alpaca_8p_0001/scripts/mf_parallel6/mindformers/trainer/base_trainer.py", line 738, in training_process
   initial_epoch=config.runner_config.initial_epoch)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train
   initial_epoch=initial_epoch)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
   func(self, *args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train
   cb_params, sink_size, initial_epoch, valid_infos)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
   outputs = train_network(*inputs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 680, in __call__
   out = self.compile_and_run(*args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 1020, in compile_and_run
   self.compile(*args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 998, in compile
   jit_config_dict=self._jit_config_dict, *args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1547, in compile
   result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Failure:operator RmsNorm init failed

----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
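
Since many test cases fail with the same final line, it can help to pull just the exception type and message out of each log when triaging. The sketch below is one possible way to do that, not part of MindFormers or the test framework: the function name `final_error` and the regular expression are illustrative assumptions, and the sample text is copied from the traceback above.

```python
import re

# Matches lines such as "RuntimeError: Failure:operator RmsNorm init failed".
ERROR_LINE = re.compile(r"^(?P<kind>\w+(?:Error|Exception)):\s*(?P<message>.+)$")


def final_error(log_text):
    """Return (kind, message) for the last exception line in the log, or None."""
    found = None
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            found = (match.group("kind"), match.group("message"))
    return found


sample = """\
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1547, in compile
   result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Failure:operator RmsNorm init failed
"""

print(final_error(sample))
```

Grouping the returned tuples across the failed cases listed above would show whether they all share the `RmsNorm init failed` message.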

Special notes for this issue (Optional)

Assign to 何青林

Comments (6)

zhangjie18 created the Bug-Report
zhangjie18 added the kind/bug label
zhangjie18 added the v2.2.13 label
zhangjie18 added the attr/function label
zhangjie18 added the stage/func-debug label
zhangjie18 added the sig/mindformers label
zhangjie18 added the device/ascend label

Please assign a maintainer to check this issue.
@zhangjie18

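Thank you for your feedback. You can comment //mindspore-assistant to get help faster; see the label list for more labels.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch-to-MindSpore API mapping table
  3. If you hit a PyNative (dynamic graph) problem, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate it.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, make sure the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch command that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.
zhangjie18 edited the description
zhangjie18 edited the title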

Component impact: 20+ test cases

test_mindformers_parallel_export_predict_checkpoint_none_use_past_true
test_mindformers_parallel_load_lora_and_base_train_auto_trans_ckpt_true
...

zhangjie18 removed the v2.2.13 label
zhangjie18 added the v2.2.12 label
xiangminshan changed the assignee from xiangminshan to Lin
xiangminshan removed the v2.2.12 label
xiangminshan added the v2.2.12 label

Use the latest r1.0 build package, which adapts to the problem in 2.2.12 that RMSNorm cannot be sharded.
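
For readers unfamiliar with the operator named in the error, the following is a minimal pure-Python sketch of what RMSNorm computes, using the standard definition (divide by the root mean square over the hidden dimension, then scale by a learned weight). It is not the MindSpore `RmsNorm` kernel, the `eps` default is an assumption, and it does not by itself explain the specific sharding limitation mentioned in this comment.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Normalize x by its root mean square, then scale element-wise by gamma.

    x and gamma are equal-length lists representing one hidden vector and
    its learned weight; eps (assumed value) guards against division by zero.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


hidden = [1.0, -2.0, 3.0, -4.0]
weights = [1.0, 1.0, 1.0, 1.0]
normalized = rms_norm(hidden, weights)

# With unit weights, the mean of the squared outputs is very close to 1.
print(sum(v * v for v in normalized) / len(normalized))
```

Note that the mean is taken over the whole hidden vector, so every output element depends on all input elements along that axis.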

Lin added collaborator Lin
Lin changed the assignee from Lin to xiangminshan
Lin removed collaborator xiangminshan
Lin changed the milestone from B-SIG-MindFormers to B-SolutionTest
Lin changed the assignee from xiangminshan to zhangjie18
Lin added collaborator xiangminshan
Lin removed the kind/bug label
Lin removed the attr/function label
Lin added the ctl/codereview label
zhunaipan changed the task status from TODO to VALIDATION

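Regression version:
MindSpore 2.2.12.B002

Regression steps: follow the reproduction steps in this issue
Basic functionality: the problem is resolved
[Three screenshots attached]
Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2024-03-16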

i-robot added the foruda label
zhongjicheng changed the task status from VALIDATION to DONE
sunjiawei999 copied the task: I9AXO6
