name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
llama2/qwen/baichuan2 networks fail in bulk during training on the 910B environment
Model repository: https://gitee.com/mindspore/mindformers/blob/r1.0/docs/model_cards/llama2.md
Hardware environment: Ascend
/device ascend
CANN: Milan_C15/20240302
MindFormers (MF): r1.0_20240122200644_7f9b584852
MindSpore (MS): r2.2.12_20240313072958_20aaac50a2
Execution mode: PyNative / Graph
/mode pynative
/mode graph
Test case repository: MindFormers_Test/cases/llama2/7b/train/
Test cases:
test_mf_llama2_7b_lora_train_alpaca_8p_0001
test_mf_qwen_7b_train_infer_alpaca_8p_0001
test_mf_baichuan2_7b_lora_train_belle_8p_0001
test_mf_qwen_14b_lora_train_infer_alpaca_8p_0001
test_mf_qwen_14b_train_infer_alpaca_8p_0001
test_mf_internlm_7b_train_alpaca_8p_0001
test_mf_baichuan2_13b_train_belle_8p_0001
test_mf_llama2_13b_train_alpaca_8p_0001
Expected result: network training succeeds.
[ERROR] 2024-03-14 11:01:19,682 [mindformers/tools/cloud_adapter/cloud_monitor.py:43] wrapper: Traceback (most recent call last):
  File "/data/jenkins_workspace/TDT_deployment/MindFormers_Test/cases/llama2/7b/train/test_mf_llama2_7b_lora_train_alpaca_8p_0001/scripts/mf_parallel6/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
    result = run_func(*args, **kwargs)
  File "run_mindformer.py", line 143, in main
    create_task_trainer(config)
  File "run_mindformer.py", line 85, in create_task_trainer
    trainer.train(config, is_full_config=True)
  File "/data/jenkins_workspace/TDT_deployment/MindFormers_Test/cases/llama2/7b/train/test_mf_llama2_7b_lora_train_alpaca_8p_0001/scripts/mf_parallel6/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
    **kwargs)
  File "/data/jenkins_workspace/TDT_deployment/MindFormers_Test/cases/llama2/7b/train/test_mf_llama2_7b_lora_train_alpaca_8p_0001/scripts/mf_parallel6/mindformers/trainer/base_trainer.py", line 738, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 680, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 1020, in compile_and_run
    self.compile(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 998, in compile
    jit_config_dict=self._jit_config_dict, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1547, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Failure:operator RmsNorm init failed
----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
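Because several test cases fail at once, it may help to confirm that their logs end with the same error signature as the one above. The following is only a minimal triage sketch, assuming each case's log is available as plain text. The helper name `failed_operators` and the sample string are hypothetical and are not part of MindFormers, MindSpore, or the test suite.

```python
import re

# Matches the final error line shown in the traceback above, e.g.
# "RuntimeError: Failure:operator RmsNorm init failed".
OP_INIT_FAILED = re.compile(r"RuntimeError: Failure:operator (\w+) init failed")


def failed_operators(log_text):
    """Return operator names reported as 'init failed', in first-seen order, without duplicates."""
    seen = []
    for match in OP_INIT_FAILED.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical sample built from the last two lines of the log in this issue.
sample_log = (
    "result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())\n"
    "RuntimeError: Failure:operator RmsNorm init failed\n"
)
print(failed_operators(sample_log))
```

Running the helper over each failing case's log and comparing the returned lists is one way to check whether all cases share the RmsNorm failure before treating them as a single issue.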
Please route this issue to 何青林.
Please assign maintainer to check this issue.
@zhangjie18
This component issue affects 20+ test cases, including:
test_mindformers_parallel_export_predict_checkpoint_none_use_past_true
test_mindformers_parallel_load_lora_and_base_train_auto_trans_ckpt_true
...
Use the latest r1.0 build package, which adapts to the issue in MindSpore 2.2.12 where RMSNorm cannot be sharded.
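Regression version:
MindSpore 2.2.12.B002
Regression steps: follow the reproduction steps in this issue
Basic functionality: issue resolved
Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2024-03-16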