MindSpore / mindspore

[ST][MS][MF][baichuan2_13b][910b1 8p] Network training: loss is 0 on the first 6 cards and nan on the last 2 cards

DONE
Bug-Report
Created on 2023-12-25 17:47

Describe the current behavior (Mandatory)

The baichuan2_13b network is trained on a 910B1 environment, single node with 8 cards. The loss is 0 on the first 6 cards and nan on the last 2 cards.
Model repository: https://gitee.com/mindspore/mindformers/tree/dev/research/baichuan2

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

Failing version:
run:Milan_C15/20231215
MindSpore version: r2.2_f44abe5c4aad3ad647ec3390346c315e20a5e4c8_20231222041529
MindFormers version: dev_b97a2f162bc005b173d2a28dc1fcb5791f9a6833_20231222121530

OK version:
Version 2.2.10.b100 still works.

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

Test case repository path: MindFormers_Test/cases/baichuan2/13b/train/
Test case:
test_mf_baichuan2_13b_train_belle_8p_0001

Steps to reproduce the issue (Mandatory)

1.Get the code from mindformers
2.cd mindformers/research
3.bash run_singlenode.sh '''python ./baichuan2/run_baichuan2.py --config ./research/baichuan2/run_baichuan2_13b_910b.yaml --load_checkpoint /home/workspace/large_model_ckpt//baichuan2/13b/train/ --auto_trans_ckpt True --use_parallel True --run_mode finetune --train_data /home/workspace/large_model_dataset//baichuan2/belle512.mindrecord''' /home/workspace/config/hccl_8p.json [0,8] 8
4.Verify that the network trains successfully
5.Verify that the network loss meets the target

Describe the expected behavior (Mandatory)

The network trains successfully and the loss meets the target.

Related log / screenshot (Mandatory)

Last 2 cards:
2023-12-24 13:33:28,235 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  304/  312], loss:   nan, per_step_time: 3072ms, lr: 1.0413064e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:34,385 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:34,385 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  306/  312], loss:   nan, per_step_time: 3070ms, lr: 1.0249815e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:40,539 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:40,540 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  308/  312], loss:   nan, per_step_time: 3073ms, lr: 1.0127337e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:46,693 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:46,694 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  310/  312], loss:   nan, per_step_time: 3073ms, lr: 1.0045605e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:52,845 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:52,846 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  312/  312], loss:   nan, per_step_time: 3072ms, lr: 1.0004757e-06, overflow cond: False, loss_scale: 65536.0


First several cards:
2023-12-24 13:33:22,056 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:22,057 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  302/  312], loss: 0.000, per_step_time: 3076ms, lr: 1.0616963e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:28,210 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:28,210 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  304/  312], loss: 0.000, per_step_time: 3072ms, lr: 1.0413064e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:34,361 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:34,362 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  306/  312], loss: 0.000, per_step_time: 3072ms, lr: 1.0249815e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:40,518 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:40,519 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  308/  312], loss: 0.000, per_step_time: 3074ms, lr: 1.0127337e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:46,669 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:46,669 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  310/  312], loss: 0.000, per_step_time: 3071ms, lr: 1.0045605e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:52,821 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:52,822 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[  1/  1], step:[  312/  312], loss: 0.000, per_step_time: 3072ms, lr: 1.0004757e-06, overflow cond: False, loss_scale: 65536.0
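
To compare cards quickly, the step and loss fields can be pulled out of log lines like the ones above. The following is only a minimal sketch: the regular expression is written against the line format shown in this issue, and the helper names `parse_losses` and `classify` are hypothetical, not part of MindFormers.

```python
import math
import re

# Matches the "step:[  304/  312], loss:   nan" portion of a callback.py log line.
STEP_LOSS = re.compile(r"step:\[\s*(\d+)/\s*(\d+)\],\s*loss:\s*([^\s,]+)")


def parse_losses(lines):
    """Return (step, loss) pairs for every log line that reports a loss."""
    pairs = []
    for line in lines:
        match = STEP_LOSS.search(line)
        if match:
            # float() accepts "nan" as well as ordinary decimals such as "0.000".
            pairs.append((int(match.group(1)), float(match.group(3))))
    return pairs


def classify(loss):
    """Label a loss value as 'nan', 'zero', or 'ok'."""
    if math.isnan(loss):
        return "nan"
    if loss == 0.0:
        return "zero"
    return "ok"


# Shortened copies of lines from this issue (timestamps and tails elided).
sample = [
    "... INFO - { Epoch:[  1/  1], step:[  304/  312], loss:   nan, per_step_time: 3072ms, ...",
    "... WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.",
    "... INFO - { Epoch:[  1/  1], step:[  304/  312], loss: 0.000, per_step_time: 3072ms, ...",
]

for step, loss in parse_losses(sample):
    print(step, classify(loss))
```

Running it over each card's log file would show which cards report nan, which report 0, and which report a usable loss.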


Special notes for this issue (Optional)

Route to 张森镇

Comments (7)

zhangjie18 created the Bug-Report
zhangjie18 added the kind/bug label
zhangjie18 added the v2.2.0 label
zhangjie18 added the attr/function label
zhangjie18 added the stage/func-debug label
zhangjie18 added the sig/mindformers label

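Please assign maintainer to check this issue.
@zhangjie18

Thank you for your feedback. You can comment //mindspore-assistant to get help faster; for more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you hit a dynamic graph (PyNative) problem, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, the launch command that reproduces the error, and any other information needed to locate the problem.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.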
xiangminshan changed the assignee from xiangminshan to 森镇
linzhengshu changed the priority from Major to Critical

On 910B2 with the same MindSpore environment, 8-card training runs normally and the loss is normal.

2023-12-27 22:30:06,541 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[  1/  1], step:[  304/  312], loss: 1.906, per_step_time: 2824ms, lr: 1.0326355e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:06,541 - mindformers[mindformers/core/callback/callback.py:323] - INFO -   97.4% |████████████████████████████████████████████████  | 1.42 samples/s/p  0:00:22 }
2023-12-27 22:30:12,254 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:12,255 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[  1/  1], step:[  306/  312], loss: 2.106, per_step_time: 2853ms, lr: 1.0183458e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:12,255 - mindformers[mindformers/core/callback/callback.py:323] - INFO -   98.1% |█████████████████████████████████████████████████ | 1.40 samples/s/p  0:00:17 }
2023-12-27 22:30:17,918 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:17,919 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[  1/  1], step:[  308/  312], loss: 2.021, per_step_time: 2829ms, lr: 1.0081363e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:17,919 - mindformers[mindformers/core/callback/callback.py:323] - INFO -   98.7% |█████████████████████████████████████████████████ | 1.41 samples/s/p  0:00:11 }
2023-12-27 22:30:23,584 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:23,585 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[  1/  1], step:[  310/  312], loss: 1.970, per_step_time: 2829ms, lr: 1.0020068e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:23,585 - mindformers[mindformers/core/callback/callback.py:323] - INFO -   99.4% |█████████████████████████████████████████████████ | 1.41 samples/s/p  0:00:05 }
2023-12-27 22:30:29,236 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:29,237 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[  1/  1], step:[  312/  312], loss: 1.909, per_step_time: 2822ms, lr: 1e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:29,237 - mindformers[mindformers/core/callback/callback.py:323] - INFO -   100.0% |██████████████████████████████████████████████████| 1.42 samples/s/p  0:00:00 }
2023-12-27 22:30:29,242 - mindformers[mindformers/core/callback/callback.py:559] - INFO - ......Saving ckpt......
2023-12-27 22:32:27,106 - mindformers[mindformers/trainer/base_trainer.py:725] - INFO - .........Training Over!.............
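森镇 changed the task status from TODO to VALIDATION
森镇 added collaborator 森镇
森镇 changed the assignee from 森镇 to zhangjie18
森镇 added the rca/others label
森镇 added the rct/oldrelease label
森镇 added the ctl/rdselftest label
森镇 added the ctl/rdselftest label
森镇 changed the milestone from B-SIG-MindFormers to B-SolutionTest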

run:Milan_C15/20231221
MindSpore version: r2.2_4d0de24c0d3ba58c156dbb298b26ef1a953e9ec8_20231228161526
MindFormers version: dev_5d344b6c9dfa9aa3934b38bf889e79b792d5c259_20231229121530
This version still has the same problem.

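zhangjie18 added collaborator zhangjie18
zhangjie18 changed the assignee from zhangjie18 to 森镇
zhangjie18 removed collaborator 森镇
zhangjie18 changed the task status from VALIDATION to TODO
Xinrui Chen changed the assignee from 森镇 to huangxinliang
Xinrui Chen added collaborator 森镇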

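1. The loss of 0 on the first 6 cards is expected: pipeline parallelism (pp) is enabled, and only the last stage prints the loss.
2. The loss is nan because the infnan environment variable was set while pp was enabled. Unsetting the infnan environment variable fixes it.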
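
The two points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MindSpore's own logic: it assumes ranks are assigned to pipeline stages in equal contiguous blocks, which the comment does not state, and because the exact name of the infnan environment variable is not given, it simply searches for any variable whose name or value mentions "infnan". Both function names are hypothetical.

```python
import os


def ranks_in_stage(device_num, pipeline_stages, stage_id):
    """Ranks in one pipeline stage, ASSUMING equal contiguous blocks per stage."""
    if device_num % pipeline_stages != 0:
        raise ValueError("device_num must be divisible by pipeline_stages")
    per_stage = device_num // pipeline_stages
    start = stage_id * per_stage
    return list(range(start, start + per_stage))


def find_infnan_vars(env):
    """Entries whose name or value mentions 'infnan' (case-insensitive).

    The variable name is not given in the issue, so this searches
    instead of hard-coding one.
    """
    return {
        name: value
        for name, value in env.items()
        if "infnan" in name.lower() or "infnan" in value.lower()
    }


# 8 cards and 4 pipeline stages, as in the log lines of this issue.
# Stage 3 is the last stage; under the assumption above it holds 2 ranks.
print(ranks_in_stage(8, 4, 3))

# List any infnan-related settings in the current shell environment.
print(find_infnan_vars(dict(os.environ)))
```

Under that assumption, 8 cards split over 4 stages gives 2 cards per stage, which matches the observation that only 2 of the 8 cards report a loss other than 0.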

huangxinliang changed the assignee from huangxinliang to zhangjie18
huangxinliang changed the task status from TODO to VALIDATION

run:Milan_C15/20231221
MindSpore version: https://cmc-rnd.tools.huawei.com/cmcversion/index/releaseView?deltaId=9770601392965250&isSelect=Software
MindFormers version: latest dev_daily version
The loss is normal.

i-robot added the cmc-rnd label

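Regression version: run:Milan_C15/20231227
mindspore: version/202401/20240105/r2.2_20240105123212_99bcf63b3d758e6494be568ea9ab5d95bf82751a
MindFormers version: dev_dad6614dd3e7cdef8a81740797ea2885d586db3c_20240105121526
Regression steps: follow the steps in this issue
Basic problem: resolved
The continuous loss overflow is tracked in a separate issue, #I8UH6Q: [ST][MS][MF][baichuan2_13b][910b1 8p] loss keeps overflowing after enabling FA
Test conclusion: regression passed
Regression date: 2024.1.8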

i-robot added the foruda label
zhangjie18 changed the task status from VALIDATION to DONE

