name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
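Training the baichuan2_13b network on a 910B1 environment, single node with 8 cards: the first 6 cards report a loss of 0 and the last 2 cards report a loss of nan.
Model repository: https://gitee.com/mindspore/mindformers/tree/dev/research/baichuan2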
Hardware environment: Ascend
/device ascend/
Failing version:
run: Milan_C15/20231215
MindSpore version: r2.2_f44abe5c4aad3ad647ec3390346c315e20a5e4c8_20231222041529
MindFormers version: dev_b97a2f162bc005b173d2a28dc1fcb5791f9a6833_20231222121530
Working version:
Version 2.2.10.b100 still works.
Execution mode: PyNative / Graph
/mode pynative
/mode graph
Test case path: MindFormers_Test/cases/baichuan2/13b/train/
Test case:
test_mf_baichuan2_13b_train_belle_8p_0001
1.get code from mindformers
2.cd mindformers/research
3.bash run_singlenode.sh '''python ./baichuan2/run_baichuan2.py --config ./research/baichuan2/run_baichuan2_13b_910b.yaml --load_checkpoint /home/workspace/large_model_ckpt//baichuan2/13b/train/ --auto_trans_ckpt True --use_parallel True --run_mode finetune --train_data /home/workspace/large_model_dataset//baichuan2/belle512.mindrecord''' /home/workspace/config/hccl_8p.json [0,8] 8
4. Verify that the network trains successfully
5. Verify that the network loss meets the target
Expected result: the network trains successfully and the loss meets the target.
Last 2 cards:
2023-12-24 13:33:28,235 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 304/ 312], loss: nan, per_step_time: 3072ms, lr: 1.0413064e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:34,385 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:34,385 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 306/ 312], loss: nan, per_step_time: 3070ms, lr: 1.0249815e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:40,539 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:40,540 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 308/ 312], loss: nan, per_step_time: 3073ms, lr: 1.0127337e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:46,693 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:46,694 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 310/ 312], loss: nan, per_step_time: 3073ms, lr: 1.0045605e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:52,845 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:52,846 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 312/ 312], loss: nan, per_step_time: 3072ms, lr: 1.0004757e-06, overflow cond: False, loss_scale: 65536.0
First 6 cards:
2023-12-24 13:33:22,056 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:22,057 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 302/ 312], loss: 0.000, per_step_time: 3076ms, lr: 1.0616963e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:28,210 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:28,210 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 304/ 312], loss: 0.000, per_step_time: 3072ms, lr: 1.0413064e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:34,361 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:34,362 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 306/ 312], loss: 0.000, per_step_time: 3072ms, lr: 1.0249815e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:40,518 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:40,519 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 308/ 312], loss: 0.000, per_step_time: 3074ms, lr: 1.0127337e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:46,669 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:46,669 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 310/ 312], loss: 0.000, per_step_time: 3071ms, lr: 1.0045605e-06, overflow cond: False, loss_scale: 65536.0
2023-12-24 13:33:52,821 - mindformers[mindformers/core/callback/callback.py:250] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-24 13:33:52,822 - mindformers[mindformers/core/callback/callback.py:316] - INFO - { Epoch:[ 1/ 1], step:[ 312/ 312], loss: 0.000, per_step_time: 3072ms, lr: 1.0004757e-06, overflow cond: False, loss_scale: 65536.0
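A minimal sketch for scanning per-card logs like the ones above and flagging nan or zero loss values. The regular expression is written against the log lines quoted in this issue only; the helper names `parse_loss` and `classify` are hypothetical and not part of MindFormers.

```python
import math
import re

# Matches the step counter and loss field as they appear in the
# callback log lines quoted in this issue (assumed format).
LOSS_RE = re.compile(r"step:\[\s*(\d+)/\s*(\d+)\],\s*loss:\s*([^,\s]+),")


def parse_loss(line):
    """Return (step, loss) for a loss log line, or None if no loss field is present."""
    match = LOSS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(3))


def classify(loss):
    """Label a loss value as 'nan', 'zero', or 'ok'."""
    if math.isnan(loss):
        return "nan"
    if loss == 0.0:
        return "zero"
    return "ok"


if __name__ == "__main__":
    sample = (
        "2023-12-24 13:33:28,235 - mindformers[mindformers/core/callback/callback.py:316] - INFO - "
        "{ Epoch:[  1/  1], step:[  304/  312], loss: nan, per_step_time: 3072ms, "
        "lr: 1.0413064e-06, overflow cond: False, loss_scale: 65536.0"
    )
    step, loss = parse_loss(sample)
    print(step, classify(loss))
```

Running the helper over each card's log separately makes the split visible at a glance: which cards report zero, which report nan, and which report a usable value.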
Assigning this to 张森镇.
Please assign maintainer to check this issue.
@zhangjie18
On 910B2 with the same MindSpore environment, 8-card training runs normally and the loss is normal:
2023-12-27 22:30:06,541 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[ 1/ 1], step:[ 304/ 312], loss: 1.906, per_step_time: 2824ms, lr: 1.0326355e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:06,541 - mindformers[mindformers/core/callback/callback.py:323] - INFO - 97.4% |████████████████████████████████████████████████ | 1.42 samples/s/p 0:00:22 }
2023-12-27 22:30:12,254 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:12,255 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[ 1/ 1], step:[ 306/ 312], loss: 2.106, per_step_time: 2853ms, lr: 1.0183458e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:12,255 - mindformers[mindformers/core/callback/callback.py:323] - INFO - 98.1% |█████████████████████████████████████████████████ | 1.40 samples/s/p 0:00:17 }
2023-12-27 22:30:17,918 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:17,919 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[ 1/ 1], step:[ 308/ 312], loss: 2.021, per_step_time: 2829ms, lr: 1.0081363e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:17,919 - mindformers[mindformers/core/callback/callback.py:323] - INFO - 98.7% |█████████████████████████████████████████████████ | 1.41 samples/s/p 0:00:11 }
2023-12-27 22:30:23,584 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:23,585 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[ 1/ 1], step:[ 310/ 312], loss: 1.970, per_step_time: 2829ms, lr: 1.0020068e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:23,585 - mindformers[mindformers/core/callback/callback.py:323] - INFO - 99.4% |█████████████████████████████████████████████████ | 1.41 samples/s/p 0:00:05 }
2023-12-27 22:30:29,236 - mindformers[mindformers/core/callback/callback.py:249] - WARNING - pipeline stages: 4 > 1, the loss on the last card is valid.
2023-12-27 22:30:29,237 - mindformers[mindformers/core/callback/callback.py:313] - INFO - { Epoch:[ 1/ 1], step:[ 312/ 312], loss: 1.909, per_step_time: 2822ms, lr: 1e-06, overflow cond: False, loss_scale: 65536.0
2023-12-27 22:30:29,237 - mindformers[mindformers/core/callback/callback.py:323] - INFO - 100.0% |██████████████████████████████████████████████████| 1.42 samples/s/p 0:00:00 }
2023-12-27 22:30:29,242 - mindformers[mindformers/core/callback/callback.py:559] - INFO - ......Saving ckpt......
2023-12-27 22:32:27,106 - mindformers[mindformers/trainer/base_trainer.py:725] - INFO - .........Training Over!.............
run: Milan_C15/20231221
MindSpore version: r2.2_4d0de24c0d3ba58c156dbb298b26ef1a953e9ec8_20231228161526
MindFormers version: dev_5d344b6c9dfa9aa3934b38bf889e79b792d5c259_20231229121530
This version still shows the same problem.
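1. The first 6 cards show a loss of 0 because pipeline parallelism (pp) is enabled: only the last stage prints the loss, so this is expected behavior.
2. The loss is nan because the infnan environment variable was set while pp was enabled. Unsetting the infnan environment variable resolves it.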
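To illustrate point 1, here is a small sketch of which cards fall in the last pipeline stage for this run (8 cards, 4 pipeline stages as reported in the log warning). It assumes ranks are assigned to stages in contiguous, equal-sized groups, which the issue itself does not state; the function name `last_stage_ranks` is hypothetical.

```python
def last_stage_ranks(device_num, pipeline_stages):
    """Return the ranks in the last pipeline stage.

    Assumes ranks are split into contiguous, equal-sized groups,
    one group per stage (an assumption, not confirmed by the issue).
    """
    if device_num % pipeline_stages != 0:
        raise ValueError("device_num must be divisible by pipeline_stages")
    per_stage = device_num // pipeline_stages
    last_stage = pipeline_stages - 1
    return [rank for rank in range(device_num) if rank // per_stage == last_stage]


if __name__ == "__main__":
    # 8 cards with 4 pipeline stages, as in this issue
    print(last_stage_ranks(8, 4))  # [6, 7]
```

Under that assumption only ranks 6 and 7 carry a meaningful loss, which matches the observed split of 6 cards printing 0 and 2 cards printing the actual value.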
run: Milan_C15/20231221
MindSpore version: https://cmc-rnd.tools.huawei.com/cmcversion/index/releaseView?deltaId=9770601392965250&isSelect=Software
MindFormers version: latest dev_daily build
Loss is normal.
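Regression version: run: Milan_C15/20231227
MindSpore: version/202401/20240105/r2.2_20240105123212_99bcf63b3d758e6494be568ea9ab5d95bf82751a
MindFormers version: dev_dad6614dd3e7cdef8a81740797ea2885d586db3c_20240105121526
Regression steps: follow the steps in this issue
Original problem: resolved
The persistent loss overflow is tracked in a separate issue, #I8UH6Q: [ST][MS][MF][baichuan2_13b][910b1 8p] loss keeps overflowing after FA is enabled
Test conclusion: regression passed
Regression date: 2024.1.8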