MindSpore
/
mindformers
7623
[master] Balanced loading for weights 2.0
Merged
AAA碧根果批发赵少:master
MindSpore:master
AAA碧根果批发赵少
Created 2025-11-11 15:21
1. Weight 2.0

---

### PR source

- [ ] Issue (please link the issue)
- [x] Feature requirement
- [ ] Bug report
- [ ] Other (community developers, etc.)

### Change description (what was done, what changed)

- **Reason for change:** Balanced loading for weights 2.0
- **Changes:**
  - Add the `get_rank_params` and `parse_total_shard_metadata` functions
  - Adapt the main flow to balanced loading

### Functional verification

Loss at save time:

```text
2025-11-10 09:20:14,266 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:351] - INFO - ....... Start to save model weight .......
2025-11-10 09:20:23,785 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:374] - INFO - Model checkpoint successfully saved at '/work/affinity/output/checkpoint/iteration_00000040/deepseekv3-model-0000000-0000004.safetensors'.
2025-11-10 09:20:23,786 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:390] - WARNING - ....... Start to save optimizer weight .......
2025-11-10 09:20:43,523 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:401] - INFO - Optimizer checkpoint successfully saved at '/work/affinity/output/checkpoint/iteration_00000040/deepseekv3-opt-0000000-0000004.safetensors'.
2025-11-10 09:20:43,524 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:407] - INFO - ...... Start saving common info ......
2025-11-10 09:20:43,525 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:103] - INFO - Saving common info to '/work/affinity/output/checkpoint/iteration_00000040/common.json'.
2025-11-10 09:20:43,527 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:114] - INFO - 'common.json' successfully saved at: '/work/affinity/output/checkpoint/iteration_00000040/common.json'.
2025-11-10 09:20:43,527 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:413] - INFO - The 'common.json' is saved at '/work/affinity/output/checkpoint/iteration_00000040/common.json'.
2025-11-10 09:20:43,528 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:414] - INFO - Save common info cost time: 0.003s.
2025-11-10 09:20:43,528 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:433] - INFO - ...... Start saving metadata ......
2025-11-10 09:20:43,642 - mindformers/work/affinity/output/log[mindformers/tools/utils.py:744] - INFO - Wait Rank_0 is saving 'metadata.json' ...
2025-11-10 09:20:43,816 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:445] - INFO - The 'metadata.json' saved successfully at '/work/affinity/output/checkpoint/iteration_00000040/metadata.json'.
2025-11-10 09:20:43,818 - mindformers/work/affinity/output/log[mindformers/tools/utils.py:744] - INFO - Wait All ranks for sync save checkpoint.
2025-11-10 09:20:43,819 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:424] - INFO - Rank_0 execute finalize func.
2025-11-10 09:20:43,819 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:323] - INFO - save checkpoint tracker file to /work/affinity/output/checkpoint/latest_checkpointed_iteration.txt
2025-11-10 09:20:43,820 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:330] - INFO - successfully sync saved checkpoint from step '40' to '/work/affinity/output/checkpoint'.
2025-11-10 09:20:43,820 - mindformers/work/affinity/output/log[mindformers/checkpoint/checkpoint.py:427] - INFO - Save checkpoint cost time: 29.552s.
2025-11-10 09:20:47,568 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 41/40000], loss: 14.950890, lm_loss: 11.429537, load_balancing_loss: 1.451740, mtp_loss: 3.516998, per_step_time: 3739ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.341642]
2025-11-10 09:20:47,569 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.06686 samples/s/p 1 day, 17:30:12 }
2025-11-10 09:20:50,944 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 42/40000], loss: 14.988756, lm_loss: 11.460757, load_balancing_loss: 1.460897, mtp_loss: 3.523616, per_step_time: 3357ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [8.174157]
2025-11-10 09:20:50,945 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.07446 samples/s/p 1 day, 13:16:01 }
2025-11-10 09:20:54,524 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 43/40000], loss: 14.917091, lm_loss: 11.395867, load_balancing_loss: 1.481625, mtp_loss: 3.516779, per_step_time: 3561ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.453582]
2025-11-10 09:20:54,525 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.07020 samples/s/p 1 day, 15:31:37 }
2025-11-10 09:20:58,328 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 44/40000], loss: 15.058564, lm_loss: 11.534687, load_balancing_loss: 1.463607, mtp_loss: 3.519486, per_step_time: 3786ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.328163]
2025-11-10 09:20:58,329 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.06603 samples/s/p 1 day, 18:01:23 }
2025-11-10 09:21:01,940 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 45/40000], loss: 14.992240, lm_loss: 11.464610, load_balancing_loss: 1.453902, mtp_loss: 3.523268, per_step_time: 3594ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.703107]
2025-11-10 09:21:01,942 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.06956 samples/s/p 1 day, 15:53:29 }
```

- Resume from checkpoint (parallel configuration unchanged)

```text
2025-11-11 15:14:32,772 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 41/40000], loss: 14.950890, lm_loss: 11.429537, load_balancing_loss: 1.451740, mtp_loss: 3.516998, per_step_time: 19189ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.341642]
2025-11-11 15:14:32,773 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.01303 samples/s/p 8 days, 20:59:47 }
2025-11-11 15:14:33,687 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 42/40000], loss: 14.988756, lm_loss: 11.460757, load_balancing_loss: 1.460897, mtp_loss: 3.523616, per_step_time: 556ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [8.174157]
2025-11-11 15:14:33,689 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.44906 samples/s/p 6:10:45 }
2025-11-11 15:14:34,260 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 43/40000], loss: 14.917091, lm_loss: 11.395867, load_balancing_loss: 1.481625, mtp_loss: 3.516779, per_step_time: 556ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.453582]
2025-11-11 15:14:34,261 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.44952 samples/s/p 6:10:22 }
2025-11-11 15:14:34,833 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 44/40000], loss: 15.058564, lm_loss: 11.534687, load_balancing_loss: 1.463607, mtp_loss: 3.519486, per_step_time: 556ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.328163]
2025-11-11 15:14:34,835 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.44918 samples/s/p 6:10:38 }
2025-11-11 15:14:35,406 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 45/40000], loss: 14.992240, lm_loss: 11.464610, load_balancing_loss: 1.453902, mtp_loss: 3.523268, per_step_time: 555ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [9.703107]
```

- Resume from checkpoint (mp4 -> mp2pp2, baseline)

```text
2025-11-12 10:12:46,393 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 21/20000], loss: 14.975267, lm_loss: 11.450785, load_balancing_loss: 1.454789, mtp_loss: 3.520119, per_step_time: 19876ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.2264824]
2025-11-12 10:12:46,395 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.02516 samples/s/p 4 days, 14:18:23 }
2025-11-12 10:12:47,875 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:12:47,878 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 22/20000], loss: 14.998457, lm_loss: 11.473292, load_balancing_loss: 1.473627, mtp_loss: 3.520744, per_step_time: 1134ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.6079373]
2025-11-12 10:12:47,880 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.44058 samples/s/p 6:17:52 }
2025-11-12 10:12:48,776 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:12:48,779 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 23/20000], loss: 14.970339, lm_loss: 11.447798, load_balancing_loss: 1.459102, mtp_loss: 3.518163, per_step_time: 884ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.128496]
2025-11-12 10:12:48,780 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.56501 samples/s/p 4:54:38 }
2025-11-12 10:12:49,676 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:12:49,679 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 24/20000], loss: 14.963200, lm_loss: 11.439596, load_balancing_loss: 1.465183, mtp_loss: 3.519207, per_step_time: 885ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [6.8164854]
2025-11-12 10:12:49,680 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.56484 samples/s/p 4:54:42 }
2025-11-12 10:12:50,577 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:12:50,579 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 25/20000], loss: 14.956818, lm_loss: 11.432056, load_balancing_loss: 1.461017, mtp_loss: 3.520378, per_step_time: 884ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.8051414]
2025-11-12 10:12:50,581 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.56513 samples/s/p 4:54:33 }
```

- Resume from checkpoint (mp4 -> pp2)

```text
2025-11-12 10:05:23,452 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 21/20000], loss: 14.975267, lm_loss: 11.450785, load_balancing_loss: 1.454789, mtp_loss: 3.520119, per_step_time: 23134ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.2264824]
2025-11-12 10:05:23,453 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.02161 samples/s/p 5 days, 8:23:28 }
2025-11-12 10:05:24,903 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:05:24,906 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 22/20000], loss: 14.998457, lm_loss: 11.473292, load_balancing_loss: 1.473627, mtp_loss: 3.520744, per_step_time: 1076ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.6079373]
2025-11-12 10:05:24,908 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.46441 samples/s/p 5:58:28 }
2025-11-12 10:05:25,803 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:05:25,806 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 23/20000], loss: 14.970339, lm_loss: 11.447798, load_balancing_loss: 1.459102, mtp_loss: 3.518163, per_step_time: 884ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.128496]
2025-11-12 10:05:25,807 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.56545 samples/s/p 4:54:24 }
2025-11-12 10:05:26,703 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:05:26,706 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 24/20000], loss: 14.963200, lm_loss: 11.439596, load_balancing_loss: 1.465183, mtp_loss: 3.519207, per_step_time: 884ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [6.8164854]
2025-11-12 10:05:26,707 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.56552 samples/s/p 4:54:21 }
2025-11-12 10:05:27,603 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:391] - WARNING - pipeline stages: 2 > 1, the loss on the last card is valid.
2025-11-12 10:05:27,606 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:561] - INFO - { Epoch:[ 1/ 1], step:[ 25/20000], loss: 14.956818, lm_loss: 11.432056, load_balancing_loss: 1.461017, mtp_loss: 3.520378, per_step_time: 884ms, lr: 1e-06, overflow cond: False, loss_scale: 1.0, global_norm: [7.8051414]
2025-11-12 10:05:27,607 - mindformers/work/affinity/output/log[mindformers/core/callback/callback.py:578] - INFO - 0.1% | | 0.56551 samples/s/p 4:54:21 }
```

### check list

- [ ] **Has the code been reviewed**
- [ ] **Is it covered by UT test cases** (if not, explain why: ____________________)
- [ ] **Does it involve external interface changes** (if so, complete the change notes)
- [ ] **Does it modify public components or external interfaces; if so, give the scope of the change and an impact assessment** (describe in detail)
- [Secure coding checklist](https://gitee.com/mindspore/mindformers/wikis/%E5%AE%89%E5%85%A8%E7%BC%96%E7%A0%81%E6%A3%80%E8%A7%86)
  - [x] Pass
  - [ ] Fail
  - Network red lines: communication matrix | listening on all-zero addresses | undisclosed interfaces | undisclosed public network addresses
  - Privacy data: personal names, employee IDs, etc.
  - Unsafe functions: use of eval/pickle/yaml | use of subprocess.run and os.system
  - File checks: file permissions | temporary file storage | path validation

### Code review

- **Requirements:**
  - Merges of more than 1000 lines require a review meeting, with the review conclusion attached
  - No merge without functional verification
  - No merge while self-check items are incomplete
  - Without UT coverage, merging is not allowed in principle; the reason must be explained separately
  - No merge if the source is unmarked or the change description is unclear

### Change notes

- [ ] **Documentation changes**
- [ ] **Change notice (required when existing interfaces change; version interface announcement):**
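
The functional verification above rests on the resumed runs reporting the same per-step loss as the run that saved the checkpoint. A minimal sketch of how that comparison could be automated is given below. The helper names `losses_by_step` and `mismatched_steps` are hypothetical and not part of MindFormers, and the regular expression assumes the `step:[ N/TOTAL], loss: X` layout seen in the logs of this PR. The sample strings are shortened excerpts of those logs.

```python
import re

# Assumed log layout (taken from the logs in this PR):
#   "... step:[ 41/40000], loss: 14.950890, lm_loss: ..."
# Anchoring on "step:[...]," keeps lm_loss / mtp_loss from being matched.
STEP_LOSS = re.compile(r"step:\[\s*(\d+)/\s*\d+\],\s*loss:\s*([0-9.]+)")


def losses_by_step(log_text: str) -> dict[int, float]:
    """Return {step: total loss} for every training-step entry in log_text."""
    return {int(step): float(loss) for step, loss in STEP_LOSS.findall(log_text)}


def mismatched_steps(baseline: dict[int, float],
                     resumed: dict[int, float],
                     tol: float = 0.0) -> list[int]:
    """Return the steps present in both runs whose losses differ by more than tol."""
    shared = baseline.keys() & resumed.keys()
    return sorted(s for s in shared if abs(baseline[s] - resumed[s]) > tol)


# Shortened excerpts: the run that saved the checkpoint ...
saved_run = (
    "{ Epoch:[ 1/ 1], step:[ 41/40000], loss: 14.950890, lm_loss: 11.429537, per_step_time: 3739ms "
    "{ Epoch:[ 1/ 1], step:[ 42/40000], loss: 14.988756, lm_loss: 11.460757, per_step_time: 3357ms "
)
# ... and the run resumed from it with the parallel configuration unchanged.
resumed_run = (
    "{ Epoch:[ 1/ 1], step:[ 41/40000], loss: 14.950890, lm_loss: 11.429537, per_step_time: 19189ms "
    "{ Epoch:[ 1/ 1], step:[ 42/40000], loss: 14.988756, lm_loss: 11.460757, per_step_time: 556ms "
)

baseline = losses_by_step(saved_run)
resumed = losses_by_step(resumed_run)
print(mismatched_steps(baseline, resumed))  # → []
```

An empty list means every step present in both logs reports an identical loss. A nonzero `tol` could be used if small numerical drift is expected, for example after a change in parallel strategy.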
How to merge this Pull Request manually
git checkout master
git pull https://gitee.com/yiyison/mindformers.git master
git push origin master
Comments: 153 | Commits: 1 | Files: 12 | Code issues: 0
Labels
approved
lgtm
mindspore-cla/yes
ci-pipeline-passed
SC-SUCC
pr-check-pass
ai-reviewed