MindSpore / mindspore

[ST][MS][master][pangu series][ascend][multi-node] Network training failed

DONE
Bug-Report
Created on 2023-06-06 20:11
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

[master][pangu_alpha][gpu][8p] Network training failed
https://e.gitee.com/mind_spore/repos/mindspore/models/tree/master/official/nlp/Pangu_alpha

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device /ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

Pass version:
master_20230521222838_e260d4c8
Problem MindSpore version:
_r2.1_master_20230602121730_cf123905
Run package:
HISI_C30/20230518

  • Execution Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test-case repo path: solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/
Test cases:
ms_pangu_alpha_ascend_train_32p_0001
ms_pangu_moe_alltoall_train_check_loss_910_32p_0001
ms_pangu_moe_pipline_alltoall_train_check_loss_910_32p_0001
ms_pangu_alpha_pipeline_ascend_train_32p_0001

Steps to reproduce the issue (Mandatory)

  1. get code from models
  2. cd /official/nlp/pangu_alpha
  3. bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MOD
  4. The pangu network runs normally and performance reaches 6500
  5. Verify whether the network training succeeded
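To make step 3 concrete, the placeholder arguments can be spelled out as shell variables. All of the values below are hypothetical examples for illustration only; they are not taken from this report and must be adapted to the actual cluster and dataset:

```shell
# Hypothetical argument values for scripts/run_distributed_train_gpu.sh;
# every value here is an assumed placeholder.
RANK_SIZE=8                    # total number of devices
HOSTFILE=/path/to/hostfile     # hostfile listing the participating nodes
DATASET=/path/to/dataset       # training data directory
PER_BATCH=16                   # per-device batch size
MOD=2.6B                       # model configuration to train
echo "bash scripts/run_distributed_train_gpu.sh $RANK_SIZE $HOSTFILE $DATASET $PER_BATCH $MOD"
```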

Describe the expected behavior (Mandatory)

The pangu_alpha network runs normally and performance reaches 6500.

Related log / screenshot (Mandatory)


[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.679 [exchanger_network.cc:280][64810][136514][Wait][AllServerSocketEstab]errNo[0x000000000500000b] server : device[0] rank[0] get socket timeout, total[4] remain[1]
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.735 [exchanger_network.cc:344][64810][136514]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.749 [exchanger_network.cc:107][64810][136514][ExchangerNetwork][Init]rank[0] device[0] wait all socket establish 1200 second failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.777 [exchanger_network.cc:686][64810][136514]Some NPUs get socket timeout, the details are as follows:
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.791 [exchanger_network.cc:687][64810][136514]   _____________________________________________________
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.802 [exchanger_network.cc:688][64810][136514]   |device[0] userrank[0] exchanger Status: run_step[2]|
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.812 [exchanger_network.cc:689][64810][136514]   |  dest_dev  |  userrank  |    Role    | connStatus |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.836 [exchanger_network.cc:690][64810][136514]   |------------|------------|------------|------------|
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.855 [exchanger_network.cc:709][64810][136514]   |         0  |         0  |     NA     |     NA     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.867 [exchanger_network.cc:709][64810][136514]   |         1  |         1  |   server   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.877 [exchanger_network.cc:709][64810][136514]   |         2  |         2  |   client   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.886 [exchanger_network.cc:709][64810][136514]   |         3  |         3  |   server   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.910 [exchanger_network.cc:709][64810][136514]   |         4  |         4  |   client   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.919 [exchanger_network.cc:709][64810][136514]   |         5  |         5  |   server   |     NO     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.928 [exchanger_network.cc:709][64810][136514]   |         6  |         6  |   client   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.936 [exchanger_network.cc:709][64810][136514]   |         7  |         7  |   server   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.945 [exchanger_network.cc:714][64810][136514]   ___________________________________________________________________
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.954 [exchanger_network.cc:715][64810][136514]the connection failure between this device and the target device may be due to the following reasons:
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.963 [exchanger_network.cc:716][64810][136514]1. the connection between this device and the target device is abnormal.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.972 [exchanger_network.cc:717][64810][136514]2. an exception occurred at the target devices.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.982 [exchanger_network.cc:719][64810][136514]3. the time difference between the execution of hcom on this device and the target device exceeds the timeout threshold. make sure this by keyworld [Entry-]
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.992 [exchanger_network.cc:721][64810][136514]4. the behavior of executing the calculation graph on this device and the target device is inconsistent.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.896.004 [comm_factory.cc:1470][64810][136514][Get][ExchangerNetwork]exchanger init failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.621 [comm_factory.cc:223][64810][136514][Create][CommOuter]exchangerNetwork create failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.652 [hccl_impl.cc:3820][64810][136514][Create][OuterComm]errNo[0x0000000005000006] tag[HcomAllReduce_6629421139219749105_8], created commOuter fail. commOuter[0] is null
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.771 [hccl_impl.cc:3675][hccl-64810-2-1685711347-4-7943117906144662061][0][Create][CommByAlg]CreateInnerComm [0] or CreateOuterComm[6] failed. commType[2]
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.808 [hccl_impl.cc:3542][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.604 [hccl_impl.cc:1301][hccl-64810-2-1685711347-4-7943117906144662061][0][HcclImpl][AllReduce]errNo[0x0000000005000004]  tag[HcomAllReduce_6629421139219749105_8],all reduce create comm failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.623 [hccl_comm.cc:290][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.641 [hcom.cc:575][hccl-64810-2-1685711347-4-7943117906144662061][0][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_8], input_ptr[0x1241e6128a00], output_ptr[0x1241e6129000], count[128], data_type[2], op[0]
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.659 [hcom_ops_kernel_info_store.cc:766][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.708 [hcom_ops_kernel_info_store.cc:279][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.722 [hcom_ops_kernel_info_store.cc:1791][hccl-64810-2-1685711347-4-7943117906144662061][0][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)
[CRITICAL] GE(64810,ffff8df40cf0,python):2023-06-02-13:38:18.030.405 [mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122] Distribute] davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal.    2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster.    3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
        Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
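As a minimal sketch of the environment-variable route suggested by the EI0006 message above (the 1200 s value mirrors the "wait all socket establish 1200 second" timeout already visible in this log; whether a longer timeout actually helps here is an assumption, and the variable must be set identically on every node before launch):

```shell
# Raise the HCCL socket-establishment timeout from the 120 s default
# described in the EI0006 message; export it on every node, before the
# training processes start.
export HCCL_CONNECT_TIMEOUT=1200
echo "HCCL_CONNECT_TIMEOUT=$HCCL_CONNECT_TIMEOUT"
```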
[CRITICAL] DEVICE(64810,ffff8df40cf0,python):2023-06-02-13:40:53.843.711 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592] LoadTask] Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130

[CRITICAL] DEVICE(64810,ffff8df40cf0,python):2023-06-02-13:40:53.844.690 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244] PreprocessBeforeRunGraph] Preprocess failed before run graph 1.
Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130

[TRACE] HCCL(64810,python):2023-06-02-13:41:02.710.607 [status:stop] [hcom.cc:452][hccl-64810-2-1685711347-4-7943117906144662061][0]hcom destroy complete,take time [678041]us, rankNum[32], rank[0]
Traceback (most recent call last):
  File "/home/jenkins/workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/test_ms_pangu_alpha_ascend_train_32p_0001/train.py", line 557, in <module>
    run_train(opt)
  File "/home/jenkins/workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/test_ms_pangu_alpha_ascend_train_32p_0001/train.py", line 254, in run_train
    model.train(actual_epoch_num, ds, callbacks=callback, sink_size=args_opt.sink_size, dataset_sink_mode=True)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1048, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 102, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 600, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 684, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 626, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 945, in compile_and_run
    self.compile(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 923, in compile
    jit_config_dict=self._jit_config_dict, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1401, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Preprocess failed before run graph 1.

----------------------------------------------------
- Framework Error Message: (For framework developers)
----------------------------------------------------
Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130


----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244 PreprocessBeforeRunGraph
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592 LoadTask
mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122 Distribute

Special notes for this issue (Optional)

Route this to huangxinjing.

Comments (6)

sunjiawei999 created the Bug-Report
sunjiawei999 added the attr/function label
sunjiawei999 added the kind/bug label
sunjiawei999 added the stage/func-debug label
sunjiawei999 added the v2.1.0 label
sunjiawei999 added the sig/parallel label

Please assign a maintainer to check this issue.
@sunjiawei999

Please add labels (comp or sig); you can also visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get code reviewed quickly, tag the Pull Request with a component (comp) or special-interest-group (sig) label; labeled PRs are routed directly to an owner for review.
More labels are listed at https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md
For example, for a change to the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code:
//sig/data
You can additionally mark the PR type, e.g. bugfix or feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands; go ahead and add them in a comment below!

sunjiawei999 edited the title
sunjiawei999 edited the description
zhongjicheng changed the assignee from zhongjicheng to huangxinjing
wuweikang changed the assignee from huangxinjing to wangshengnan123
wuweikang added collaborator huangxinjing
wuweikang changed the assignee from wangshengnan123 to kisnwang

The error suggests an environment problem; please make sure the environment is OK, then verify again.

Same test cases as #I7H95P ([ST][MS][NET][pangu moe+alltoall/pangu moe+pipeline+alltoall][910 32p] network train failed); verified to run through with no errors.
Problem: the pangu_moe model reported a communication error when alltoall was enabled.
Root cause: the CANN package version was too old.
Fix: with a CANN package and driver from June 29 or later, training runs through and the loss is normal.
Self-verification results:
time: Thu Jul 6 20:15:04 2023 local_rank: 0, epoch: 0, step: 2, loss is 10.600865364074707, overflow is True, loss scale is 1073741824.0
Train epoch time: 101803.830 ms, per step time: 50901.915 ms
time: Thu Jul 6 20:15:04 2023 local_rank: 0, epoch: 0, step: 4, loss is 10.601276397705078, overflow is True, loss scale is 268435456.0
Train epoch time: 205.847 ms, per step time: 102.924 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 6, loss is 10.602696418762207, overflow is True, loss scale is 67108864.0
Train epoch time: 174.017 ms, per step time: 87.008 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 8, loss is 10.601007461547852, overflow is True, loss scale is 16777216.0
Train epoch time: 163.112 ms, per step time: 81.556 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 10, loss is 10.59855842590332, overflow is True, loss scale is 4194304.0
Train epoch time: 162.162 ms, per step time: 81.081 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 12, loss is 10.602463722229004, overflow is False, loss scale is 2097152.0
Train epoch time: 162.915 ms, per step time: 81.458 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 14, loss is 10.602855682373047, overflow is False, loss scale is 2097152.0
Train epoch time: 163.778 ms, per step time: 81.889 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 16, loss is 10.601556777954102, overflow is False, loss scale is 2097152.0
Train epoch time: 162.952 ms, per step time: 81.476 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 18, loss is 10.601717948913574, overflow is False, loss scale is 2097152.0
Train epoch time: 163.209 ms, per step time: 81.605 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 20, loss is 10.600467681884766, overflow is False, loss scale is 2097152.0
Train epoch time: 163.583 ms, per step time: 81.792 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 22, loss is 10.602459907531738, overflow is False, loss scale is 2097152.0
Train epoch time: 163.259 ms, per step time: 81.630 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 24, loss is 10.600919723510742, overflow is False, loss scale is 2097152.0
Train epoch time: 162.826 ms, per step time: 81.413 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 26, loss is 10.600872993469238, overflow is False, loss scale is 2097152.0
Train epoch time: 163.225 ms, per step time: 81.613 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 28, loss is 10.601860046386719, overflow is False, loss scale is 2097152.0
Train epoch time: 166.621 ms, per step time: 83.311 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 30, loss is 10.599332809448242, overflow is False, loss scale is 2097152.0

kisnwang added the rct/cann label
kisnwang changed the task status from TODO to WIP
fangwenyi added the v2.2.0 label
yao_yf changed the task status from WIP to VALIDATION
yao_yf changed the milestone from B-SIG-Parallel to B-SolutionTest

The CANN package issue has been verified; still to be re-tested against the latest CANN package.

kisnwang removed the attr/function label
kisnwang added the ctl/versionbuild label
kisnwang added the rca/others label
kisnwang added collaborator kisnwang
kisnwang changed the assignee from kisnwang to zhongjicheng
kisnwang removed collaborator zhongjicheng
zhongjicheng changed the assignee from zhongjicheng to sunjiawei999

Regression version: r2.2.0_master_20230927_44c3a76a

Build date: 20230927

Regression steps: follow the reproduction steps in this issue

Basic function: the issue is resolved

Test conclusion: regression passed

Regression tester: Sun Jiawei

i-robot added the foruda label
sunjiawei999 changed the task status from VALIDATION to DONE
