| name | about | labels |
| --- | --- | --- |
| Bug Report | Use this template for reporting a bug | kind/bug |
[master][pangu_alpha][gpu][8p] Network training failed
https://e.gitee.com/mind_spore/repos/mindspore/models/tree/master/official/nlp/Pangu_alpha
Hardware Environment (Ascend/GPU/CPU): Ascend (/device /ascend/)
Pass version: master_20230521222838_e260d4c8
MindSpore version: r2.1_master_20230602121730_cf123905
Run package: HISI_C30/20230518
Execution Mode (PyNative/Graph): Graph (/mode graph)
Test case repository path: solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/
Test cases:
ms_pangu_alpha_ascend_train_32p_0001
ms_pangu_moe_alltoall_train_check_loss_910_32p_0001
ms_pangu_moe_pipline_alltoall_train_check_loss_910_32p_0001
ms_pangu_alpha_pipeline_ascend_train_32p_0001
Expected result: the pangu_alpha network runs normally, with performance reaching 6500.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.679 [exchanger_network.cc:280][64810][136514][Wait][AllServerSocketEstab]errNo[0x000000000500000b] server : device[0] rank[0] get socket timeout, total[4] remain[1]
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.735 [exchanger_network.cc:344][64810][136514]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.749 [exchanger_network.cc:107][64810][136514][ExchangerNetwork][Init]rank[0] device[0] wait all socket establish 1200 second failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.777 [exchanger_network.cc:686][64810][136514]Some NPUs get socket timeout, the details are as follows:
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.791 [exchanger_network.cc:687][64810][136514] _____________________________________________________
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.802 [exchanger_network.cc:688][64810][136514] |device[0] userrank[0] exchanger Status: run_step[2]|
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.812 [exchanger_network.cc:689][64810][136514] | dest_dev | userrank | Role | connStatus |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.836 [exchanger_network.cc:690][64810][136514] |------------|------------|------------|------------|
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.855 [exchanger_network.cc:709][64810][136514] | 0 | 0 | NA | NA |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.867 [exchanger_network.cc:709][64810][136514] | 1 | 1 | server | YES |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.877 [exchanger_network.cc:709][64810][136514] | 2 | 2 | client | YES |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.886 [exchanger_network.cc:709][64810][136514] | 3 | 3 | server | YES |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.910 [exchanger_network.cc:709][64810][136514] | 4 | 4 | client | YES |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.919 [exchanger_network.cc:709][64810][136514] | 5 | 5 | server | NO |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.928 [exchanger_network.cc:709][64810][136514] | 6 | 6 | client | YES |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.936 [exchanger_network.cc:709][64810][136514] | 7 | 7 | server | YES |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.945 [exchanger_network.cc:714][64810][136514] ___________________________________________________________________
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.954 [exchanger_network.cc:715][64810][136514]the connection failure between this device and the target device may be due to the following reasons:
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.963 [exchanger_network.cc:716][64810][136514]1. the connection between this device and the target device is abnormal.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.972 [exchanger_network.cc:717][64810][136514]2. an exception occurred at the target devices.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.982 [exchanger_network.cc:719][64810][136514]3. the time difference between the execution of hcom on this device and the target device exceeds the timeout threshold. make sure this by keyworld [Entry-]
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.992 [exchanger_network.cc:721][64810][136514]4. the behavior of executing the calculation graph on this device and the target device is inconsistent.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.896.004 [comm_factory.cc:1470][64810][136514][Get][ExchangerNetwork]exchanger init failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.621 [comm_factory.cc:223][64810][136514][Create][CommOuter]exchangerNetwork create failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.652 [hccl_impl.cc:3820][64810][136514][Create][OuterComm]errNo[0x0000000005000006] tag[HcomAllReduce_6629421139219749105_8], created commOuter fail. commOuter[0] is null
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.771 [hccl_impl.cc:3675][hccl-64810-2-1685711347-4-7943117906144662061][0][Create][CommByAlg]CreateInnerComm [0] or CreateOuterComm[6] failed. commType[2]
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.808 [hccl_impl.cc:3542][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.604 [hccl_impl.cc:1301][hccl-64810-2-1685711347-4-7943117906144662061][0][HcclImpl][AllReduce]errNo[0x0000000005000004] tag[HcomAllReduce_6629421139219749105_8],all reduce create comm failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.623 [hccl_comm.cc:290][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.641 [hcom.cc:575][hccl-64810-2-1685711347-4-7943117906144662061][0][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_8], input_ptr[0x1241e6128a00], output_ptr[0x1241e6129000], count[128], data_type[2], op[0]
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.659 [hcom_ops_kernel_info_store.cc:766][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.708 [hcom_ops_kernel_info_store.cc:279][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.722 [hcom_ops_kernel_info_store.cc:1791][hccl-64810-2-1685711347-4-7943117906144662061][0][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)
[CRITICAL] GE(64810,ffff8df40cf0,python):2023-06-02-13:38:18.030.405 [mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122] Distribute] davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
[CRITICAL] DEVICE(64810,ffff8df40cf0,python):2023-06-02-13:40:53.843.711 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592] LoadTask] Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122 Distribute
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
[CRITICAL] DEVICE(64810,ffff8df40cf0,python):2023-06-02-13:40:53.844.690 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244] PreprocessBeforeRunGraph] Preprocess failed before run graph 1.
Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592 LoadTask
mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122 Distribute
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
[TRACE] HCCL(64810,python):2023-06-02-13:41:02.710.607 [status:stop] [hcom.cc:452][hccl-64810-2-1685711347-4-7943117906144662061][0]hcom destroy complete,take time [678041]us, rankNum[32], rank[0]
Traceback (most recent call last):
File "/home/jenkins/workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/test_ms_pangu_alpha_ascend_train_32p_0001/train.py", line 557, in <module>
run_train(opt)
File "/home/jenkins/workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/test_ms_pangu_alpha_ascend_train_32p_0001/train.py", line 254, in run_train
model.train(actual_epoch_num, ds, callbacks=callback, sink_size=args_opt.sink_size, dataset_sink_mode=True)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1048, in train
initial_epoch=initial_epoch)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 102, in wrapper
func(self, *args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 600, in _train
cb_params, sink_size, initial_epoch, valid_infos)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 684, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 626, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 945, in compile_and_run
self.compile(*args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 923, in compile
jit_config_dict=self._jit_config_dict, *args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1401, in compile
result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Preprocess failed before run graph 1.
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- Framework Error Message: (For framework developers)
----------------------------------------------------
Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244 PreprocessBeforeRunGraph
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592 LoadTask
mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122 Distribute
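The EI0006 solution text above suggests raising HCCL_CONNECT_TIMEOUT when ranks start too far apart in time. A minimal sketch of doing that from the training script, before any communication init; the 600-second value is an arbitrary example, not taken from this issue:

```python
import os

# HCCL reads this variable when it sets up communication groups, so it must
# be exported before mindspore.communication.init() runs (default: 120 s).
os.environ["HCCL_CONNECT_TIMEOUT"] = "600"  # example value, tune per cluster
```

Exporting the variable in the launch shell (before `python train.py`) works equally well; the only requirement is that it is visible to every rank's process before HCCL initializes.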
Routing to 黄信静.
Please assign a maintainer to check this issue.
@sunjiawei999
Please add labels (comp or sig); you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get the code reviewed quickly, please add a component (comp) or interest group (sig) label; labeled PRs are routed directly to the responsible reviewer.
For example, if you are submitting data component code, you can comment:
//comp/data
You can also invite the data SIG to review the code:
//sig/data
You can additionally mark the type, e.g. a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands — go ahead and add them in a comment below!
Judging from the error, this looks like an environment issue; please make sure the environment is healthy, then verify again.
Same test case as #I7H95P ([ST][MS][NET][pangu moe+alltoall/pangu moe+pipeline+alltoall][910 32p] network train failed); verified to run through with no errors.
Problem: the pangu_moe model hits a communication error when alltoall is enabled.
Root cause: the CANN package version was too old.
Solution: with CANN and driver packages dated June 29 or later, training runs through successfully and the loss is normal.
Self-verification results:
time: Thu Jul 6 20:15:04 2023 local_rank: 0, epoch: 0, step: 2, loss is 10.600865364074707, overflow is True, loss scale is 1073741824.0
Train epoch time: 101803.830 ms, per step time: 50901.915 ms
time: Thu Jul 6 20:15:04 2023 local_rank: 0, epoch: 0, step: 4, loss is 10.601276397705078, overflow is True, loss scale is 268435456.0
Train epoch time: 205.847 ms, per step time: 102.924 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 6, loss is 10.602696418762207, overflow is True, loss scale is 67108864.0
Train epoch time: 174.017 ms, per step time: 87.008 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 8, loss is 10.601007461547852, overflow is True, loss scale is 16777216.0
Train epoch time: 163.112 ms, per step time: 81.556 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 10, loss is 10.59855842590332, overflow is True, loss scale is 4194304.0
Train epoch time: 162.162 ms, per step time: 81.081 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 12, loss is 10.602463722229004, overflow is False, loss scale is 2097152.0
Train epoch time: 162.915 ms, per step time: 81.458 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 14, loss is 10.602855682373047, overflow is False, loss scale is 2097152.0
Train epoch time: 163.778 ms, per step time: 81.889 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 16, loss is 10.601556777954102, overflow is False, loss scale is 2097152.0
Train epoch time: 162.952 ms, per step time: 81.476 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 18, loss is 10.601717948913574, overflow is False, loss scale is 2097152.0
Train epoch time: 163.209 ms, per step time: 81.605 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 20, loss is 10.600467681884766, overflow is False, loss scale is 2097152.0
Train epoch time: 163.583 ms, per step time: 81.792 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 22, loss is 10.602459907531738, overflow is False, loss scale is 2097152.0
Train epoch time: 163.259 ms, per step time: 81.630 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 24, loss is 10.600919723510742, overflow is False, loss scale is 2097152.0
Train epoch time: 162.826 ms, per step time: 81.413 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 26, loss is 10.600872993469238, overflow is False, loss scale is 2097152.0
Train epoch time: 163.225 ms, per step time: 81.613 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 28, loss is 10.601860046386719, overflow is False, loss scale is 2097152.0
Train epoch time: 166.621 ms, per step time: 83.311 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 30, loss is 10.599332809448242, overflow is False, loss scale is 2097152.0
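As a quick sanity check on logs like the ones above, the "loss is" lines can be parsed to confirm the loss stays in a plausible range. A sketch; the 11.0 upper bound is an arbitrary assumption for this network, not a value from the issue:

```python
import re

# Matches the loss value in lines like:
#   "... step: 2, loss is 10.600865364074707, overflow is True ..."
LOSS_RE = re.compile(r"loss is ([0-9.]+)")

def extract_losses(log_text):
    """Pull every 'loss is X' value out of a training log."""
    return [float(m.group(1)) for m in LOSS_RE.finditer(log_text)]

def losses_look_normal(losses, max_loss=11.0):
    """True if at least one step was logged and no loss exceeds the bound."""
    return bool(losses) and all(l <= max_loss for l in losses)
```

A NaN or inf loss would simply fail to match the numeric pattern, so an unexpectedly short result list is itself a warning sign.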
The CANN package issue has been verified; still to be re-tested against the latest CANN package.
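Since the fix hinges on the CANN package date, it may help to confirm the installed toolkit version programmatically before re-running. A sketch; the install path below is the typical default location and an assumption for this environment:

```python
from pathlib import Path

def cann_version(install_dir="/usr/local/Ascend/ascend-toolkit/latest"):
    """Return the Version= field from the toolkit's version.info, or None.

    The default install_dir is the common CANN install prefix; adjust it
    for non-default installations.
    """
    info = Path(install_dir) / "version.info"
    if not info.is_file():
        return None
    for line in info.read_text().splitlines():
        if line.lower().startswith("version="):
            return line.split("=", 1)[1].strip()
    return None
```

The driver version can be read the same way from its own version.info under the driver install directory.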
Regression version: r2.2.0_master_20230927_44c3a76a
Build date: 20230927
Regression steps: follow the reproduction steps in this issue.
Basic functionality: issue resolved.
Test conclusion: regression passed.
Regression tester: 孙佳伟