MindSpore / mindspore

[ST][MS][master][pangu series][ascend][multi-node] Network training failed

DONE
Bug-Report
Created on 2023-06-06 20:11
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

[master][pangu_alpha][gpu][8p] Network training failed
https://e.gitee.com/mind_spore/repos/mindspore/models/tree/master/official/nlp/Pangu_alpha

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device /ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

Pass version:
master_20230521222838_e260d4c8
Problem MindSpore version:
_r2.1_master_20230602121730_cf123905
Run package:
HISI_C30/20230518

  • Execution Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test-case repo path: solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/
Test cases:
ms_pangu_alpha_ascend_train_32p_0001
ms_pangu_moe_alltoall_train_check_loss_910_32p_0001
ms_pangu_moe_pipline_alltoall_train_check_loss_910_32p_0001
ms_pangu_alpha_pipeline_ascend_train_32p_0001

Steps to reproduce the issue (Mandatory)

  1. get code from models
  2. cd /official/nlp/pangu_alpha
  3. bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MOD
  4. The pangu network runs normally and performance reaches 6500
  5. Verify whether the network training succeeded
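To make step 3 concrete, the placeholder arguments can be spelled out as shell variables. All of the values below are hypothetical examples for illustration only; they are not taken from this report and must be adapted to the actual cluster and dataset:

```shell
# Hypothetical argument values for scripts/run_distributed_train_gpu.sh;
# every value here is an assumed placeholder.
RANK_SIZE=8                    # total number of devices
HOSTFILE=/path/to/hostfile     # hostfile listing the participating nodes
DATASET=/path/to/dataset       # training data directory
PER_BATCH=16                   # per-device batch size
MOD=2.6B                       # model configuration to train
echo "bash scripts/run_distributed_train_gpu.sh $RANK_SIZE $HOSTFILE $DATASET $PER_BATCH $MOD"
```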

Describe the expected behavior (Mandatory)

The pangu_alpha network runs normally and performance reaches 6500.

Related log / screenshot (Mandatory)


[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.679 [exchanger_network.cc:280][64810][136514][Wait][AllServerSocketEstab]errNo[0x000000000500000b] server : device[0] rank[0] get socket timeout, total[4] remain[1]
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.735 [exchanger_network.cc:344][64810][136514]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.749 [exchanger_network.cc:107][64810][136514][ExchangerNetwork][Init]rank[0] device[0] wait all socket establish 1200 second failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.777 [exchanger_network.cc:686][64810][136514]Some NPUs get socket timeout, the details are as follows:
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.791 [exchanger_network.cc:687][64810][136514]   _____________________________________________________
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.802 [exchanger_network.cc:688][64810][136514]   |device[0] userrank[0] exchanger Status: run_step[2]|
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.812 [exchanger_network.cc:689][64810][136514]   |  dest_dev  |  userrank  |    Role    | connStatus |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.836 [exchanger_network.cc:690][64810][136514]   |------------|------------|------------|------------|
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.855 [exchanger_network.cc:709][64810][136514]   |         0  |         0  |     NA     |     NA     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.867 [exchanger_network.cc:709][64810][136514]   |         1  |         1  |   server   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.877 [exchanger_network.cc:709][64810][136514]   |         2  |         2  |   client   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.886 [exchanger_network.cc:709][64810][136514]   |         3  |         3  |   server   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.910 [exchanger_network.cc:709][64810][136514]   |         4  |         4  |   client   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.919 [exchanger_network.cc:709][64810][136514]   |         5  |         5  |   server   |     NO     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.928 [exchanger_network.cc:709][64810][136514]   |         6  |         6  |   client   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.936 [exchanger_network.cc:709][64810][136514]   |         7  |         7  |   server   |     YES     |
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.945 [exchanger_network.cc:714][64810][136514]   ___________________________________________________________________
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.954 [exchanger_network.cc:715][64810][136514]the connection failure between this device and the target device may be due to the following reasons:
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.963 [exchanger_network.cc:716][64810][136514]1. the connection between this device and the target device is abnormal.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.972 [exchanger_network.cc:717][64810][136514]2. an exception occurred at the target devices.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.982 [exchanger_network.cc:719][64810][136514]3. the time difference between the execution of hcom on this device and the target device exceeds the timeout threshold. make sure this by keyworld [Entry-]
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.895.992 [exchanger_network.cc:721][64810][136514]4. the behavior of executing the calculation graph on this device and the target device is inconsistent.
[ERROR] HCCL(64810,python):2023-06-02-13:38:17.896.004 [comm_factory.cc:1470][64810][136514][Get][ExchangerNetwork]exchanger init failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.621 [comm_factory.cc:223][64810][136514][Create][CommOuter]exchangerNetwork create failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.652 [hccl_impl.cc:3820][64810][136514][Create][OuterComm]errNo[0x0000000005000006] tag[HcomAllReduce_6629421139219749105_8], created commOuter fail. commOuter[0] is null
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.771 [hccl_impl.cc:3675][hccl-64810-2-1685711347-4-7943117906144662061][0][Create][CommByAlg]CreateInnerComm [0] or CreateOuterComm[6] failed. commType[2]
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.027.808 [hccl_impl.cc:3542][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.604 [hccl_impl.cc:1301][hccl-64810-2-1685711347-4-7943117906144662061][0][HcclImpl][AllReduce]errNo[0x0000000005000004]  tag[HcomAllReduce_6629421139219749105_8],all reduce create comm failed
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.623 [hccl_comm.cc:290][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.641 [hcom.cc:575][hccl-64810-2-1685711347-4-7943117906144662061][0][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_8], input_ptr[0x1241e6128a00], output_ptr[0x1241e6129000], count[128], data_type[2], op[0]
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.659 [hcom_ops_kernel_info_store.cc:766][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.708 [hcom_ops_kernel_info_store.cc:279][hccl-64810-2-1685711347-4-7943117906144662061][0]call trace: hcclRet -> 4
[ERROR] HCCL(64810,python):2023-06-02-13:38:18.029.722 [hcom_ops_kernel_info_store.cc:1791][hccl-64810-2-1685711347-4-7943117906144662061][0][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)
[CRITICAL] GE(64810,ffff8df40cf0,python):2023-06-02-13:38:18.030.405 [mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122] Distribute] davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. some NPUs in the cluster are abnormal.    2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster.    3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
        Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.)

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
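As a minimal sketch of the environment-variable route suggested by the EI0006 message above (the 1200 s value mirrors the "wait all socket establish 1200 second" timeout already visible in this log; whether a longer timeout actually helps here is an assumption, and the variable must be set identically on every node before launch):

```shell
# Raise the HCCL socket-establishment timeout from the 120 s default
# described in the EI0006 message; export it on every node, before the
# training processes start.
export HCCL_CONNECT_TIMEOUT=1200
echo "HCCL_CONNECT_TIMEOUT=$HCCL_CONNECT_TIMEOUT"
```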
[CRITICAL] DEVICE(64810,ffff8df40cf0,python):2023-06-02-13:40:53.843.711 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592] LoadTask] Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130

[CRITICAL] DEVICE(64810,ffff8df40cf0,python):2023-06-02-13:40:53.844.690 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244] PreprocessBeforeRunGraph] Preprocess failed before run graph 1.
Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130

[TRACE] HCCL(64810,python):2023-06-02-13:41:02.710.607 [status:stop] [hcom.cc:452][hccl-64810-2-1685711347-4-7943117906144662061][0]hcom destroy complete,take time [678041]us, rankNum[32], rank[0]
Traceback (most recent call last):
  File "/home/jenkins/workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/test_ms_pangu_alpha_ascend_train_32p_0001/train.py", line 557, in <module>
    run_train(opt)
  File "/home/jenkins/workspace/TDT_deployment/solution_test/remaining/test_scripts/mindspore/net/pangu_alpha/network/test_ms_pangu_alpha_ascend_train_32p_0001/train.py", line 254, in run_train
    model.train(actual_epoch_num, ds, callbacks=callback, sink_size=args_opt.sink_size, dataset_sink_mode=True)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1048, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 102, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 600, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 684, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 626, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 945, in compile_and_run
    self.compile(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 923, in compile
    jit_config_dict=self._jit_config_dict, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1401, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Preprocess failed before run graph 1.

----------------------------------------------------
- Framework Error Message: (For framework developers)
----------------------------------------------------
Distribute Task Failed,
error msg: davinci_model : load task fail, return ret: 1343225860, the task is Default/network-PanguAlphaTrainOneStepWithLossScaleCell/AllReduce-op19888_4645130


----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244 PreprocessBeforeRunGraph
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:592 LoadTask
mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/task/hccl_task.cc:122 Distribute

Special notes for this issue (Optional)

Route this to huangxinjing.

Comments (6)

sunjiawei999 created the Bug-Report
sunjiawei999 added the attr/function label
sunjiawei999 added the kind/bug label
sunjiawei999 added the stage/func-debug label
sunjiawei999 added the v2.1.0 label
sunjiawei999 added the sig/parallel label

Please assign a maintainer to check this issue.
@sunjiawei999

Please add labels (comp or sig); you can also visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get code reviewed quickly, tag the Pull Request with a component (comp) or special-interest-group (sig) label; labeled PRs are routed directly to an owner for review.
More labels are listed at https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md
For example, for a change to the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code:
//sig/data
You can additionally mark the PR type, e.g. bugfix or feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands; go ahead and add them in a comment below!

sunjiawei999 edited the title
sunjiawei999 edited the description
zhongjicheng changed the assignee from zhongjicheng to huangxinjing
wuweikang changed the assignee from huangxinjing to wangshengnan123
wuweikang added collaborator huangxinjing
wuweikang changed the assignee from wangshengnan123 to kisnwang

The error suggests an environment problem; please make sure the environment is OK, then verify again.

Same test cases as #I7H95P ([ST][MS][NET][pangu moe+alltoall/pangu moe+pipeline+alltoall][910 32p] network train failed); verified to run through with no errors.
Problem: the pangu_moe model reported a communication error when alltoall was enabled.
Root cause: the CANN package version was too old.
Fix: with a CANN package and driver from June 29 or later, training runs through and the loss is normal.
Self-verification results:
time: Thu Jul 6 20:15:04 2023 local_rank: 0, epoch: 0, step: 2, loss is 10.600865364074707, overflow is True, loss scale is 1073741824.0
Train epoch time: 101803.830 ms, per step time: 50901.915 ms
time: Thu Jul 6 20:15:04 2023 local_rank: 0, epoch: 0, step: 4, loss is 10.601276397705078, overflow is True, loss scale is 268435456.0
Train epoch time: 205.847 ms, per step time: 102.924 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 6, loss is 10.602696418762207, overflow is True, loss scale is 67108864.0
Train epoch time: 174.017 ms, per step time: 87.008 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 8, loss is 10.601007461547852, overflow is True, loss scale is 16777216.0
Train epoch time: 163.112 ms, per step time: 81.556 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 10, loss is 10.59855842590332, overflow is True, loss scale is 4194304.0
Train epoch time: 162.162 ms, per step time: 81.081 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 12, loss is 10.602463722229004, overflow is False, loss scale is 2097152.0
Train epoch time: 162.915 ms, per step time: 81.458 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 14, loss is 10.602855682373047, overflow is False, loss scale is 2097152.0
Train epoch time: 163.778 ms, per step time: 81.889 ms
time: Thu Jul 6 20:15:05 2023 local_rank: 0, epoch: 0, step: 16, loss is 10.601556777954102, overflow is False, loss scale is 2097152.0
Train epoch time: 162.952 ms, per step time: 81.476 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 18, loss is 10.601717948913574, overflow is False, loss scale is 2097152.0
Train epoch time: 163.209 ms, per step time: 81.605 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 20, loss is 10.600467681884766, overflow is False, loss scale is 2097152.0
Train epoch time: 163.583 ms, per step time: 81.792 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 22, loss is 10.602459907531738, overflow is False, loss scale is 2097152.0
Train epoch time: 163.259 ms, per step time: 81.630 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 24, loss is 10.600919723510742, overflow is False, loss scale is 2097152.0
Train epoch time: 162.826 ms, per step time: 81.413 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 26, loss is 10.600872993469238, overflow is False, loss scale is 2097152.0
Train epoch time: 163.225 ms, per step time: 81.613 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 28, loss is 10.601860046386719, overflow is False, loss scale is 2097152.0
Train epoch time: 166.621 ms, per step time: 83.311 ms
time: Thu Jul 6 20:15:06 2023 local_rank: 0, epoch: 0, step: 30, loss is 10.599332809448242, overflow is False, loss scale is 2097152.0

kisnwang added the rct/cann label
kisnwang changed the task status from TODO to WIP
fangwenyi added the v2.2.0 label
yao_yf changed the task status from WIP to VALIDATION
yao_yf changed the milestone from B-SIG-Parallel to B-SolutionTest

The CANN package issue has been verified; still to be re-tested against the latest CANN package.

kisnwang removed the attr/function label
kisnwang added the ctl/versionbuild label
kisnwang added the rca/others label
kisnwang added collaborator kisnwang
kisnwang changed the assignee from kisnwang to zhongjicheng
kisnwang removed collaborator zhongjicheng
zhongjicheng changed the assignee from zhongjicheng to sunjiawei999

Regression version: r2.2.0_master_20230927_44c3a76a

Build date: 20230927

Regression steps: follow the reproduction steps in this issue

Basic function: the issue is resolved

Test conclusion: regression passed

Regression tester: Sun Jiawei

i-robot added the foruda label
sunjiawei999 changed the task status from VALIDATION to DONE
