Appearance & Root Cause

name	about	labels
Bug Report	Use this template for reporting a bug	kind/bug

Describe the current behavior (Mandatory)

crnn_vgg7 does not reach the target accuracy when trained on data_lmdb_release with 8 devices; the loss becomes NaN.

Environment (Mandatory)

Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend910B

Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx): 2.3.0
-- Python version (e.g., Python 3.7.5): 3.7
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Run package: Milan_C17/20240321
MindSpore: Daily/Version/202403/0329/mindspore/r2.3.q1_20240329061516_c99698ba26958cd790921dd56daa7145f334b4ef/
Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

test_ms_crnn_vgg7_data_lmdb_release_train_infer_910_gpu_8p_0001.py

Steps to reproduce the issue (Mandatory)

cd solution_test/cases/02network/00cv/crnn/train/
pytest -s test_ms_crnn_vgg7_data_lmdb_release_train_infer_910_gpu_8p_0001.py

Describe the expected behavior (Mandatory)

Training succeeds and accuracy meets the target.

Related log / screenshot (Mandatory)

[Screenshot: image not preserved]

Special notes for this issue (Optional)

Assign to 詹霑.

The problem occurs with O3 on B090.

The problem occurs with O0 on B100.

Appearance & Root Cause

The nn.LSTM operator has a bug when its input is FP32, which causes the loss to converge abnormally.

Fix Solution

Convert the nn.LSTM layer in CRNN, together with its input, to FP16 for computation.
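
Since this workaround runs the LSTM in FP16, it may help to keep the limits of that format in mind when reading accuracy results. The sketch below uses only Python's standard `struct` module (not MindSpore) to round-trip values through IEEE 754 half precision. The helper names `to_fp16` and `fits_fp16` are illustrative and are not part of mindocr.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float."""
    # The "e" format code packs and unpacks IEEE 754 binary16.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_fp16(value: float) -> bool:
    """Return True if the value can be packed as FP16 without overflowing."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


print(to_fp16(0.1))         # rounded to the nearest representable FP16 value
print(fits_fp16(FP16_MAX))  # the largest finite value still fits
print(fits_fp16(70000.0))   # beyond the FP16 range
```

Values that round noticeably or exceed the FP16 range are one reason a cast like this is usually followed by an accuracy check.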

Fix Description & Test Suggestion

--- a/mindocr/models/necks/rnn.py
+++ b/mindocr/models/necks/rnn.py
@@ -1,5 +1,5 @@
 from typing import List, Optional
-
+import mindspore as ms
 import numpy as np

 from mindspore import Tensor, nn, ops
@@ -37,6 +37,7 @@ class RNNEncoder(nn.Cell):
                                    has_bias=True,
                                    dropout=0.,
                                    bidirectional=True)
+        self.seq_encoder.to_float(ms.float16)

         self.hx = None
         if batch_size is not None:
@@ -46,6 +47,7 @@ class RNNEncoder(nn.Cell):

     def construct(self, features: List[Tensor]) -> Tensor:
         x = features[0]
+        x = x.astype(ms.float16)
         x = ops.squeeze(x, axis=2)  # [N, C, W]
         x = ops.transpose(x, (2, 0, 1))  # [W, N, C]

@@ -53,5 +55,5 @@ class RNNEncoder(nn.Cell):
             x, _ = self.seq_encoder(x)
         else:
             x, _ = self.seq_encoder(x, self.hx)
-
+        x = x.astype(ms.float32)
         return x

Run a full accuracy test and validate on multiple datasets.
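
When rerunning the test, a quick scan of the per-step loss values can flag a recurrence of the NaN early. The following is a minimal sketch, assuming the losses have already been collected as Python floats; the function name `first_non_finite_step` and the sample values are hypothetical and do not come from the test case.

```python
import math
from typing import Iterable, Optional


def first_non_finite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the 0-based index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical loss values, for illustration only.
healthy = [2.31, 1.87, 1.42, 1.05]
broken = [2.31, 1.87, float("nan"), float("nan")]

print(first_non_finite_step(healthy))
print(first_non_finite_step(broken))
```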

Self-test Report & DT Review

[Screenshot: image not preserved]
This is a network (model) issue; it is already covered by the solution tests.

Suggested regression version

Run the full accuracy test and multi-dataset validation on the B100 build.

Regression version:
runpkg_version: Milan_C17/20240402
mindspore: 2.3.0rc1 commit_id = '[sha1]:f6ee75fd,[branch]:(HEAD,origin/r2.3.q1,r2.3.q1)'
Regression steps: modify the mindocr code, then run test case test_ms_crnn_vgg7_data_lmdb_release_train_infer_910_gpu_8p_0001.py, training under O3.

diff --git a/mindocr/models/necks/rnn.py b/mindocr/models/necks/rnn.py
index 629439e..0d576d4 100644
--- a/mindocr/models/necks/rnn.py
+++ b/mindocr/models/necks/rnn.py
@@ -1,5 +1,6 @@
 from typing import List, Optional

+import mindspore as ms
 import numpy as np

 from mindspore import Tensor, nn, ops
@@ -37,6 +38,7 @@ class RNNEncoder(nn.Cell):
                                    has_bias=True,
                                    dropout=0.,
                                    bidirectional=True)
+        self.seq_encoder.to_float(ms.float16)

         self.hx = None
         if batch_size is not None:
@@ -46,6 +48,7 @@ class RNNEncoder(nn.Cell):

     def construct(self, features: List[Tensor]) -> Tensor:
         x = features[0]
+        x = x.astype(ms.float16)
         x = ops.squeeze(x, axis=2)  # [N, C, W]
         x = ops.transpose(x, (2, 0, 1))  # [W, N, C]

@@ -53,5 +56,5 @@ class RNNEncoder(nn.Cell):
             x, _ = self.seq_encoder(x)
         else:
             x, _ = self.seq_encoder(x, self.hx)
-
+        x = x.astype(ms.float32)
         return x

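Basic functionality: training runs normally and accuracy meets the target.
[Screenshot: image not preserved]
Test conclusion: the O3 test case passed regression.
Regression tester: jianyunchao
Regression date: 2024-4-7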

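Regression version:
runpkg_version: Milan_C17/20240406
mindspore: 2.3.0rc1+20240408 commit_id = '[sha1]:754bc563,[branch]:(HEAD,origin/r2.3.q1,r2.3.q1)'
Regression steps: pull the latest mindocr master code, then run test case test_ms_crnn_vgg7_O0_data_lmdb_release_train_infer_910_gpu_8p_0001.py, training under O0.
Basic functionality: training runs normally and accuracy meets the target.
[Screenshot: image not preserved]
Test conclusion: regression passed.
Regression tester: jianyunchao
Regression date: 2024-4-8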

[MDT][MT][910B] crnn_vgg7 does not reach the target accuracy when trained on data_lmdb_release with 8 devices; loss is NaN