[MDT][MT][910B] crnn_vgg7 fails to reach target accuracy in 8-card training on data_lmdb_release; loss is NaN

Status: DONE
Type: Bug-Report
Created: 2024-04-02 15:38

Describe the current behavior (Mandatory)

crnn_vgg7 fails to reach the target accuracy in 8-card training on data_lmdb_release; the loss is NaN.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend910B

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): 2.3.0
    -- Python version (e.g., Python 3.7.5): 3.7
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    run package: Milan_C17/20240321
    mindspore: Daily/Version/202403/0329/mindspore/r2.3.q1_20240329061516_c99698ba26958cd790921dd56daa7145f334b4ef/

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

test_ms_crnn_vgg7_data_lmdb_release_train_infer_910_gpu_8p_0001.py

Steps to reproduce the issue (Mandatory)

  1. cd solution_test/cases/02network/00cv/crnn/train/
  2. pytest -s test_ms_crnn_vgg7_data_lmdb_release_train_infer_910_gpu_8p_0001.py

Describe the expected behavior (Mandatory)

Training succeeds and accuracy meets the target.

Related log / screenshot (Mandatory)

[image]

Special notes for this issue (Optional)

Assign to 詹霑.

Comments (8)

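简云超 created the Bug-Report
简云超 set the priority to Serious
简云超 added the device/ascend label
简云超 added the v2.3.0 label
简云超 set the related repository to MindSpore/mindspore
简云超 set the milestone to B-SIG-Kit
简云超 set the related branch to r2.3
简云超 changed the related branch from r2.3 to r2.3.q1
简云超 changed the assignee from Shawny to zhanzhan
简云超 added Shawny as a collaborator
简云超 added the kind/bug label
简云超 added the sig/toolkit label
简云超 added the attr/accuracy label
wangbixing changed the title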

O3: the problem occurs on B090.

O0: the problem occurs on B100.

简云超 edited the description

Appearance & Root Cause

The nn.LSTM operator has a bug when its input is FP32, which causes the loss to converge abnormally.
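
When triaging a run like this, it can help to locate the first training step at which the logged loss stops being finite. The following is a minimal sketch in plain Python; `first_non_finite` and the sample loss values are hypothetical illustrations and are not part of MindSpore, mindocr, or the test case above.

```python
import math
from typing import Iterable, Optional


def first_non_finite(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-step loss values, for illustration only.
sample_losses = [2.31, 1.87, float("nan"), float("nan")]
bad_step = first_non_finite(sample_losses)
```

Here `bad_step` is the zero-based index of the first non-finite entry, which narrows down where to start comparing intermediate outputs.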

Fix Solution

Cast the nn.LSTM layer in CRNN and its input to FP16 for computation.

Fix Description & Test Suggestion

--- a/mindocr/models/necks/rnn.py
+++ b/mindocr/models/necks/rnn.py
@@ -1,5 +1,5 @@
 from typing import List, Optional
-
+import mindspore as ms
 import numpy as np

 from mindspore import Tensor, nn, ops
@@ -37,6 +37,7 @@ class RNNEncoder(nn.Cell):
                                    has_bias=True,
                                    dropout=0.,
                                    bidirectional=True)
+        self.seq_encoder.to_float(ms.float16)

         self.hx = None
         if batch_size is not None:
@@ -46,6 +47,7 @@ class RNNEncoder(nn.Cell):

     def construct(self, features: List[Tensor]) -> Tensor:
         x = features[0]
+        x = x.astype(ms.float16)
         x = ops.squeeze(x, axis=2)  # [N, C, W]
         x = ops.transpose(x, (2, 0, 1))  # [W, N, C]

@@ -53,5 +55,5 @@ class RNNEncoder(nn.Cell):
             x, _ = self.seq_encoder(x)
         else:
             x, _ = self.seq_encoder(x, self.hx)
-
+        x = x.astype(ms.float32)
         return x

Run the full accuracy tests and validate on multiple datasets.
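
The patch above moves the LSTM to FP16, which has less precision and a narrower range than FP32; that is presumably one reason broad accuracy testing is suggested. As a rough illustration only, the sketch below uses the Python standard library to round-trip values through IEEE 754 half precision. `to_fp16` and `fits_fp16` are hypothetical helpers, not MindSpore APIs, and they only approximate what a device-side cast does.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs a float as a 2-byte half-precision number.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_fp16(value: float) -> bool:
    """Return True if packing the value as FP16 does not overflow."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


rounded = to_fp16(0.1)            # close to, but not exactly, 0.1
error = abs(rounded - 0.1)        # rounding error introduced by FP16
largest_ok = fits_fp16(65504.0)   # 65504 is the largest finite FP16 value
too_large = fits_fp16(70000.0)    # beyond the FP16 range
```

Activations that stay within the FP16 range lose only a small relative amount of precision, while values beyond it cannot be represented at all, so checking accuracy on more than one dataset is a reasonable safeguard.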

Self-test Report & DT Review

[image]
This is a network (model) issue; it is already covered by the solution tests.

i-robot added the foruda label

Suggested regression version

Run the full accuracy tests and multi-dataset validation on the B100 build.

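zhanzhan removed the foruda label
zhanzhan added the ctl/solutiontest label
zhanzhan added the rca/others label
zhanzhan added the rct/oldrelease label
zhanzhan changed the milestone from B-SIG-Kit to B-MDTest
zhanzhan changed the task status from TODO to VALIDATION
wangbixing changed the assignee from zhanzhan to 简云超
wangbixing added zhanzhan as a collaborator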

Regression version:
runpkg_version: Milan_C17/20240402
mindspore: 2.3.0rc1 commit_id = '[sha1]:f6ee75fd,[branch]:(HEAD,origin/r2.3.q1,r2.3.q1)'
Regression steps: modify the mindocr code as in the diff that follows, then run the test case test_ms_crnn_vgg7_data_lmdb_release_train_infer_910_gpu_8p_0001.py and train under O3.

diff --git a/mindocr/models/necks/rnn.py b/mindocr/models/necks/rnn.py
index 629439e..0d576d4 100644
--- a/mindocr/models/necks/rnn.py
+++ b/mindocr/models/necks/rnn.py
@@ -1,5 +1,6 @@
 from typing import List, Optional

+import mindspore as ms
 import numpy as np

 from mindspore import Tensor, nn, ops
@@ -37,6 +38,7 @@ class RNNEncoder(nn.Cell):
                                    has_bias=True,
                                    dropout=0.,
                                    bidirectional=True)
+        self.seq_encoder.to_float(ms.float16)

         self.hx = None
         if batch_size is not None:
@@ -46,6 +48,7 @@ class RNNEncoder(nn.Cell):

     def construct(self, features: List[Tensor]) -> Tensor:
         x = features[0]
+        x = x.astype(ms.float16)
         x = ops.squeeze(x, axis=2)  # [N, C, W]
         x = ops.transpose(x, (2, 0, 1))  # [W, N, C]

@@ -53,5 +56,5 @@ class RNNEncoder(nn.Cell):
             x, _ = self.seq_encoder(x)
         else:
             x, _ = self.seq_encoder(x, self.hx)
-
+        x = x.astype(ms.float32)
         return x

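Basic functionality: training runs normally and accuracy meets the target
[image]
Test conclusion: the O3 test case passes regression
Regression tester: jianyunchao
Regression date: 2024-4-7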

i-robot added the foruda label

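Regression version:
runpkg_version: Milan_C17/20240406
mindspore: 2.3.0rc1+20240408 commit_id = '[sha1]:754bc563,[branch]:(HEAD,origin/r2.3.q1,r2.3.q1)'
Regression steps: pull the latest mindocr master code, then run the test case test_ms_crnn_vgg7_O0_data_lmdb_release_train_infer_910_gpu_8p_0001.py and train under O0.
Basic functionality: training runs normally and accuracy meets the target
[image]
Test conclusion: regression passed
Regression tester: jianyunchao
Regression date: 2024-4-8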

简云超 changed the task status from VALIDATION to DONE
