[ST][MS][master][NET][resnet18][pynative][910][Intermittent] Network training fails: The pointer[auto_grad_meta_data] is null

TODO
Bug-Report
Created on 2023-07-21 14:17

Describe the current behavior (Mandatory)

[resnet18][pynative][910][Intermittent] Network training fails: The pointer[auto_grad_meta_data] is null
resnet18 network repository: https://gitee.com/mindspore/models/tree/master/official/cv/ResNet

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device Ascend

  • Software Environment (Mandatory):
    -- MindSpore version (HEAD,origin/master,origin/HEAD,master)
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    Failing version:
    MindSpore version: B100_r2.1_20230720161524_5f8174d7
    run package: HISI_C30/20230720

    Passing version:
    MindSpore version: r2.1_20230719161523_c433b910
    run package: HISI_C30/20230715

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

test_ms_resnet18_cifar10_pynative_train_infer_910_8p_0001

Steps to reproduce the issue (Mandatory)

  1. Get the code from the models repository.
  2. cd official/cv/ResNet/
  3. bash run_standalone_train.sh [DATASET_PATH] [CONFIG_PATH] [RESUME_CKPT] (optional)
  4. Set mode=pynative.
  5. Verify that performance meets the target and the loss converges.

Describe the expected behavior (Mandatory)

The resnet18 network trains successfully and performance meets the target.

Related log / screenshot (Mandatory)

Traceback (most recent call last):
  File "train.py", line 234, in <module>
    train_net()
  File "/home/jenkins/workspace/TDT_deployment/solution_test/cases/02network/00cv/resnet18/pynative/test_ms_resnet18_cifar10_pynative_train_infer_910_8p_0001/scripts/train_parallel1/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
    run_func(*args, **kwargs)
  File "train.py", line 228, in train_net
    sink_size=dataset.get_dataset_size(), dataset_sink_mode=dataset_sink_mode)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 620, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 664, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 660, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 444, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 101, in construct
    return self.network(*outputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 664, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 660, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 444, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py", line 423, in construct
    loss = self.network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 664, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 661, in __call__
    _pynative_executor.end_graph(self, output, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1304, in end_graph
    self._executor.end_graph(obj, output, *args, *(kwargs.values()))
RuntimeError: The pointer[auto_grad_meta_data] is null.

----------------------------------------------------
- Framework Unexpected Exception Raised:
----------------------------------------------------
This exception is caused by framework's unexpected error. Please create an issue at https://gitee.com/mindspore/mindspore/issues to get help.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/pipeline/pynative/grad/auto_grad.cc:1744 MapParameter

Special notes for this issue (Optional)

Assign to Luo Chao

Comments (8)

sunjiawei999 created the Bug-Report
sunjiawei999 added the attr/function label
sunjiawei999 added the kind/bug label
sunjiawei999 added the stage/func-debug label
sunjiawei999 added the v2.1.0 label
sunjiawei999 added the sig/pynative label

Please assign maintainer to check this issue.
@sunjiawei999

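Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get your code reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to the Pull Request. Labeled PRs are pushed directly to the responsible owners for review.
Taking a component-related code submission as an example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
In addition, you can mark the type of the PR, for example a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have learned how to add labels with commands. Go ahead and add labels in the comments below!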

sunjiawei999 edited the title
sunjiawei999 edited the description
sunjiawei999 edited the description
zhongjicheng changed the assignee from zhongjicheng to luochao60

Built a package from the latest 2.1 code and ran it on multiple machines; the problem has not been reproduced yet.

Same problem as this issue: https://e.gitee.com/mind_spore/dashboard?issue=I7BHKY
So far, neither testing nor development has reproduced the problem again.

fangwenyi added the ccb/bug label

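2023/7/27 CCB:
Reason for deferral: the problem did not reproduce in repeated tests on multiple machines, so it will be kept under observation. By CCB decision, it is deferred as an intermittent issue.
Impact: resnet18 training intermittently fails in PyNative (dynamic graph) mode on Ascend.
Workaround: this is an intermittent issue; if it occurs in a user's network, rerun the training.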
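
The rerun workaround can be sketched as a small retry wrapper. This is only an illustration under stated assumptions: `run_with_retry`, `train_fn`, `max_attempts`, and `delay_seconds` are hypothetical names, not MindSpore APIs, and `train_fn` stands in for whatever launches one training run. The error signature is taken from the traceback in this issue.

```python
import time

# Substring of the error message from the traceback in this issue.
ERROR_SIGNATURE = "auto_grad_meta_data"


def run_with_retry(train_fn, max_attempts=3, delay_seconds=0.0):
    """Call train_fn(), retrying only when the known intermittent error is raised.

    train_fn is a hypothetical zero-argument callable that runs one training job.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn()
        except RuntimeError as err:
            # Re-raise unrelated errors, and give up after the last attempt.
            if ERROR_SIGNATURE not in str(err) or attempt == max_attempts:
                raise
            print(f"attempt {attempt} hit the intermittent error, retrying")
            time.sleep(delay_seconds)
```

Retrying inside the same Python process may not be safe after a framework-level exception, so in practice `train_fn` would more plausibly launch the training script as a fresh process; that choice is left open here.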

fangwenyi added the v2.2.0 label

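2023/9/9 CCB:
At the 2.1.1 release exit the issue meets the downgrade criteria: it has not reproduced since July, so it is downgraded to a normal-severity issue and kept for tracking.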

weiyang changed the priority from Major to Minor
chujinjin added the ltnr label
chujinjin added the ltnr label

Network: LSTM
Version: r2.2_20231102_188a4d04_2023-11-06 20:22:02
The same problem also occurs: The pointer[auto_grad_meta_data] is null
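
When a failure shows up on another network, as with LSTM here, it helps to confirm that the log carries the same error signature before linking it to this issue. A minimal sketch, assuming the log is available as text; `matches_issue` and `SIGNATURE` are hypothetical helper names, and the pattern is based only on the message shown in the traceback of this issue.

```python
import re

# Accepts ASCII or full-width brackets around the pointer name,
# with optional whitespace, e.g. "The pointer[auto_grad_meta_data] is null".
SIGNATURE = re.compile(
    r"The pointer\s*[\[【]\s*auto_grad_meta_data\s*[\]】]\s*is null"
)


def matches_issue(log_text):
    """Return True if the log text contains this issue's error signature."""
    return SIGNATURE.search(log_text) is not None
```

For example, `matches_issue(open("train.log").read())` could be used to triage a failed run, where `train.log` is a placeholder path.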

zhunaipan added the v2.2.10 label
zhunaipan added the v2.2.12 label
zhunaipan added the v2.2.13 label
luochao60 changed the priority from Minor to Not Important

CCB conclusion: the issue has consistently not reproduced, so it is downgraded to Not Important.

zhunaipan added the v2.2.14 label
zhunaipan added the r2.2 label
