MindSpore/models

ASR_dynamic_train_ascend model test fails during training on MindSpore 1.10.0.B060

ACCEPTED
Bug-Report
Created on 2022-12-30 10:34

[Atlas 800 model 9000] [Model training] Training fails when running the ASR_dynamic_train_ascend model test on MindSpore 1.10.0.B060.

Environment

Uncomment only one /device <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:
/device ascend

Related testcase

Training fails when running the ASR_dynamic_train_ascend model test on MindSpore 1.10.0.B060 in a virtual machine.

Steps to reproduce the issue

  1. https://codehub-y.huawei.com/mindspore/modelinternal/files?ref=master&filePath=media_net_for_smoke%2FASR_dynamic_train_ascend
  2. Environment: 10.174.216.229 root/Ascend@123
  3. Path: /home/CI_daily/5326391/modelinternal-r1.10/media_net_for_smoke/ASR_dynamic_train_ascend
  4. Run the command: bash ./scripts/run_1p.sh /home/data/aishell/ config/default_config.yaml 0

Describe the current behavior

Training fails when running the ASR_dynamic_train_ascend model test on MindSpore 1.10.0.B060.

Describe the expected behavior

The error is as follows:
[WARNING] MD(65692,fffe397fa1e0,python3):2022-12-29-12:06:50.088.829 [mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc:195] operator()] Bad performance attention, it takes more than 25 seconds to generator.next new row, which might cause GetNext timeout problem when sink_mode=True. You can increase the parameter num_parallel_workers in GeneratorDataset / optimize the efficiency of obtaining samples in the user-defined generator function.
Traceback (most recent call last):
  File "e2e_sink_dev.py", line 214, in <module>
    run()
  File "e2e_sink_dev.py", line 208, in run
    model.train(sink_epochs, dataset, callbacks=callback_list, sink_size=1, dataset_sink_mode=True)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 1051, in train
    initial_epoch=initial_epoch)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 625, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 596, in __call__
    out = self.compile_and_run(*args)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 985, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 957, in compile
    jit_config_dict=self._jit_config_dict)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 1131, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
RuntimeError: Single op compile failed, op: mat_mul_4547133356613339742_3
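
Separately from the compile failure, the warning at the top of the log reports that fetching a row from the generator took more than 25 seconds. A minimal sketch for checking this outside of training is shown below. It is plain Python and does not use any MindSpore API; the helper names `time_rows` and `slow_rows` and the stand-in `demo_generator` are hypothetical, and the real dataset generator from the test script would be passed in instead.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_rows(source: Iterable, limit: int = 10,
              clock: Callable[[], float] = time.perf_counter
              ) -> List[Tuple[int, float]]:
    """Return (row_index, seconds) for the first `limit` rows of `source`."""
    timings = []
    iterator = iter(source)
    for index in range(limit):
        start = clock()
        try:
            next(iterator)  # fetch one row, as the data pipeline would
        except StopIteration:
            break
        timings.append((index, clock() - start))
    return timings


def slow_rows(timings: List[Tuple[int, float]],
              threshold_s: float = 25.0) -> List[int]:
    """Indices of rows whose fetch time exceeded `threshold_s` seconds."""
    return [index for index, seconds in timings if seconds > threshold_s]


def demo_generator():
    # Hypothetical stand-in for the user-defined generator function.
    for i in range(3):
        time.sleep(0.01 * i)
        yield (i,)


if __name__ == "__main__":
    timings = time_rows(demo_generator())
    for index, seconds in timings:
        print(f"row {index}: {seconds:.3f}s")
    print("rows over threshold:", slow_rows(timings, threshold_s=0.015))
```

If individual rows really are slow, the warning itself suggests increasing `num_parallel_workers` in `GeneratorDataset` or optimizing the generator function.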

Related log / screenshot

Special notes for this issue

Training fails when running the ASR_dynamic_train_ascend model test on MindSpore 1.10.0.B060.

Comments (8)

奥运冠军 created the Question

Please assign maintainer to check this issue.
@fangwenyi @chengxiaoli

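Please add labels (comp or sig). The full list of labels is at https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md.
So that your code is reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to your Pull Request; labeled PRs are pushed directly to the responsible reviewer.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can also mark the type of the PR, for example a bug fix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have learned how to add labels with commands. Go ahead and add labels in the comments below!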

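fangwenyi changed the task status from TODO to ACCEPTED
fangwenyi set the assignee to TronZhang
fangwenyi changed the task type from Question to Bug-Report
fangwenyi set the priority to Major
fangwenyi added the v1.10 label
fangwenyi added the kind/bug label
fangwenyi set the milestone to B-SIG-DS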

@张兆创
Please look at this issue as soon as possible.

We currently suspect a problem with the run package; we are still isolating the fault.

This is an Ascend operator issue. Tracking ticket: DTS2023010407721

TronZhang added TronZhang as a collaborator
TronZhang changed the assignee from TronZhang to 刘力力
TronZhang changed the milestone from B-SIG-DS to B-SIG-MSLite

Please let me know when this ticket goes to regression testing; I do not have permission to view it.

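When the GenFuncStub method calls the rts registration interface, the framework's registration cache is incomplete, so the same operator .o file is registered repeatedly and memory usage grows. (In the new version, the .o files compiled from TBE operators are also larger.) For a large network, this risks exceeding available memory.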
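
The idea behind the diagnosis above, registering each distinct operator binary only once, can be illustrated with a small toy sketch. This is not the MindSpore or rts implementation; the class `KernelRegistry`, its `register` method, and the choice of a SHA-256 content digest as the cache key are assumptions made only for illustration.

```python
import hashlib
from typing import Dict


class KernelRegistry:
    """Toy registry that registers each distinct binary blob only once."""

    def __init__(self) -> None:
        self._handles: Dict[str, int] = {}  # content digest -> handle
        self.registered_bytes = 0           # bytes held by the registry

    def register(self, binary: bytes) -> int:
        """Return a handle for `binary`, reusing it if already registered."""
        digest = hashlib.sha256(binary).hexdigest()
        handle = self._handles.get(digest)
        if handle is None:
            # First time this content is seen: register it and account
            # for its size. Repeated calls with the same content skip this.
            handle = len(self._handles)
            self._handles[digest] = handle
            self.registered_bytes += len(binary)
        return handle


if __name__ == "__main__":
    registry = KernelRegistry()
    blob = b"\x7fELF" + b"\x00" * 1024  # hypothetical stand-in for a .o file
    first = registry.register(blob)
    second = registry.register(blob)    # cache hit, no extra memory
    print(first == second, registry.registered_bytes)
```

Without the cache lookup, every call would add `len(binary)` again, which is the growth pattern the comment describes.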

chengbin changed the milestone from B-SIG-DS to B-SIG-MSLite
zhanghaibo added the sig/ops label
fangwenyi added the rct/cann label

Depends on HiSilicon ticket DTS2023010407721
