[Atlas 800 Model 9000] [Model training] The ASR_dynamic_train_ascend model test fails during training on MindSpore 1.10.0.B060.
/device ascend
On MindSpore 1.10.0.B060, the ASR_dynamic_train_ascend model test, run in a virtual machine, fails during training.
The error output is as follows:
[WARNING] MD(65692,fffe397fa1e0,python3):2022-12-29-12:06:50.088.829 [mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc:195] operator()] Bad performance attention, it takes more than 25 seconds to generator.next new row, which might cause GetNext timeout problem when sink_mode=True. You can increase the parameter num_parallel_workers in GeneratorDataset / optimize the efficiency of obtaining samples in the user-defined generator function.
Traceback (most recent call last):
  File "e2e_sink_dev.py", line 214, in <module>
    run()
  File "e2e_sink_dev.py", line 208, in run
    model.train(sink_epochs, dataset, callbacks=callback_list, sink_size=1, dataset_sink_mode=True)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 1051, in train
    initial_epoch=initial_epoch)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 625, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 596, in __call__
    out = self.compile_and_run(*args)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 985, in compile_and_run
    self.compile(*inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/nn/cell.py", line 957, in compile
    jit_config_dict=self._jit_config_dict)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 1131, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
RuntimeError: Single op compile failed, op: mat_mul_4547133356613339742_3
Please assign maintainer to check this issue.
@fangwenyi @chengxiaoli
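@张兆创
Please look into this issue as soon as possible.
The run package is currently suspected to be the cause; fault isolation is still in progress.
This is an Ascend operator issue. Tracking ticket: DTS2023010407721
Please post an update here when that ticket is regression-tested; I do not have permission to view it.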
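When the GenFuncStub method calls the rts registration interface, the framework's registration cache is incomplete, so the same operator .o file is registered repeatedly and memory usage grows. The effect is larger in the new version because the .o files compiled from TBE operators have increased in size. For large networks, this risks exceeding available memory.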
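To illustrate the failure mode described above, here is a minimal Python sketch. It is not the actual MindSpore or rts code: `StubRegistry`, `register`, and `register_without_cache` are hypothetical names, and keying the cache on a SHA-256 digest of the binary is an assumption made for the example. It only shows why registering the same operator binary repeatedly grows memory linearly, while a cache keyed on the binary's content keeps a single copy.

```python
import hashlib


class StubRegistry:
    """Hypothetical registry that stores each distinct operator binary once."""

    def __init__(self):
        self._by_digest = {}  # content digest -> handle
        self.bytes_held = 0   # total bytes retained by the registry

    def register(self, binary: bytes) -> int:
        # Key the cache on the binary's content, so identical .o files
        # map to the same handle no matter how often they are registered.
        digest = hashlib.sha256(binary).hexdigest()
        handle = self._by_digest.get(digest)
        if handle is None:
            handle = len(self._by_digest)
            self._by_digest[digest] = handle
            self.bytes_held += len(binary)
        return handle


def register_without_cache(binaries) -> int:
    """Models the reported behavior: every call retains its own copy."""
    return sum(len(b) for b in binaries)


# One stand-in operator binary registered 100 times.
kernel = b"\x7fELF" + bytes(1024)
calls = [kernel] * 100

registry = StubRegistry()
handles = [registry.register(b) for b in calls]
```

In this sketch the uncached path retains 100 copies of the binary, while the cached registry retains one and returns the same handle each time. A larger .o file scales the uncached cost proportionally, which matches the observation that the problem became visible once the compiled TBE operator files grew.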
This depends on HiSilicon ticket DTS2023010407721.