name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
Conformer model repository: https://github.com/mindspore-lab/mindaudio/main
Training the Conformer network on GPU (1p, PyNative mode) fails.
Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved:
/device GPU
Software Environment (Mandatory):
-- MindSpore version (e.g.,r1.6 commit_id=xxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Run package: HISI_C30/20230906
MindSpore version: r2.2_master_20230921021523_4d473f93
Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved:
/mode pynative
Test repository path: solution_test/cases/02network/05network_competitive_analysis/03audio/conformer/train
Test case: test_ms_mindaudio_conformer_train_eval_gpu_1p_0001.py
Expected behavior: network training succeeds.
2023-09-22 18:45:44,983 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1809/1815], Step Time: 1.5048 sec, lr: 0.000000, Total Loss: 215.6742, Scale: 1024, Rank: 0.
2023-09-22 18:45:46,441 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1810/1815], Step Time: 1.4550 sec, lr: 0.000000, Total Loss: 434.0997, Scale: 1024, Rank: 0.
2023-09-22 18:45:47,906 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1811/1815], Step Time: 1.4606 sec, lr: 0.000000, Total Loss: 381.7842, Scale: 1024, Rank: 0.
2023-09-22 18:45:49,401 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1812/1815], Step Time: 1.4915 sec, lr: 0.000000, Total Loss: 362.5811, Scale: 1024, Rank: 0.
2023-09-22 18:45:50,803 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1813/1815], Step Time: 1.3982 sec, lr: 0.000000, Total Loss: 556.4254, Scale: 1024, Rank: 0.
2023-09-22 18:45:52,169 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1814/1815], Step Time: 1.3626 sec, lr: 0.000000, Total Loss: 245.5831, Scale: 1024, Rank: 0.
2023-09-22 18:45:53,577 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1815/1815], Step Time: 1.4037 sec, lr: 0.000000, Total Loss: 346.0873, Scale: 1024, Rank: 0.
2023-09-22 18:45:54,488 - mindaudio - INFO - [EvalCallback] Step: 0/359, Eval Loss: 335.1355.
[WARNING] PRE_ACT(92887,7f610d7fe700,python):2023-09-22-18:46:00.373.252 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:583] DumpDynamicMemPoolDebugInfo] Finish dump dynamic memory pool debug info.
[CRITICAL] DEVICE(92887,7f610d7fe700,python):2023-09-22-18:46:00.379.449 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:579] MallocOutputMemoryForDeviceAddress] Allocate device memory failed!
[CRITICAL] KERNEL(92887,7f610d7fe700,python):2023-09-22-18:46:00.396.702 [mindspore/ccsrc/plugin/device/gpu/kernel/arrays/cast_gpu_kernel.cc:74] LaunchKernel] For 'Cast', the input and output device addresses must be both null or both not null
Traceback (most recent call last):
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
output = self._run_construct(args, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindaudio-0.1.2-py3.7.egg/mindaudio/loss/label_smoothing_loss.py", line 117, in construct
return kl.sum() / denom
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/tensor.py", line 3306, in sum
res = tensor_operator_registry.get("sum")(self, axis, keepdims)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/function/math_func.py", line 12044, in sum
if input.dtype == mstype.bool_:
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/_stub_tensor.py", line 94, in dtype
self.stub_dtype = self.stub.get_dtype()
RuntimeError: Allocate device memory failed!
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/run_op_helper.cc:579 MallocOutputMemoryForDeviceAddress
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 181, in <module>
train()
File "train.py", line 176, in train
dataset_sink_mode=False,
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train
initial_epoch=initial_epoch)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
func(self, *args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 617, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 926, in _train_process
list_callback.on_train_step_end(run_context)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 413, in on_train_step_end
cb.on_train_step_end(run_context)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 255, in on_train_step_end
self.step_end(run_context)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindaudio-0.1.2-py3.7.egg/mindaudio/utils/callback.py", line 378, in step_end
avg_loss, eval_time = self.eval()
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindaudio-0.1.2-py3.7.egg/mindaudio/utils/callback.py", line 309, in eval
loss = self.network(*input_data).asnumpy()
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 696, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
output = self._run_construct(args, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 365, in construct
loss = self.network(*inputs, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 696, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
output = self._run_construct(args, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 296, in construct
xs_chunk_masks,
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 696, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
output = self._run_construct(args, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 126, in construct
ys_sub_masks,
File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 174, in _calc_att_loss
loss_att = self.criterion_att(decoder_out, ys_out_pad, ys_masks)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 695, in __call__
_pynative_executor.clear_res()
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1249, in clear_res
return self._executor.clear_res()
RuntimeError: For 'Cast', the input and output device addresses must be both null or both not null
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/cast_gpu_kernel.cc:74 LaunchKernel
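For triage, it can help to list the `[CRITICAL]` framework lines in time order, since the log above suggests the `Cast` error was raised while the earlier allocation failure was still being handled. The following is a minimal sketch using only the Python standard library; the regular expression is an assumption inferred from the three framework log lines in this report, and `critical_events` and `SAMPLE` are hypothetical names, not part of MindSpore or mindaudio.

```python
import re

# Assumed layout, inferred from the log lines in this report:
# [LEVEL] MODULE(pid,tid,python):YYYY-MM-DD-HH:MM:SS.mmm.uuu [file:line] Func] message
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<module>[A-Z_]+)\(\d+,[0-9a-f]+,\w+\):"
    r"(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}\.\d{3})\s+"
    r"\[(?P<src>[^\]]+)\]\s+(?P<func>\w+)\]\s+(?P<msg>.*)$"
)


def critical_events(log_text):
    """Return (timestamp, module, source, message) per CRITICAL line, oldest first."""
    events = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") == "CRITICAL":
            events.append(
                (m.group("ts"), m.group("module"), m.group("src"), m.group("msg"))
            )
    # Timestamps are fixed-width, so lexicographic order equals time order.
    return sorted(events)


# The three framework log lines copied from the report above.
SAMPLE = "\n".join([
    "[WARNING] PRE_ACT(92887,7f610d7fe700,python):2023-09-22-18:46:00.373.252 "
    "[mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:583] "
    "DumpDynamicMemPoolDebugInfo] Finish dump dynamic memory pool debug info.",
    "[CRITICAL] DEVICE(92887,7f610d7fe700,python):2023-09-22-18:46:00.379.449 "
    "[mindspore/ccsrc/runtime/pynative/run_op_helper.cc:579] "
    "MallocOutputMemoryForDeviceAddress] Allocate device memory failed!",
    "[CRITICAL] KERNEL(92887,7f610d7fe700,python):2023-09-22-18:46:00.396.702 "
    "[mindspore/ccsrc/plugin/device/gpu/kernel/arrays/cast_gpu_kernel.cc:74] "
    "LaunchKernel] For 'Cast', the input and output device addresses must be "
    "both null or both not null",
])

for ts, module, src, msg in critical_events(SAMPLE):
    print(ts, module, src, msg)
```

On the lines from this report, the `DEVICE` allocation failure sorts before the `KERNEL` `Cast` error, which is consistent with the memory failure during the evaluation callback being the first fault and the `Cast` message being a follow-on.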
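Routing this to 李玉婷.
Please confirm the responsible developer and milestone for this issue, and whether it falls within the Q3 verification scope.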