MindSpore / mindspore

[ST][MS][NET][conformer][gpu 1p][pynative]RuntimeError: For 'Cast', the input and output device addresses must be both null or both not null

Status: REJECTED
Type: Bug-Report
Created: 2023-09-23 08:26

name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

Conformer model repository: https://github.com/mindspore-lab/mindaudio/main
Training the conformer network fails on a single GPU (1p) in PyNative mode.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device GPU

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., r1.6 commit_id=xxxx):
    -- Python version (e.g., Python 3.7.5):
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    Run package: HISI_C30/20230906
    MindSpore version: r2.2_master_20230921021523_4d473f93

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
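
Several of the software environment fields above are blank in the report. As a minimal sketch of how they might be gathered on the test machine (the helper name `collect_environment` is hypothetical and not part of MindSpore or mindaudio; only `mindspore.__version__` is read from the framework, and only if it can be imported):

```python
import platform


def collect_environment():
    """Gather the values the bug template asks for, using only the standard library.

    The MindSpore version is read from mindspore.__version__ when the package
    can be imported; otherwise the reason it could not be read is recorded.
    """
    info = {
        "python_version": platform.python_version(),
        "os_platform": platform.platform(),
    }
    try:
        import mindspore  # optional: may be missing on the machine running this

        info["mindspore_version"] = mindspore.__version__
    except Exception as exc:  # ImportError, or a loader error from native libraries
        info["mindspore_version"] = "unavailable ({})".format(type(exc).__name__)
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print("{}: {}".format(key, value))
```

The GCC/compiler version is left out because it only applies to source builds.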

Related testcase (Mandatory)

Test repository path: solution_test/cases/02network/05network_competitive_analysis/03audio/conformer/train
Test case: test_ms_mindaudio_conformer_train_eval_gpu_1p_0001.py

Steps to reproduce the issue (Mandatory)

  1. Get the code from mindaudio.
  2. cd examples/conformer; set is_distributed = False
  3. python train.py --config_path ./conformer.yaml
  4. Verify whether the network trains successfully.

Describe the expected behavior (Mandatory)

The network trains successfully.

Related log / screenshot (Mandatory)

2023-09-22 18:45:44,983 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1809/1815], Step Time: 1.5048 sec, lr: 0.000000, Total Loss: 215.6742, Scale: 1024, Rank: 0.
2023-09-22 18:45:46,441 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1810/1815], Step Time: 1.4550 sec, lr: 0.000000, Total Loss: 434.0997, Scale: 1024, Rank: 0.
2023-09-22 18:45:47,906 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1811/1815], Step Time: 1.4606 sec, lr: 0.000000, Total Loss: 381.7842, Scale: 1024, Rank: 0.
2023-09-22 18:45:49,401 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1812/1815], Step Time: 1.4915 sec, lr: 0.000000, Total Loss: 362.5811, Scale: 1024, Rank: 0.
2023-09-22 18:45:50,803 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1813/1815], Step Time: 1.3982 sec, lr: 0.000000, Total Loss: 556.4254, Scale: 1024, Rank: 0.
2023-09-22 18:45:52,169 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1814/1815], Step Time: 1.3626 sec, lr: 0.000000, Total Loss: 245.5831, Scale: 1024, Rank: 0.
2023-09-22 18:45:53,577 - mindaudio - INFO - [Train] Epoch: [1/240], Step: [1815/1815], Step Time: 1.4037 sec, lr: 0.000000, Total Loss: 346.0873, Scale: 1024, Rank: 0.
2023-09-22 18:45:54,488 - mindaudio - INFO - [EvalCallback] Step: 0/359, Eval Loss: 335.1355.
[WARNING] PRE_ACT(92887,7f610d7fe700,python):2023-09-22-18:46:00.373.252 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:583] DumpDynamicMemPoolDebugInfo] Finish dump dynamic memory pool debug info.
[CRITICAL] DEVICE(92887,7f610d7fe700,python):2023-09-22-18:46:00.379.449 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:579] MallocOutputMemoryForDeviceAddress] Allocate device memory failed!
[CRITICAL] KERNEL(92887,7f610d7fe700,python):2023-09-22-18:46:00.396.702 [mindspore/ccsrc/plugin/device/gpu/kernel/arrays/cast_gpu_kernel.cc:74] LaunchKernel] For 'Cast', the input and output device addresses must be both null or both not null
Traceback (most recent call last):
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindaudio-0.1.2-py3.7.egg/mindaudio/loss/label_smoothing_loss.py", line 117, in construct
    return kl.sum() / denom
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/tensor.py", line 3306, in sum
    res = tensor_operator_registry.get("sum")(self, axis, keepdims)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/function/math_func.py", line 12044, in sum
    if input.dtype == mstype.bool_:
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/_stub_tensor.py", line 94, in dtype
    self.stub_dtype = self.stub.get_dtype()
RuntimeError: Allocate device memory failed!

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/run_op_helper.cc:579 MallocOutputMemoryForDeviceAddress


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 181, in <module>
    train()
  File "train.py", line 176, in train
    dataset_sink_mode=False,
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 617, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 926, in _train_process
    list_callback.on_train_step_end(run_context)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 413, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 255, in on_train_step_end
    self.step_end(run_context)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindaudio-0.1.2-py3.7.egg/mindaudio/utils/callback.py", line 378, in step_end
    avg_loss, eval_time = self.eval()
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindaudio-0.1.2-py3.7.egg/mindaudio/utils/callback.py", line 309, in eval
    loss = self.network(*input_data).asnumpy()
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 696, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 365, in construct
    loss = self.network(*inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 696, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 296, in construct
    xs_chunk_masks,
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 696, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 692, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 473, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 126, in construct
    ys_sub_masks,
  File "/home/jenkins0/mindaudio/examples/conformer/asr_model.py", line 174, in _calc_att_loss
    loss_att = self.criterion_att(decoder_out, ys_out_pad, ys_masks)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 695, in __call__
    _pynative_executor.clear_res()
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1249, in clear_res
    return self._executor.clear_res()
RuntimeError: For 'Cast', the input and output device addresses must be both null or both not null

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/cast_gpu_kernel.cc:74 LaunchKernel
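
In the log above, the 'Cast' error named in the title is the second of two CRITICAL entries. The first is "Allocate device memory failed!" from run_op_helper.cc, and the traceback reports the 'Cast' RuntimeError as raised "during handling of" that first exception. One possible reading, not confirmed anywhere in this issue, is that the 'Cast' message is a follow-on symptom of the allocation failure. The sketch below lists CRITICAL entries in order of appearance so the earliest one can be checked first. The regular expression and the helper name `critical_entries` are assumptions inferred only from the three log lines shown here; this is not a MindSpore tool.

```python
import re

# Pattern inferred from the log lines in this report (an assumption, not an
# official format specification):
# [LEVEL] MODULE(pid,thread,process):timestamp [source_file:line] Function] message
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<module>[A-Z_]+)\(\d+,[0-9a-f]+,\w+\):"
    r"(?P<timestamp>\S+)\s+\[(?P<source>[^\]]+)\]\s+(?P<function>\w+)\]\s+(?P<message>.*)$"
)


def critical_entries(log_text):
    """Return the CRITICAL log entries as dicts, in order of appearance."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "CRITICAL":
            entries.append(match.groupdict())
    return entries


# The three framework log lines copied from this report.
SAMPLE = """\
[WARNING] PRE_ACT(92887,7f610d7fe700,python):2023-09-22-18:46:00.373.252 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:583] DumpDynamicMemPoolDebugInfo] Finish dump dynamic memory pool debug info.
[CRITICAL] DEVICE(92887,7f610d7fe700,python):2023-09-22-18:46:00.379.449 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:579] MallocOutputMemoryForDeviceAddress] Allocate device memory failed!
[CRITICAL] KERNEL(92887,7f610d7fe700,python):2023-09-22-18:46:00.396.702 [mindspore/ccsrc/plugin/device/gpu/kernel/arrays/cast_gpu_kernel.cc:74] LaunchKernel] For 'Cast', the input and output device addresses must be both null or both not null
"""

if __name__ == "__main__":
    for entry in critical_entries(SAMPLE):
        print(entry["timestamp"], entry["module"], entry["source"], entry["message"])
```

Sorting by order of appearance rather than by message text keeps the earliest failure at the top, which is usually the one to investigate first when exceptions are chained.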


Special notes for this issue (Optional)

Assign to 李玉婷.

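Comments (1)

zhongjicheng created the Bug-Report
zhongjicheng added the attr/function label
zhongjicheng added the stage/func-debug label
zhongjicheng added the kind/bug label
zhongjicheng added the v2.2.0 label
weiyang changed the task status from TODO to REJECTED

Please confirm the responsible developer and milestone for this issue, and whether it is within the Q3 verification scope.

weiyang set the milestone to B-SolutionTest
xiangminshan edited the description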
