GRU_for_PyTorch: GE error in DynamicGRUV2Grad, error code is 500002

I. Problem description (with error log context):

![Image description](https://foruda.gitee.com/images/1756433137138753099/00dda6c3_1902900.png "Screenshot")
II. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): **8.1.RC1**
-- TensorFlow/PyTorch/MindSpore version: PyTorch 2.6.0
-- Python version (e.g., Python 3.7.5): 3.9.23
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)): /
-- Operating system version (e.g., Ubuntu 18.04): Ubuntu 22.04
-- de_core_news_sm-3.8.0-py3-none-any.whl
-- en_core_web_sm-3.8.0-py3-none-any.whl
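
For anyone trying to reproduce this, the Python-side versions listed above can be gathered in one step. The following is a minimal sketch, not part of the original report. The distribution names `torch` and `torch_npu` and the two model package names are assumptions and may need adjusting to match the `pip list` output. The CANN version is not a Python package and is not covered here.

```python
import platform
from importlib import metadata

# Distribution names are assumptions; adjust them to match `pip list`.
PACKAGES = ("torch", "torch_npu", "de_core_news_sm", "en_core_web_sm")


def collect_versions(packages=PACKAGES):
    """Map each name to its installed version, or None if it is not found."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Packages that are not installed are reported as "not installed" rather than raising, so the script can run in a partially set up environment.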

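III. Test steps:
Followed the ModelZoo guide: https://www.hiascend.com/software/modelzoo/models/detail/25d8422a3e0648389f80e599be9a1ab8
The customer wants to test with the latest PyTorch version, so PyTorch 2.6.0 was installed.
The tokenizer model packages are version 3.8.0, installed from whl files downloaded directly from the GitHub releases.
Because the dataset could not be downloaded from the external network, it was downloaded manually and extracted to the specified directory.
Finally, running `bash ./test/train_performance_1p.sh --data_path=<dataset path>` produced the error.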

IV. Log information:

```
Namespace(workers=12, epochs=1, batch_size=1536, world_size=1, rank=0, dist_url='tcp://127.0.0.1:50000', dist_backend='nccl', data_dir='/root/xxx/gru/datasets', seed=1234, gpu=None, multiprocessing_distributed=False, npu=1, amp=True, loss_scale=32.0, opt_level='O2')
/root/xxx/gru/ModelZoo-PyTorch/PyTorch/built-in/nlp/GRU_for_PyTorch/gru_1p.py:63: UserWarning: to - 1 is out of bounds [-(2^24), 2^24]. Due to precision limitations float can support discrete uniform distribution only within this range. This warning will become an error in version 1.7 release, please fix the code in advance (Triggered internally at /pytorch/aten/src/ATen/native/DistributionTemplates.h:112.)
  return torch.randint(1, MAX, size=(num,), dtype=torch.float)
[W829 01:46:19.872570637 compiler_depend.ts:65] Warning: Warning: The torch.npu.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='npu') to create tensors. (function operator())
/root/anaconda3/envs/xxx-gru/lib/python3.9/site-packages/torch/nn/modules/rnn.py:1393: UserWarning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:335.)
  result = _VF.gru(
The model has 14,219,781 trainable parameters
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
combine_grad           : None
combine_ddp            : None
ddp_replica_count      : 4
check_combined_tensors : None
user_cast_preferred    : None
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : 32.0
combine_grad           : None
combine_ddp            : None
ddp_replica_count      : 4
check_combined_tensors : None
user_cast_preferred    : None
Traceback (most recent call last):
  File "/root/xxx/gru/ModelZoo-PyTorch/PyTorch/built-in/nlp/GRU_for_PyTorch/gru_1p.py", line 408, in <module>
    main()
  File "/root/xxx/gru/ModelZoo-PyTorch/PyTorch/built-in/nlp/GRU_for_PyTorch/gru_1p.py", line 126, in main
    main_worker(args)
  File "/root/xxx/gru/ModelZoo-PyTorch/PyTorch/built-in/nlp/GRU_for_PyTorch/gru_1p.py", line 175, in main_worker
    train_loss = train(model, train_iterator, optimizer, criterion, args, CLIP, epoch)
  File "/root/xxx/gru/ModelZoo-PyTorch/PyTorch/built-in/nlp/GRU_for_PyTorch/gru_1p.py", line 233, in train
    scaled_loss.backward()
  File "/root/anaconda3/envs/xxx-gru/lib/python3.9/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/root/anaconda3/envs/xxx-gru/lib/python3.9/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/root/anaconda3/envs/xxx-gru/lib/python3.9/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: InnerRun:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:234 OPS function error: DynamicGRUV2Grad, error code is 500002
[ERROR] 2025-08-29-01:46:24 (PID:3621708, Device:1, RankID:-1) ERR01100 OPS call acl api failed
[Error]: A GE error occurs in the system.
        Rectify the fault based on the error information in the ascend log.
EZ9999: Inner Error!
EZ9999: [PID: 3621708] 2025-08-29-01:46:24.077.619 ChooseCompileInfo Failed[FUNC:Classify][FILE:transdata_dsl.cc][LINE:704]
        TraceBack (most recent call last):
       Autotiling func failed[FUNC:AutoTilingRun][FILE:auto_tiling_rt2.cc][LINE:109]
       op[trans_TransData_38], call DoTiling failed[FUNC:TransDataDSLTiling][FILE:trans_data.cc][LINE:878]
       op[trans_TransData_38], call DoTiling failed[FUNC:Tiling4TransData][FILE:trans_data.cc][LINE:977]
       [Exec][Op]Execute op failed. op type = DynamicGRUV2Grad, ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

```
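
The two facts that matter most in this log are the failing operator with its error code, and the operator whose tiling failed underneath it. As a rough triage aid (not part of the original report), the sketch below pulls these out of a saved log. The regular expressions are derived only from the line formats shown above, so other CANN or torch_npu versions may need different patterns. The function name `summarize` is hypothetical.

```python
import re

# Patterns are based only on the line formats seen in the log above.
OPS_ERROR = re.compile(r"OPS function error: (\w+), error code is (\d+)")
TILING_FAIL = re.compile(r"op\[(\w+)\], call DoTiling failed")


def summarize(log_text):
    """Return (op, error code) pairs and the ops whose tiling failed."""
    ops_errors = OPS_ERROR.findall(log_text)
    tiling_failures = sorted(set(TILING_FAIL.findall(log_text)))
    return {"ops_errors": ops_errors, "tiling_failures": tiling_failures}


# Sample lines copied from the log above.
sample = """RuntimeError: InnerRun:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:234 OPS function error: DynamicGRUV2Grad, error code is 500002
       op[trans_TransData_38], call DoTiling failed[FUNC:TransDataDSLTiling][FILE:trans_data.cc][LINE:878]
       op[trans_TransData_38], call DoTiling failed[FUNC:Tiling4TransData][FILE:trans_data.cc][LINE:977]
"""

if __name__ == "__main__":
    print(summarize(sample))
```

On the sample lines this reports `DynamicGRUV2Grad` with code `500002` as the failing operator, and `trans_TransData_38` as the operator whose tiling failed.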
