
Ascend/pytorch


pytorch1.8: error when training the timegan model

Feedback
Training issue
Created on 2022-08-10 19:48

I. Problem description (with error log context):
I want to run the timegan model on Ascend, so I converted the code for NPU with MindStudio's GPU2Ascend tool and found that several operators, including GRU, were unsupported. I checked the npu_pytorch operator support list and saw that for version 1.8 everything except GRU is supported, so I built a custom PyTorch 1.8 image and ran it on ModelArts, replacing the GRU in the code with an LSTM so that all operators are now supported. The code runs fine on CPU, but when I switch to NPU I get the error below. What could be the cause?
[W OperatorEntry.cpp:121] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_has_compatible_shallow_copy_type(Tensor self, Tensor from) -> (bool)
registered at /home/ma-user/pytorch_v1.8.1/build/aten/src/ATen/RegisterSchema.cpp:20
dispatch key: Math
previous kernel: registered at /home/ma-user/pytorch_v1.8.1/build/aten/src/ATen/RegisterMath.cpp:5686
new kernel: registered at /home/ma-user/pytorch/torch_npu/csrc/aten/ops/HasCompatibleShallowCopyType.cpp:37 (function registerKernel)

Code directory: /home/ma-user/work/timegan-pytorch-main_without_tb
Data directory: /home/ma-user/work/timegan-pytorch-main_without_tb/data
Output directory: /home/ma-user/work/timegan-pytorch-main_without_tb/output/test
TensorBoard directory: /home/ma-user/work/timegan-pytorch-main_without_tb/tensorboard

Using CUDA

Loading data...

Dropped 504 rows (outliers)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3676/3676 [00:16<00:00, 220.47it/s]
Processed data: (3676, 100, 6) (Idx x MaxSeqLen x Features)

Original data preview:
[[[ 0.19376718 0.19446839]
[ 0.19232369 0.19224311]
[ 0.19594256 0.19481357]
[ 0.20078938 0.20019403]
[ 0.19906535 0.20037676]
[ 0.19672326 0.19752207]
[ 0.19728439 0.19644191]
[-1. -1. ]
[-1. -1. ]
[-1. -1. ]]

[[ 0.4860957 0.49640034]
[ 0.48522808 0.48878844]
[ 0.48351736 0.48673669]
[ 0.48463053 0.48547787]
[ 0.49108043 0.4905124 ]
[ 0.48256791 0.48940151]
[ 0.47696925 0.48430077]
[-1. -1. ]
[-1. -1. ]
[-1. -1. ]]]

Start Embedding Network Training
Epoch: 0, Loss: 0: 0%| | 0/600 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 291, in
main(args)
File "main.py", line 112, in main
timegan_trainer(model, train_data, train_time, args)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/utils.py", line 211, in timegan_trainer
args=args,
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/utils.py", line 35, in embedding_trainer
_, E_loss0, E_loss_T0 = model(X=X_mb, T=T_mb, Z=None, obj="autoencoder")
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/timegan.py", line 546, in forward
loss = self._recovery_forward(X, T)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/timegan.py", line 403, in _recovery_forward
H = self.embedder(X, T)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/timegan.py", line 64, in forward
H_o, H_t = self.emb_rnn(X_packed)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch_npu/utils/module.py", line 253, in lstm_forward
return output_packed, self.permute_hidden(hidden, unsorted_indices)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 615, in permute_hidden
return apply_permutation(hx[0], permutation), apply_permutation(hx[1], permutation)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 21, in apply_permutation
return tensor.index_select(dim, permutation)
AttributeError: 'NoneType' object has no attribute 'index_select'
THPModule_npu_shutdown success.
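
For reference on the GRU-to-LSTM substitution described above, the change is presumably of this form (a minimal sketch with hypothetical sizes; the model's actual LSTM definition is quoted later in the comments):

import torch.nn as nn

# Hypothetical sizes; feature_dim=6 matches the processed data shape in the log.
feature_dim, hidden_dim, num_layers = 6, 20, 3

# Original recurrent layer, unsupported on NPU according to the 1.8 operator list:
# emb_rnn = nn.GRU(feature_dim, hidden_dim, num_layers, batch_first=True)

# Drop-in replacement with the same constructor arguments:
emb_rnn = nn.LSTM(feature_dim, hidden_dim, num_layers, batch_first=True)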

II. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): CANN 5.1.rc2
-- TensorFlow/PyTorch/MindSpore version: PyTorch 1.8.1
-- Python version (e.g., Python 3.7.5): Python 3.7.10
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)): MindStudio 3.0.3
-- OS version (e.g., Ubuntu 18.04): EulerOS 2.8

III. Test steps:
xxxx

IV. Log information:
xxxx
Please collect the log information for your runtime environment as described below. If the issue involves operator development, please also provide the UT/ST test and single-operator integration test logs.

How to provide the logs:
Package the logs and upload them as an attachment. If they exceed the attachment size limit, upload them to an external drive and provide a link.

See the wiki for how to collect them:
https://gitee.com/ascend/modelzoo/wikis/如何获取日志和计算图?sort_id=4097825

Comments (8)

做个俗人 created this training issue 3 years ago

Please add labels; you can also visit https://gitee.com/ascend/community/blob/master/labels.md to find more.
To get this issue reviewed quickly, please add labels to it; a labeled issue can be pushed directly to the person responsible for review.
For example, if you are submitting model training code, you can comment:
//train/model
You can also mark the type of this issue, e.g. a bugfix or a feature request:
//kind/bug or //kind/feature
Now that you know how to add labels with commands, go ahead and add them in the comments below!

//train/model

The immediate cause of the error is that the argument is not a tensor, so it has no index_select attribute. You need to check the model source code to see what happens inside H_o, H_t = self.emb_rnn(X_packed).

self.emb_rnn = torch.nn.LSTM(
    input_size=self.feature_dim,
    hidden_size=self.hidden_dim,
    num_layers=self.num_layers,
    batch_first=True
)
emb_rnn is an LSTM network; I don't see what is wrong with it. Does it need to be changed to the torch_npu version?

The semantics of the current NPU LSTM implementation differ from those of native torch.
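
For context, a minimal sketch of the code path the traceback goes through, assuming the embedder packs its padded input with pack_padded_sequence(..., enforce_sorted=False), which is what produces the unsorted_indices that permute_hidden later tries to apply (sizes are hypothetical; feature_dim=6 matches the data preview above):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

feature_dim, hidden_dim, num_layers = 6, 20, 3
emb_rnn = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                  num_layers=num_layers, batch_first=True)

X = torch.randn(4, 100, feature_dim)   # (batch, max_seq_len, features), padded
T = torch.tensor([100, 80, 60, 40])    # true sequence lengths

# enforce_sorted=False gives the PackedSequence a non-None unsorted_indices,
# which torch_npu's lstm_forward passes on to permute_hidden when returning.
X_packed = pack_padded_sequence(X, T, batch_first=True, enforce_sorted=False)
H_o, (h_n, c_n) = emb_rnn(X_packed)    # works on CPU, fails on NPU as in the traceback
H_o, _ = pad_packed_sequence(H_o, batch_first=True)
print(H_o.shape, h_n.shape)            # CPU: torch.Size([4, 100, 20]) torch.Size([3, 4, 20])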

So how should this be resolved?

This issue is still under internal discussion and pending a decision; for now we suggest working around it at the model code level.
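
One possible model-level workaround, sketched under the assumption that the embedder currently packs X with pack_padded_sequence before calling self.emb_rnn: skip the PackedSequence path entirely, feed the padded tensor to the LSTM, and mask the padded time steps by hand. The helper below is hypothetical, not part of the model or of torch_npu:

import torch

def lstm_forward_padded(emb_rnn, X, T):
    # X: (batch, max_seq_len, feature_dim) padded tensor
    # T: per-sample true sequence lengths (list or 1-D tensor)
    H_o, (h_n, c_n) = emb_rnn(X)        # plain padded input, no PackedSequence
    lengths = torch.as_tensor(T, device=X.device)
    # Zero the outputs at padded time steps so downstream losses ignore them.
    mask = torch.arange(X.size(1), device=X.device)[None, :] < lengths[:, None]
    H_o = H_o * mask.unsqueeze(-1)
    return H_o, (h_n, c_n)

Note that without packing, h_n and c_n correspond to the last padded time step rather than each sequence's true last step; if the model uses H_t downstream, the hidden state would need to be gathered at index T-1 per sample instead. Whether that matters depends on how the TimeGAN losses use these outputs.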

郭夏 changed task status from TODO to Analysing 3 years ago
郭夏 changed task status from Analysing to Feedback 3 years ago
