
Ascend/pytorch


pytorch1.8: error when training the timegan model

Feedback
Training issue
Created on 2022-08-10 19:48

I. Problem description (with error log context):
I want to run the timegan model on Ascend, so I converted the code for NPU with MindStudio's GPU2Ascend tool and found that several operators, including GRU, were unsupported. I checked the npu_pytorch operator support list and saw that for version 1.8 everything except GRU is supported, so I built a custom PyTorch 1.8 image and ran it on ModelArts, replacing the GRU in the code with an LSTM so that all operators are now supported. The code runs fine on CPU, but when I switch to NPU I get the error below. What could be the cause?
[W OperatorEntry.cpp:121] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::_has_compatible_shallow_copy_type(Tensor self, Tensor from) -> (bool)
registered at /home/ma-user/pytorch_v1.8.1/build/aten/src/ATen/RegisterSchema.cpp:20
dispatch key: Math
previous kernel: registered at /home/ma-user/pytorch_v1.8.1/build/aten/src/ATen/RegisterMath.cpp:5686
new kernel: registered at /home/ma-user/pytorch/torch_npu/csrc/aten/ops/HasCompatibleShallowCopyType.cpp:37 (function registerKernel)

Code directory: /home/ma-user/work/timegan-pytorch-main_without_tb
Data directory: /home/ma-user/work/timegan-pytorch-main_without_tb/data
Output directory: /home/ma-user/work/timegan-pytorch-main_without_tb/output/test
TensorBoard directory: /home/ma-user/work/timegan-pytorch-main_without_tb/tensorboard

Using CUDA

Loading data...

Dropped 504 rows (outliers)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3676/3676 [00:16<00:00, 220.47it/s]
Processed data: (3676, 100, 6) (Idx x MaxSeqLen x Features)

Original data preview:
[[[ 0.19376718 0.19446839]
[ 0.19232369 0.19224311]
[ 0.19594256 0.19481357]
[ 0.20078938 0.20019403]
[ 0.19906535 0.20037676]
[ 0.19672326 0.19752207]
[ 0.19728439 0.19644191]
[-1. -1. ]
[-1. -1. ]
[-1. -1. ]]

[[ 0.4860957 0.49640034]
[ 0.48522808 0.48878844]
[ 0.48351736 0.48673669]
[ 0.48463053 0.48547787]
[ 0.49108043 0.4905124 ]
[ 0.48256791 0.48940151]
[ 0.47696925 0.48430077]
[-1. -1. ]
[-1. -1. ]
[-1. -1. ]]]

Start Embedding Network Training
Epoch: 0, Loss: 0: 0%| | 0/600 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 291, in
main(args)
File "main.py", line 112, in main
timegan_trainer(model, train_data, train_time, args)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/utils.py", line 211, in timegan_trainer
args=args,
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/utils.py", line 35, in embedding_trainer
_, E_loss0, E_loss_T0 = model(X=X_mb, T=T_mb, Z=None, obj="autoencoder")
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/timegan.py", line 546, in forward
loss = self._recovery_forward(X, T)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/timegan.py", line 403, in _recovery_forward
H = self.embedder(X, T)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/work/timegan-pytorch-main_without_tb/models/timegan.py", line 64, in forward
H_o, H_t = self.emb_rnn(X_packed)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch_npu/utils/module.py", line 253, in lstm_forward
return output_packed, self.permute_hidden(hidden, unsorted_indices)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 615, in permute_hidden
return apply_permutation(hx[0], permutation), apply_permutation(hx[1], permutation)
File "/home/ma-user/anaconda3/envs/pytorch1.8.1/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 21, in apply_permutation
return tensor.index_select(dim, permutation)
AttributeError: 'NoneType' object has no attribute 'index_select'
THPModule_npu_shutdown success.
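
For reference on the GRU-to-LSTM substitution described above, the change is presumably of this form (a minimal sketch with hypothetical sizes; the model's actual LSTM definition is quoted later in the comments):

import torch.nn as nn

# Hypothetical sizes; feature_dim=6 matches the processed data shape in the log.
feature_dim, hidden_dim, num_layers = 6, 20, 3

# Original recurrent layer, unsupported on NPU according to the 1.8 operator list:
# emb_rnn = nn.GRU(feature_dim, hidden_dim, num_layers, batch_first=True)

# Drop-in replacement with the same constructor arguments:
emb_rnn = nn.LSTM(feature_dim, hidden_dim, num_layers, batch_first=True)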

II. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): CANN 5.1.rc2
-- TensorFlow/PyTorch/MindSpore version: PyTorch 1.8.1
-- Python version (e.g., Python 3.7.5): Python 3.7.10
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)): MindStudio 3.0.3
-- OS version (e.g., Ubuntu 18.04): EulerOS 2.8

III. Test steps:
xxxx

IV. Log information:
xxxx
Please collect the log information for your runtime environment as described below. If the issue involves operator development, please also provide the UT/ST test and single-operator integration test logs.

How to provide the logs:
Package the logs and upload them as an attachment. If they exceed the attachment size limit, upload them to an external drive and provide a link.

See the wiki for how to collect them:
https://gitee.com/ascend/modelzoo/wikis/如何获取日志和计算图?sort_id=4097825

Comments (8)

做个俗人 created this training issue 3 years ago

Please add labels; you can also visit https://gitee.com/ascend/community/blob/master/labels.md to find more.
To get this issue reviewed quickly, please add labels to it; a labeled issue can be pushed directly to the person responsible for review.
For example, if you are submitting model training code, you can comment:
//train/model
You can also mark the type of this issue, e.g. a bugfix or a feature request:
//kind/bug or //kind/feature
Now that you know how to add labels with commands, go ahead and add them in the comments below!

//train/model

The immediate cause of the error is that the argument is not a tensor, so it has no index_select attribute. You need to check the model source code to see what happens inside H_o, H_t = self.emb_rnn(X_packed).

self.emb_rnn = torch.nn.LSTM(
    input_size=self.feature_dim,
    hidden_size=self.hidden_dim,
    num_layers=self.num_layers,
    batch_first=True
)
emb_rnn is an LSTM network; I don't see what is wrong with it. Does it need to be changed to the torch_npu version?

The semantics of the current NPU LSTM implementation differ from those of native torch.
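
For context, a minimal sketch of the code path the traceback goes through, assuming the embedder packs its padded input with pack_padded_sequence(..., enforce_sorted=False), which is what produces the unsorted_indices that permute_hidden later tries to apply (sizes are hypothetical; feature_dim=6 matches the data preview above):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

feature_dim, hidden_dim, num_layers = 6, 20, 3
emb_rnn = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                  num_layers=num_layers, batch_first=True)

X = torch.randn(4, 100, feature_dim)   # (batch, max_seq_len, features), padded
T = torch.tensor([100, 80, 60, 40])    # true sequence lengths

# enforce_sorted=False gives the PackedSequence a non-None unsorted_indices,
# which torch_npu's lstm_forward passes on to permute_hidden when returning.
X_packed = pack_padded_sequence(X, T, batch_first=True, enforce_sorted=False)
H_o, (h_n, c_n) = emb_rnn(X_packed)    # works on CPU, fails on NPU as in the traceback
H_o, _ = pad_packed_sequence(H_o, batch_first=True)
print(H_o.shape, h_n.shape)            # CPU: torch.Size([4, 100, 20]) torch.Size([3, 4, 20])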

So how should this be resolved?

This issue is still under internal discussion and pending a decision; for now we suggest working around it at the model code level.
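
One possible model-level workaround, sketched under the assumption that the embedder currently packs X with pack_padded_sequence before calling self.emb_rnn: skip the PackedSequence path entirely, feed the padded tensor to the LSTM, and mask the padded time steps by hand. The helper below is hypothetical, not part of the model or of torch_npu:

import torch

def lstm_forward_padded(emb_rnn, X, T):
    # X: (batch, max_seq_len, feature_dim) padded tensor
    # T: per-sample true sequence lengths (list or 1-D tensor)
    H_o, (h_n, c_n) = emb_rnn(X)        # plain padded input, no PackedSequence
    lengths = torch.as_tensor(T, device=X.device)
    # Zero the outputs at padded time steps so downstream losses ignore them.
    mask = torch.arange(X.size(1), device=X.device)[None, :] < lengths[:, None]
    H_o = H_o * mask.unsqueeze(-1)
    return H_o, (h_n, c_n)

Note that without packing, h_n and c_n correspond to the last padded time step rather than each sequence's true last step; if the model uses H_t downstream, the hidden state would need to be gathered at index T-1 per sample instead. Whether that matters depends on how the TimeGAN losses use these outputs.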

郭夏 changed task status from TODO to Analysing 3 years ago
郭夏 changed task status from Analysing to Feedback 3 years ago
