Ascend / modelzoo

[Zhongzhi] [Xi'an Jiaotong University] [ID2101] [PixelCNN++] Test accuracy on NPU differs significantly from GPU

DONE
Bug-Report
Created on
2021-11-15 16:54

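1. Symptom (with error log context):
The loss curve of the model on the GPU is shown below:
(image: loss curve on GPU)
The loss curve on the NPU is shown below:
(image: loss curve on NPU)
2. Software versions:
-- TensorFlow version: 1.15
-- Python version: 3.7.5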

3. Test steps:
1. Use the latest ModelArts plugin for PyCharm.
2. Image Path (SWR-URL): ascend-share/5.0.3.alpha005_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1101
3. NPU: 1 x Ascend 910; CPU: 24 vCPUs, 96 GB.

4. Log information:
Console log:
https://wbh2868-pixelcnn.obs.cn-north-4.myhuaweicloud.com:443/output/npu_train_out/MA-new-pixel-cnn-master_npu_allow_mix_precision/log/modelarts-job-20421ee1-fdbf-4c56-aa02-56eedd1b2b0f-worker-0.log?AccessKeyId=HXAKTDLH0ZYQTMMRRWZZ&Expires=1652518290&Signature=NRd4YqtRR6rqlO6jppsMTYqI4LM%3D

Computation graph file:
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=bV/cU7kXGpsBjqXSPEcIM+AmQi28RY4bapHWbgd1UmdNTSrUxuKtu45wHm8lYwSald+C0PEAvPVDwM69DnNw+XpRzbGHJ62rZAW/wNNI0klVXJMPrvD226tPg+ha+qvJ2msWHjEjA0NrJw03gam9FyFp923Ohq1+PcbhkVGK8IS3O1SWF9HeRBnfJX30L1ZldpzwqdNn6pWq2yYY+uHJ9OXEK4xIUv8eH0xPKNPZ6QwwOH7/4Pw1dh3D94M66hOozBtCAslElqYOq473a1PrNziq77fZn3zJdNyvcPCU8F2T9GpT1y8jEyUi0J5/EnIinGPrVk8gL7waeffl8/dgySeNDTYOeuQH9Wy9VuIhQHlLmUmB6GDmUAqDqMHoISUFr/UvZtrrGWhVOEgk+a7CRJ+gV2VJNg9H5Lyt6Ck3Gm+mN59wIO7evzntRD6Ew1Q8KMC6fmyFeP6LMOLuOcjwNY5YYjir0GLoNoRMU/6JnAQPP6/wBR2+MPC/Wm7YH8bn38V+fvKCgQSt518ulU6xwOGoUdPykZnEduhr/c1RGFhQvdY+TQte6QPNNH1lkWnQAdArYQJUTF4eKanHMjLM0CMTa1Wn3Nxalbj/ZwiFdPsNRMHPgieeaJwmkeQjZNbL7FxE+HjjST+0W6fK91xwJIt1VsSSlEOWYRN39MyAWCzhAIzAwzXS+tOpJzURFC1Ln1NKSWWRrP0be1bbm0cMgchEBoWjMmy73gNe8R6FFXZhQyyI2/Y0J5Xw2xC8bNmeuri9s6cVtMEyA/qMqDps9iz28+nYw+nWfvQFPF/9fuN+VhoghQhZRqXg90gggKmwSHNgoHGjWHICdMpDTVSi/4OgItPibXybGIgusFqU947DryDaniXapLsrTqjbfQStcd0zsALL9k14EiKJtFYnCP2DZR3B6rBdYTxxJacgIzbvn10Yr5L0CU7fhUNY4eBflohavyVi4jtRt/pE1u8YtQ3vaw28TWxjh0jLPiFtWIxQZ7/5dgLAB+Tywrz2FjN8VePufMaFeaSADGHUmOgvVECgj+a/Re2r6fbb8TskiTxBJhpVH9NyH+t3EwqO+vx+l22XrAuU67xQxRpD8ZEMxO/DQudQ0F1h6XFpNHEHRxA43C1ACZompvOEhfPxLcFpOsnfUNvVN926Apnkal95zA4+G5cNypOxjro8iLAZ5JZ4ixgtZYnjSZR6hDaEiNoO2hiCKNzF6RwimPA4HCNYX7rIt2uzROnc+Z3q1zmkju5ifY4jgT83hL+541CySybX

Extraction code:
123456
*Valid until: 2022/05/14 16:52:50 GMT+08:00

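Comments (6)

WBH created the Bug-Report
wangxiaodan1103 changed the task status from TODO to Analysing
wangxiaodan1103 set the assignee to 张晓龙
张晓龙 added collaborator 张晓龙
张晓龙 changed the assignee from 张晓龙 to hanfuwei
张晓龙 removed collaborator 张晓龙
张晓龙 added collaborator 张晓龙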

Hello. As a first step, go through the overflow detection topic and check whether any operator overflows.
https://support.huaweicloud.com/tfmigr-cann504alpha1training/atlasmprtg_13_0042.html

As a second step, try disabling the fusion rules and rerunning the job. If the loss then matches the GPU, the fusion rules are the cause.
https://support.huaweicloud.com/fusionref-cann504alpha1training/atlasrr_30_0001.html
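
The two steps above could be prepared roughly as in the following sketch. It is not taken from this thread: the option names (precision_mode, enable_dump_debug, dump_path, dump_debug_mode, fusion_switch_file) and the JSON layout of the fusion switch file are assumptions based on the CANN documentation linked above and should be checked against the exact CANN version in use. The helper names and paths are hypothetical.

```python
import json
import os
import tempfile


def write_fusion_switch_file(path):
    """Write a fusion switch file that turns all fusion rules off.

    Assumption: the JSON layout below matches the CANN release in use.
    """
    switch = {
        "Switch": {
            "GraphFusion": {"ALL": "off"},
            "UBFusion": {"ALL": "off"},
        }
    }
    with open(path, "w") as f:
        json.dump(switch, f, indent=4)
    return switch


def build_npu_debug_options(dump_path, fusion_switch_path):
    """Collect the NPU options for overflow detection with fusion disabled."""
    return {
        "precision_mode": "allow_mix_precision",   # mode used in this issue
        "enable_dump_debug": True,                  # turn on overflow detection
        "dump_path": dump_path,                     # where overflow data is saved
        "dump_debug_mode": "all",                   # assumed value, see CANN docs
        "fusion_switch_file": fusion_switch_path,   # file written above
    }


def apply_options(custom_op, options):
    """Copy the options into a TensorFlow 1.15 NpuOptimizer custom_op."""
    for key, value in options.items():
        if isinstance(value, bool):
            custom_op.parameter_map[key].b = value
        else:
            custom_op.parameter_map[key].s = value.encode("utf-8")


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    cfg_path = os.path.join(out_dir, "fusion_switch.cfg")
    write_fusion_switch_file(cfg_path)
    print(build_npu_debug_options(os.path.join(out_dir, "dump"), cfg_path))
    # In a TensorFlow 1.15 + npu_bridge session script this would be used as:
    #   config = tf.ConfigProto()
    #   custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
    #   custom_op.name = "NpuOptimizer"
    #   custom_op.parameter_map["use_off_line"].b = True
    #   apply_options(custom_op, build_npu_debug_options(dump_dir, cfg_path))
```

Running the job once with only overflow detection enabled, and once with the fusion switch file added, keeps the two checks separate, so a change in the loss curve can be attributed to one of them.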

OK, thanks, I'll give it a try.

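Hello. When I run a training job on ModelArts, I now keep getting the message "/home/ma-user/modelarts/ma-training-toolkit: Permission denied", followed by a "registered signal handler" message, and the training job never runs properly. I don't know what the problem is.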

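Hello. Following your suggestion, I used the latest image (ascend-share/5.0.4.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1207). Training in mixed-precision mode still shows operator overflow, and disabling the fusion rules earlier did not improve accuracy.

The log is here: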
https://wbh2868-pixelcnn.obs.cn-north-4.myhuaweicloud.com:443/output/npu_train_out/MA-new-pixel-cnn-master_npu_overflow_detect/log/modelarts-job-7c978e7a-4284-40ad-aca5-04c2c0d30e90-worker-0.log?AccessKeyId=FH7STWYX1HSCOLV0WQ2S&Expires=1670815957&Signature=Pb9owZSzStQdVjxlRsdJ5L%2BSGqY%3D

The computation graph file is here:
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=xXi7n6BPQZNez1vfj8XM+dRQx4+dg4GdffFp6V1gPR7jy2H6p8KUn6uc1WKYLU0Mt482KlmsDjkkfEoa6XFRKF2l+Q3dEOe7MnB9IK9DWFG/s85Jggrs7PEq3GT62z8E4u+woTLD7Fba+IMCH1rnSwkgWW3XrW/dblOey+WiAUaXn8xn8URNwWBx9qh5ygRW1MkWO2zJYzr2v7KRt/u4B0InxsKuT8lk0qNPXWXaT3RctFoiKPWkZk6Azw7j0SJlU4NZlvjW/bvFMyTa8Q/O6l6xLyoJd/Zxvvr5rGpV5hrjetoYttAZjFqwNtVgEWINfH0NDpXYQ0jjYoNswwKZAI2G4rdWiHvZEuoxpmhpn7bEX88dng3BxILwEzKGw9i4aihbAvEAtt/84JYLrv07sTdl8wsK2ujmIKzZSna81xCQ+LfEg0H/xPO5HBYYULR4/id46ET0kkOgf3k0p7M/iM8wRu6XpMC4NcqI4rxXxzbVjrsAaWlVhutVflgZ4oksAUuzkHhQgcnsJuryHBZ5HhfA54UjuKah4pIL4i+OOCiC22dmonHdNqjz6vEfky8dE95VoP45tf43cmU33l+ESXHNIWhBzqFCU+Y/csZTkNjWrGZyxeCCVdFNSw1Wzp2UnhGRHd5txou1ytRSy8+MjggLZcA2Wo0oKoL0KSJuyQ2uOAKsb+UJqinyWevfLjsvMIPr6bBEHxJwvh1hTUmLS9EVpA6AHn12IbtwCOs4oZM5mCaTRIA/80AJ6WhvFK+x86ajBvZ2+4nOKSGqFYf6amQO/8QGgUY9dGoQXOY5+nghjCZIXnR+oyEnZaK3d/YO7MhWnAnVEhzawWw46nAQrw+URBlyksB9XF4bY1eLMCL3UCcgwO2cAlSXmw07VRZ/v5x1gzBWPBZbINpD/OY2gLVlDXHUabUsJ35U9NO0VD9NkCzVFYzdg6p0rWfoykd1kB5bIaWccy9+EPXtzjEqXy3MeVLRslLWsqpBPcpDZze7PCjcB7ohlkmIYovoQ26zMF5YcnrhpHj9cmx9NnHf1CVHRJtZLZhCz755hqD+ip9hOD0iDpioLdMMzDCKxhjnYby67n8DNpTDPzcXxksTY7xKCmWs9D1nYIeWlBb35SyWxVTnZ3ytBnP/xtky8QoM

Extraction code:
111111

*Valid until: 2022/12/12 11:35:40 GMT+08:00

The saved operator overflow files are here:
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=xXi7n6BPQZNez1vfj8XM+dRQx4+dg4GdffFp6V1gPR7jy2H6p8KUn6uc1WKYLU0Mt482KlmsDjkkfEoa6XFRKF2l+Q3dEOe7MnB9IK9DWFG/s85Jggrs7PEq3GT62z8E4u+woTLD7Fba+IMCH1rnSwkgWW3XrW/dblOey+WiAUaXn8xn8URNwWBx9qh5ygRW1MkWO2zJYzr2v7KRt/u4B0InxsKuT8lk0qNPXWXaT3SLUIZwf5IrkXbdsjWjgO9VbLQEp4/K18FyIIQwkAzoNW6Pzzw5rZc+e3tuddfp+s3VUOqTmUs5FRH5rK7YQNxgguzhMFwv/IPyYSJ80FmwewegspfU3VImHesE9T9XX/GjLrwc7GfoGqOMb2LtuQIwsrc46sMC/Wxi6jsgJB78RgIkSAHLjBp+HKRCFXhROaemFzKrzBGajle3TxmHAWsBWoRP/8u/g1H12mDuJvk/bq7HCV90EJs/lXC3YlB1LvcsH2wuF1qiYszD9NWglLjCIswjGII1cz2NFNnlVSStmyai4krG5smNKkbYZXM4NOD/t/ZIyg5jhbKvErjqxtuZZT24DVc/ymEAU/VArdtH7mBM3eaQd3ywDx3eFmEACHBYPltK7NE7RSkJYow55D0vvTJfpo/NSUKXihColGwA1jb+itWZ4a7tO7QPduNoI421aunXP3hJRuw7y2+C+YcC7zc8yBG18WIozmoMLWPKM0I8vWCKiCPB17gSGyitBV74WuRk35G5Y4pUyd94gyAzXOhmY8AtnjIGxlrO9OtqHTZO0uLoMTllc2nTAJEk004v+3lMcU3Ex8aI2fA8vDaDU7+F82Cbjemo+AGulXtzuQA2Jjjf2iuuxz0J8E9qt2H+qujFZIRnUx6Kt6OWduLmZSv8SiwHdIcRqenvloqmoTK76r3vX62qYZYS1ynMMX0=

Extraction code:
111111

*Valid until: 2022/12/12 11:31:17 GMT+08:00
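
As background for the overflow report above, here is a small sketch of why float16 arithmetic under mixed precision overflows easily. It only illustrates the float16 range (largest finite value 65504) and the usual max-subtraction form of log-sum-exp. It does not identify which operator overflowed in this job, and the helper names are hypothetical.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value):
    """Return True if a finite float can be stored as float16 without overflow."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except OverflowError:
        return False
    return True


def stable_log_sum_exp(values):
    """log(sum(exp(v))) computed by subtracting the maximum first."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


if __name__ == "__main__":
    for x in (11.0, 12.0):
        print(x, math.exp(x), fits_in_fp16(math.exp(x)))
    print(stable_log_sum_exp([12.0, 12.0]))
```

A direct exp(12.0) is already outside the float16 range, while the max-subtracted form only ever exponentiates values that are at most zero, so its intermediates stay small.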

This problem has been fixed. Please try the image ascend-share/5.0.3.alpha005_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-2.0.12_1116.

李想 edited the description
李想 edited the title
颜亚文 changed the task status from Analysing to DONE
吴定远 changed the linked repository from Ascend/modelzoo-his to Ascend/modelzoo
