NPU training fails partway through when mixed precision is enabled

Status: DONE
Type: Training issue
Created: 2021-12-14 21:55

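1. Symptom (with error log context):
Training runs to completion when mixed precision is disabled. With mixed precision enabled, training frequently fails and stops partway through, usually at around epoch 120; the epochs before that run normally.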

2. Software versions:
TensorFlow 1.15
Python 3.7.5
Image used:
ascend-share/5.0.4.alpha002_tensorflow-ascend910-cp37-euleros2.8-aarch64-training:1.15.0-21.0.2_1207

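3. Steps to reproduce:
The program can be run directly via main.py. Because I launch NPU training on ModelArts from PyCharm, the project also contains a boot_modelarts file.
To test the program, three places need to be changed:
On line 19 of main.py, set dataroot to the path where the dataset is stored
On line 36 of main.py, set logdir to the path for output files
On line 63 of masf_func, set WEIGHTPATHS to the path of the pretrained weights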
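
For anyone reproducing this, the three path changes above could be collected in one place instead of editing the source lines by hand. The following is only a minimal sketch under assumptions: the function name parse_paths and the command-line flag names are hypothetical and are not part of the posted program, and the paths shown are placeholders. Only the names dataroot, logdir, and WEIGHTPATHS come from the issue.

```python
import argparse


def parse_paths(argv=None):
    """Parse the three paths that the issue says must be edited by hand.

    The flag names are hypothetical; they mirror dataroot and logdir in
    main.py and WEIGHTPATHS in masf_func as described in the issue.
    """
    parser = argparse.ArgumentParser(description="Paths for the reproduction run")
    parser.add_argument("--dataroot", required=True, help="directory holding the dataset")
    parser.add_argument("--logdir", required=True, help="directory for output files")
    parser.add_argument("--weightpaths", required=True, help="path to the pretrained weights")
    return parser.parse_args(argv)


# Placeholder paths for illustration only.
args = parse_paths([
    "--dataroot", "/path/to/dataset",
    "--logdir", "/path/to/output",
    "--weightpaths", "/path/to/pretrained_weights",
])
print(args.dataroot, args.logdir, args.weightpaths)
```

The parsed values would then have to be passed to wherever main.py and masf_func currently hard-code these paths; how that is wired up depends on code that is not shown in this issue.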

4. Log output:
Error Message is :
EZ9999: Inner Error!
The device(0), core list[0-0], error code is:[FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:417]
coreId( 0): 0x800000 [FUNC:PrintCoreInfoErrMsg][FILE:device_error_proc.cc][LINE:428]
Aicore kernel execute failed, device_id=0, stream_id=1271, report_stream_id=1286, task_id=67, fault kernel_name=0_151_gradients_1/model/truediv_94_grad/Sum_1, func_name=te_reducesumd_83ea292cf87215d9cb8c1c225a7616884663e98411cb3ec855885931462ef8fd_6d6fa7fefba29d16_0__kernel0, program id=5666, hash=11804509405317503013[FUNC:GetError][FILE:stream.cc][LINE:712]
Stream synchronize failed, stream = 0xfff561f00720[FUNC:StreamSynchronize][FILE:logger.cc][LINE:270]
rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:39]
invoke rtStreamSynchronize failed, ret = 507011[FUNC:Synchronize][FILE:hybrid_execution_context.cc][LINE:87]
failed to execute graph. model_id = 18[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:220]

 [[{{node GeOp49_0}}]]
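
The fault kernel name in the log, gradients_1/model/truediv_94_grad/Sum_1, points at a reduce-sum inside the gradient of a division. The issue does not state a root cause, so the following is only one possible reading, not a confirmed diagnosis: under mixed precision the gradient of x / y with respect to y, which is -x / y**2, can leave the float16 range when y becomes small. The sketch below illustrates that numeric effect with the Python standard library only; the helper names to_fp16 and truediv_grad_wrt_denominator are hypothetical, and it does not reproduce the NPU kernel itself.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision.

    Values too large for float16 saturate to +/-inf, which is what
    overflow looks like in half-precision arithmetic.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def truediv_grad_wrt_denominator(x: float, y: float) -> float:
    """Gradient of x / y with respect to y, i.e. -x / y**2.

    Every intermediate result is rounded to float16 to mimic a
    half-precision computation.
    """
    x16, y16 = to_fp16(x), to_fp16(y)
    y_sq = to_fp16(y16 * y16)
    if y_sq == 0.0:
        # y**2 underflowed to zero in float16, so the quotient is unbounded.
        return math.copysign(math.inf, -x16)
    return to_fp16(-x16 / y_sq)


# A moderate denominator stays finite in float16.
print(truediv_grad_wrt_denominator(1.0, 0.5))

# A small denominator overflows in float16 but not in double precision.
print(truediv_grad_wrt_denominator(1.0, 0.003))
print(-1.0 / 0.003 ** 2)
```

If this effect is what is happening here, it would be consistent with training running normally for many epochs and failing only once some denominator becomes small, but the log alone does not establish that.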

5. Files:
OBS link for the dataset and the pretrained weights:
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=+Ang3FM+ea8yFRyrtuC1YtByeWBzWWShkNrTxwShcutO2dDITmdx1NkPIPxZqu0xJ6T5N0GL2fg4sM027902p5r7VRmHB1s33N2eYIRDCwmcltigZbUhaZSm0sRPy0TQ1rPZLoyl97Ix2Mhou9VY6scMTUwcLJ8UsT4kZDPsj5MVX5DYrg5aH0kOwQ2JEqIbE2FlMKHCx4XTCnzqIQkBlDbNxn8owxO6e30QK5IoW9p+g7IVZM/03yohS+baPt4TO4xryuWrQVS99xEVskoyMgDNb3memBTfcxxjabwhGQs4s4bfkbtDmcGYEs9T1Cbu0q1pvzRWt2VFT+gHjUsK0Ip/RRg2t7SFpZa1Rt91BZJM2JTB1xCumzoJFixO9SEj8XWIFQp699yeW+XfTxZDl2V4DHgxrIG4SISSPGcthO4sj8vEXC6FBTJxgppbrIYNb+EBKxQsD2hHcm31PMZ9ebVDw1G0sYn4JDdy2tCt3tK0VfyDON+3eYKTaFnFreJXIdcqLdwz1AK5I4pX7Ju7EhgrqrphFJmk8JGFltaxiucgRSX0jNRna1kv334U4Cv+6DAewZBlYZjLg3ly+KFpmB6ZMwKWQ1sj5txruOmWO/wEbsP1IFV33uwIO7jTfj8f2qv2KhSd96Cd9js85EQRPXFDs6lnWo2jgrT18RTmUegCzJ8UwJpBH0KhzmXqvVrDZ0x2Jh+Z0zV6E9rCMC7dsE/pOirpAuTZIT37EThR+aTMGk82yPkGTvC7dW/q9s8+nkaJdMp0e9Q5KlaKhJGrtOpW6oMNSVGTxE/s9AqBjtxKUEdwbUbA4tlK/PiAh28XJu3SnErXIqL/eAoUq+nX9tLjsNI5kUyU9oV1ZWyOftW71YAtgTKvrWSigwSZ2C91+NXBOaDDuP6a2eRJ6SpPxaiGmoQlEsMDPG4rFNDI3EEjoS67YmUT0gVUswfrZ7deObEvYveVSbS9bdGUiuAB5ViYLDifXC90amHGu5PTnshhjG+kobLu2auI5SwdmYt8PXTiMSwTassxsoPvDUIN2A==

Extraction code:
123456

*Valid until: 2022/12/09 21:51:18 GMT+08:00

OBS link for the program files:
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=+Ang3FM+ea8yFRyrtuC1YtByeWBzWWShkNrTxwShcutO2dDITmdx1NkPIPxZqu0xJ6T5N0GL2fg4sM027902p5r7VRmHB1s33N2eYIRDCwmcltigZbUhaZSm0sRPy0TQ1rPZLoyl97Ix2Mhou9VY6scMTUwcLJ8UsT4kZDPsj5MVX5DYrg5aH0kOwQ2JEqIbE2FlMKHCx4XTCnzqIQkBlNH5Bm5+zE+lLra0qOjbg8m6aQRdXyQoGM9ziPQ3HwGSL4a4rNnmwMlAOw1kd04uTuasbqOKzXHd381xWFc+k1AupB/wMKbmvLoC+406vLiygPwpAnGEj6HH3U8xJBiCuXhYOSp7jteJvKsGEhO0FjyeMozewkmdG36aSRWLV1ZMknzVM/McmzqRV8OrQJgFLHWsJ0QALCGQ3Jur8OQ4xEF58DKzBfUPtP4d5kISR3IWJeAgqexDGOoq1yZkySQfyqbnHj8P1dAtq6Y63yvSqHeqATE72eSkvdDCjBEcrwJJ5KDmbPiXPOkDon5j7NO5IU2RFFL6xCDNUmcYopzuZw4=

Extraction code:
123456

*Valid until: 2022/12/09 21:55:01 GMT+08:00

Comments (1)

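chen created the training issue
zhujianpeng set the assignee to 张晓龙
zhujianpeng changed the task status from TODO to Analysing
张晓龙 changed the assignee from 张晓龙 to unassigned
张晓龙 set the assignee to chenhu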

Hello, please fill in your school and the model name.

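chen changed the task status from Analysing to CLOSED
chen changed the task status from CLOSED to Analysing
chen changed the task status from Analysing to DONE
吴定远 changed the linked repository from Ascend/modelzoo-his to Ascend/modelzoo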
