Ascend / modelzoo

[Zhongzhi] [Nanjing University] [ID2053] [ENAS] Training fails with error: Check param failed, kerArgs can not be null

DONE
Bug-Report
Created on 2022-04-03 11:13

1. Symptom (with error log context):

valid_ppl=171.09
epoch=11 step=4800/1242000 ppl=165.86 lr=18.857 |g|=0.272 avg=0 min=142.10-min/step=0.0362
epoch=11 step=4900/1242000 ppl=164.84 lr=18.857 |g|=0.286 avg=0 min=144.71-min/step=0.0261
INFO:tensorflow:Saving checkpoints for 4968 into /cache/result/model.ckpt.
INFO:tensorflow:Saving checkpoints for 4968 into /cache/result/model.ckpt.
epoch=11 step=5000/1242000 ppl=162.90 lr=14.857 |g|=0.311 avg=0 min=147.42-min/step=0.0271
epoch=11 step=5100/1242000 ppl=164.42 lr=18.857 |g|=0.499 avg=0 min=150.09-min/step=0.0267
should_reset:True
valid_ppl=162.29
epoch=12 step=5200/1242000 ppl=176.47 lr=23.429 |g|=0.273 avg=0 min=153.11-min/step=0.0302
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp13_0GEOP::::DoRunAsync Failed
Error Message is :
EE9999: Inner Error!
EE9999 Check param failed, kerArgs can not be null.[FUNC:LoadCpuKernelArgs][FILE:arg_loader.cc][LINE:305]
Failed to load cpu Kernel args , retCode=0x711000e[FUNC:LaunchCpuKernel][FILE:context.cc][LINE:772]
rtAicpuKernelLaunchWithFlag execute failed, reason=[alloc memory error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
Call rt_ret fail, ret: 0x32899[FUNC:LaunchTask][FILE:aicpu_node_executor.cc][LINE:1335]
[Root-Graph] Error:207001 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] Error occurs while launching tasks. quit from preparing nodes.[FUNC:NodeEnqueue][FILE:subgraph_executor.cc][LINE:174]
failed to execute graph. model_id = 4[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:224]

 [[{{node GeOp13_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/fixed.py", line 241, in <module>
tf.app.run()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/fixed.py", line 233, in main
train(params)
File "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/fixed.py", line 140, in train
loss, gn, lr, should_reset, moving_avg_started, _ = sess.run(run_ops)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp13_0GEOP::::DoRunAsync Failed
Error Message is :
EE9999: Inner Error!
EE9999 Check param failed, kerArgs can not be null.[FUNC:LoadCpuKernelArgs][FILE:arg_loader.cc][LINE:305]
Failed to load cpu Kernel args , retCode=0x711000e[FUNC:LaunchCpuKernel][FILE:context.cc][LINE:772]
rtAicpuKernelLaunchWithFlag execute failed, reason=[alloc memory error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
Call rt_ret fail, ret: 0x32899[FUNC:LaunchTask][FILE:aicpu_node_executor.cc][LINE:1335]
[Root-Graph] Error:207001 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] Error occurs while launching tasks. quit from preparing nodes.[FUNC:NodeEnqueue][FILE:subgraph_executor.cc][LINE:174]
failed to execute graph. model_id = 4[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:224]

 [[{{node GeOp13_0}}]]

[ModelArts Service Log]2022-04-03 03:41:38,546 - INFO - Begin destroy training processes
[ModelArts Service Log]2022-04-03 03:41:38,546 - INFO - proc-rank-0-device-0 (pid: 114) has exited
[ModelArts Service Log]2022-04-03 03:41:38,546 - INFO - End destroy training processes
time="2022-04-03T03:41:38+08:00" level=info msg="http Server goroutine is exit" file="server.go:51" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/boot_modelarts.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
time="2022-04-03T03:41:38+08:00" level=info msg="run-with-backoff exit with 0" file="controller.go:159" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/boot_modelarts.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
[2022-04-03T03:41:38+08:00][ModelArts Service Log]exiting...
[2022-04-03T03:41:38+08:00][ModelArts Service Log]exit with 0
[2022-04-03T03:41:39+08:00][ModelArts Service Log][sidecar] training is completed
[2022-04-03T03:41:39+08:00][ModelArts Service Log][sidecar] stop toolkit_obs_upload_by_channels_pid = 42 by signal SIGTERM
time="2022-04-03T03:41:39+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
time="2022-04-03T03:41:39+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=train_url
time="2022-04-03T03:41:39+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
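
The EE9999 error lines in the log above each end with a `[FUNC:...][FILE:...][LINE:...]` suffix. As a reading aid, here is a minimal sketch that splits such lines into message, function, file, and line number so the chain can be read from the innermost failure outward. The regular expression is inferred only from the lines shown in this report, and `parse_error_chain` is a hypothetical helper, not part of CANN or ModelArts.

```python
import re

# Suffix format inferred from the error lines in this report (an assumption,
# not a documented CANN log format).
FRAME_RE = re.compile(
    r"^(?P<msg>.*?)"
    r"\[FUNC:(?P<func>[^\]]+)\]"
    r"\[FILE:(?P<file>[^\]]+)\]"
    r"\[LINE:(?P<line>\d+)\]\s*$"
)


def parse_error_chain(text):
    """Return one dict per log line that carries a FUNC/FILE/LINE suffix."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            frames.append({
                "message": match.group("msg").strip(),
                "func": match.group("func"),
                "file": match.group("file"),
                "line": int(match.group("line")),
            })
    return frames


# Three lines copied from the traceback in this report.
SAMPLE = """\
EE9999 Check param failed, kerArgs can not be null.[FUNC:LoadCpuKernelArgs][FILE:arg_loader.cc][LINE:305]
Failed to load cpu Kernel args , retCode=0x711000e[FUNC:LaunchCpuKernel][FILE:context.cc][LINE:772]
rtAicpuKernelLaunchWithFlag execute failed, reason=[alloc memory error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
"""

frames = parse_error_chain(SAMPLE)
for frame in frames:
    print(f"{frame['file']}:{frame['line']} {frame['func']}: {frame['message']}")
```

Read this way, the chain in this report starts at `LoadCpuKernelArgs` in `arg_loader.cc` and the runtime gives the reason as `alloc memory error`.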

2. Software versions:
-- CANN version: 5.04
-- TensorFlow version: 1.15
-- Python version: 3.7.10
Ascend 910 CPU: vCPUs 96GB
Operating system: Windows 10
Run via the PyCharm ModelArts plugin

3. Test steps:
Run via the PyCharm ModelArts plugin

4. Log information:
[log]
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=hBzUu953FUPYIyO9yBKxu0F/1ofEsOLVvU/ZshrjC89Ejzbk1o27hk4xvxJhJLKetpLfpdRZpyX1ggWNX+u4Es18f4rDHanYg9x24m71X1OcrxUd0M3NPq3gx8enpWujcQB/EfJqS/52jeUerhb4r09eBSiHJKndkSBDRM1bZyXOBqnAnFCTm0UETenHEUFmejQS1cflMm/fdJEGxgJFnubzVGMw+EAMis39ggegxhQ6dlhwVO+N0SAoapqDos35IcnLrWCn6vg21OAbNmpUps3JAagIQ64qlszeqMFhjmwetSIwN+acVgGswvGy0r9fTWUNjgO5aikaJf9Oh0YND/jG5ZhwYro1PhhKK/1sXp3b22jt9njjqfWIkrcjeyx0NHrhd1m5Y/vMyxfXoWSCI4RHULlZleOTvS1+xaAOBBLNNS4DF0MX4L3k3EllR0Petms7fBURa3QXLLnBEfYmevQS5WPgxnrTtM7UHMERSC4XqWRho2PZ3okOIDnOzZKqML/WNwzrylyaAWPpaCr5FWvQLrv4eyC1wBUzbhKZMJUi4lV7/iCvCmHQrJbHkQHiQV9Ka6Km4x4KzmF0b3zEgE8BhKsnQYOeDbCwiORddp/Oj0B005FzDmTu5m3HlZ656xTZEXeICq/c6srD7DXWhGCwIR5lo8G5BeZpMtaepf5cYD4Sjx0QzgxCGwzi44Sv+15z07K809b6J9ujOFznKw==

Extraction code:
111111

*Valid until: 2022/05/03 11:10:49 GMT+08:00

[code]
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=hBzUu953FUPYIyO9yBKxu0F/1ofEsOLVvU/ZshrjC89Ejzbk1o27hk4xvxJhJLKetpLfpdRZpyX1ggWNX+u4Es18f4rDHanYg9x24m71X1OcrxUd0M3NPq3gx8enpWujcQB/EfJqS/52jeUerhb4r09eBSiHJKndkSBDRM1bZyXOBqnAnFCTm0UETenHEUFmejQS1cflMm/fdJEGxgJFnngxnWesnMzkxqUe8MQN+4GgUUObAatHYgJZlFsK+D07yQVAkM/pQxkq6YvxfSOrV+jfxqrj0dr+cqMv9F76+OT2owYcozwe+x4oaLvWrWA2sRdXrIcO/oaikIimhIUScCCVOxFQwSjufZf3AJ6dxfMFEmfCeA0GCPMJ1X24gZ4iGEVrKD8l3Nxe/thGMU+dQ3xkUtefOS78n3TM0EKJTEAvCDawG4Ln4ViFZT181HExi7rVV6+CF67xoub7Q8fcKiYTxBt8A3rIUZwB6hUsv2NYmQmnafHEcKOAp1oHp41t0rdcqm6GDeSRgdDlbFp/eimKZZhD9kFxNcpMqo2Z+UWCvn2e4Q07SXXto2m7w5bY0JMLEmn2VrbWk8jheI0b2fmjHDhbAPv3mThFD8dXdJhRBpBGg6dBc89iCDBN8KB9yWxdwvHLBI9SDhABg/cvF0G2mAXWy+3nWLUt1sqKVzUmdN9Op8gWSY8bYvaNJiruRRAiaRtCFoR+CgyTBsIl8Q==

Extraction code:
111111

*Valid until: 2022/05/03 11:11:21 GMT+08:00

[dataset]
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=hBzUu953FUPYIyO9yBKxu0F/1ofEsOLVvU/ZshrjC89Ejzbk1o27hk4xvxJhJLKetpLfpdRZpyX1ggWNX+u4Es18f4rDHanYg9x24m71X1OcrxUd0M3NPq3gx8enpWujcQB/EfJqS/52jeUerhb4r09eBSiHJKndkSBDRM1bZyXOBqnAnFCTm0UETenHEUFmejQS1cflMm/fdJEGxgJFns0YUbVLOk6pqPAfzUktj8q31yMYEqJ19KtCwOV2IerYNiZkTUmXdOq0QewshiM7zMS7R80sOJdRahczy246Qmr8FO6g+HSVVDIn6dTYJ931KPugjy5oKcK6j4crIbJjsKpmDIMqnBRQBESe95RPcXURS/RnQ/itGa3a/d8d2ew50RVjR1xBgrtytmVJD4SKBXtytLxWd6u+eSD/oyHdEVOVWGPf9lyeuspvEdVbITHjq5iLAe7mbc6SMGaRrjDAr89YZEC8PxAbEETO+3u06mikDH5yFu1CNsgy6sd9b0DbH3gmFPOoSSxz8e9dNs5eo+ucft/ZDz2oVbXzqxG/NyA=

Extraction code:
111111

*Valid until: 2022/05/03 11:13:00 GMT+08:00

[output]
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=hBzUu953FUPYIyO9yBKxu0F/1ofEsOLVvU/ZshrjC89Ejzbk1o27hk4xvxJhJLKetpLfpdRZpyX1ggWNX+u4Es18f4rDHanYg9x24m71X1OcrxUd0M3NPq3gx8enpWujcQB/EfJqS/52jeUerhb4r09eBSiHJKndkSBDRM1bZyXOBqnAnFCTm0UETenHEUFmejQS1cflMm/fdJEGxgJFnjHiPALoS9aX18TMfC0ZqiL9gWzc7jHdEhrMBCApwjNicar5B0MdDRzmy0E39grSqPjqCznl1Ry43HUenqBVNSnVZXToqhLqQgJ+R1iy+vIqIw9823eKPMvpwE5M1r+6XB4N+WwF+ksf+rCGf0AwbuEhh2FIrkCbgmccwkLyT4uDEEKo20KWL/SzYzB6rda65iXOpWy8a0mi1iBK8U1CIX6C040qQVbjpm+BLteLZUWRZ42xkuwJRgO7D+F/Y6TccFgI+nAASw2M32eU6nonoJU12x+XkxGFVKoMkBwWZqlPBYTXyN2I2r7bN1uX9pR1tzZFYKiWH2fw0euB8+2l4pAdnTX/vb0mkux3FGbrcQngS/LNbNLE0qFCnQ0xrwlEv6UTU4TNEhPKhkfgMNfEhUdUOHfLQYXxBeo4yqQU2NVY3iDiMV/e8Bpa5GvS862r3oeOGORRkFU+QvuTQJUJVEBosHZlGLl6ByU/2smEsKHE8aAGZH0hfZN7XJHFrEkjXw==

Extraction code:
111111

*Valid until: 2022/05/03 11:12:14 GMT+08:00

Comments (2)

王培亮 created the Bug-Report
wangxiaodan1103 changed the task status from TODO to Analysing
wangxiaodan1103 set the assignee to chenhu

The model has passed functional verification.

颜亚文 changed the task status from Analysing to DONE
