1. Problem description (with error log context):
valid_ppl=171.09
epoch=11 step=4800/1242000 ppl=165.86 lr=18.857 |g|=0.272 avg=0 min=142.10-min/step=0.0362
epoch=11 step=4900/1242000 ppl=164.84 lr=18.857 |g|=0.286 avg=0 min=144.71-min/step=0.0261
INFO:tensorflow:Saving checkpoints for 4968 into /cache/result/model.ckpt.
INFO:tensorflow:Saving checkpoints for 4968 into /cache/result/model.ckpt.
epoch=11 step=5000/1242000 ppl=162.90 lr=14.857 |g|=0.311 avg=0 min=147.42-min/step=0.0271
epoch=11 step=5100/1242000 ppl=164.42 lr=18.857 |g|=0.499 avg=0 min=150.09-min/step=0.0267
should_reset:True
valid_ppl=162.29
epoch=12 step=5200/1242000 ppl=176.47 lr=23.429 |g|=0.273 avg=0 min=153.11-min/step=0.0302
Traceback (most recent call last):
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GeOp13_0GEOP::::DoRunAsync Failed
Error Message is :
EE9999: Inner Error!
EE9999 Check param failed, kerArgs can not be null.[FUNC:LoadCpuKernelArgs][FILE:arg_loader.cc][LINE:305]
Failed to load cpu Kernel args , retCode=0x711000e[FUNC:LaunchCpuKernel][FILE:context.cc][LINE:772]
rtAicpuKernelLaunchWithFlag execute failed, reason=[alloc memory error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
Call rt_ret fail, ret: 0x32899[FUNC:LaunchTask][FILE:aicpu_node_executor.cc][LINE:1335]
[Root-Graph] Error:207001 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] Error occurs while launching tasks. quit from preparing nodes.[FUNC:NodeEnqueue][FILE:subgraph_executor.cc][LINE:174]
failed to execute graph. model_id = 4[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:224]
[[{{node GeOp13_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/fixed.py", line 241, in <module>
tf.app.run()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/fixed.py", line 233, in main
train(params)
File "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/fixed.py", line 140, in train
loss, gn, lr, should_reset, moving_avg_started, _ = sess.run(run_ops)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GeOp13_0GEOP::::DoRunAsync Failed
Error Message is :
EE9999: Inner Error!
EE9999 Check param failed, kerArgs can not be null.[FUNC:LoadCpuKernelArgs][FILE:arg_loader.cc][LINE:305]
Failed to load cpu Kernel args , retCode=0x711000e[FUNC:LaunchCpuKernel][FILE:context.cc][LINE:772]
rtAicpuKernelLaunchWithFlag execute failed, reason=[alloc memory error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
Call rt_ret fail, ret: 0x32899[FUNC:LaunchTask][FILE:aicpu_node_executor.cc][LINE:1335]
[Root-Graph] Error:207001 occurred while executing graph.[FUNC:OnError][FILE:subgraph_context.cc][LINE:168]
[Root-Graph] Error occurs while launching tasks. quit from preparing nodes.[FUNC:NodeEnqueue][FILE:subgraph_executor.cc][LINE:174]
failed to execute graph. model_id = 4[FUNC:HandleResult][FILE:hybrid_model_async_executor.cc][LINE:224]
[[{{node GeOp13_0}}]]
[ModelArts Service Log]2022-04-03 03:41:38,546 - INFO - Begin destroy training processes
[ModelArts Service Log]2022-04-03 03:41:38,546 - INFO - proc-rank-0-device-0 (pid: 114) has exited
[ModelArts Service Log]2022-04-03 03:41:38,546 - INFO - End destroy training processes
time="2022-04-03T03:41:38+08:00" level=info msg="http Server goroutine is exit" file="server.go:51" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/boot_modelarts.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
time="2022-04-03T03:41:38+08:00" level=info msg="run-with-backoff exit with 0" file="controller.go:159" Args="[/home/ma-user/anaconda/bin/python /home/ma-user/modelarts/run/davincirun.py /home/ma-user/anaconda/bin/python /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/boot_modelarts.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/]" Command=run-with-backoff Component=ma-training-toolkit Platform=ModelArts-Service TaskID=worker-0
[2022-04-03T03:41:38+08:00][ModelArts Service Log]exiting...
[2022-04-03T03:41:38+08:00][ModelArts Service Log]exit with 0
[2022-04-03T03:41:39+08:00][ModelArts Service Log][sidecar] training is completed
[2022-04-03T03:41:39+08:00][ModelArts Service Log][sidecar] stop toolkit_obs_upload_by_channels_pid = 42 by signal SIGTERM
time="2022-04-03T03:41:39+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=log_url
time="2022-04-03T03:41:39+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=train_url
time="2022-04-03T03:41:39+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:214" Command=obs/upload_by_channels Component=ma-training-toolkit Platform=ModelArts-Service Task=srt_log_collection
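For readers triaging the error stack in the log above, the chain can be summarized mechanically. The following is a minimal sketch only: `summarize_error_stack` is a hypothetical helper and not part of CANN, TensorFlow, or ModelArts, and the `[FUNC:...][FILE:...][LINE:...]` and `reason=[...]` layouts are assumed purely from the lines shown in this log. In that log, the only line carrying a `reason=` field reports `alloc memory error`.

```python
import re

# Assumed layout, inferred from this log only: each error line ends with
# "[FUNC:<name>][FILE:<file>][LINE:<number>]".
TAG_RE = re.compile(
    r"\[FUNC:(?P<func>[^\]]+)\]\[FILE:(?P<file>[^\]]+)\]\[LINE:(?P<line>\d+)\]"
)
# Some lines also carry a "reason=[...]" field before the tags.
REASON_RE = re.compile(r"reason=\[(?P<reason>[^\]]+)\]")


def summarize_error_stack(log_text):
    """Return (frames, reasons) extracted from the tagged error lines."""
    frames = []
    reasons = []
    for raw in log_text.splitlines():
        tag = TAG_RE.search(raw)
        if tag is None:
            continue  # not a tagged error line (e.g. a Python traceback line)
        frames.append({
            "message": raw[:tag.start()].strip(),
            "func": tag.group("func"),
            "file": tag.group("file"),
            "line": int(tag.group("line")),
        })
        reason = REASON_RE.search(raw)
        if reason is not None:
            reasons.append(reason.group("reason"))
    return frames, reasons


# Two lines copied from the error stack in this report.
SAMPLE = (
    "EE9999 Check param failed, kerArgs can not be null."
    "[FUNC:LoadCpuKernelArgs][FILE:arg_loader.cc][LINE:305]\n"
    "rtAicpuKernelLaunchWithFlag execute failed, reason=[alloc memory error]"
    "[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]\n"
)

if __name__ == "__main__":
    frames, reasons = summarize_error_stack(SAMPLE)
    for frame in frames:
        print("{file}:{line} {func}: {message}".format(**frame))
    print("reasons:", reasons)
```

Running the helper over the full job log would list each tagged frame once per occurrence, so the repeated stack in the second traceback would appear twice unless duplicates are filtered.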
2. Software versions:
-- CANN version: 5.04
-- TensorFlow version: 1.15
-- Python version: 3.7.10
Ascend 910 CPU: vCPUs 96GB
Operating system: Windows 10
Run via the PyCharm ModelArts plugin
3. Test steps:
Run via the PyCharm ModelArts plugin
Extraction code:
111111
*Valid until: 2022/05/03 11:10:49 GMT+08:00
Extraction code:
111111
*Valid until: 2022/05/03 11:12:14 GMT+08:00
@王培亮 Hi, running with the April 15 image should work. Please try it yourself and ask again if you run into any problems. Image list: https://gitee.com/ascend/docs-openmind/blob/master/guide/modelzoo/tensorflow_model/tutorials/ModelArts%E5%B9%B3%E5%8F%B0CANN%E8%87%AA%E5%AE%9A%E4%B9%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8.md#https://gitee.com/link?target=https%3A%2F%2Fascend.huawei.com%2F%23%2Fsoftware%2Fcann%2Fcommunity
The model functionality check has now passed.