
Ascend / modelzoo

【Crowd Intelligence】【Nanjing University】【ENAS】EE3001: The process has lost connection between the host and device. This might be caused by execution timeout of particular operators or unstable connection...

DONE
Bug-Report
Created on 2021-11-25 22:23

1. Problem description (error log context attached):
The log indicates a connection problem between the host and the device:
2021/11/25 22:08:47 job status is Creating please wait.
2021/11/25 22:08:57 job status is Queuing please wait.
2021/11/25 22:09:08 job status is Downloading please wait.
2021/11/25 22:09:19 job status is Downloading please wait.
2021/11/25 22:09:29 job status is Downloading please wait.
2021/11/25 22:09:39 job status is Downloading please wait.
2021/11/25 22:09:49 job status is Downloading please wait.
2021/11/25 22:10:00 job status is Downloading please wait.
2021/11/25 22:10:10 job status is Downloading please wait.
2021/11/25 22:10:20 job status is Downloading please wait.
2021/11/25 22:10:30 job status is Downloading please wait.
2021/11/25 22:10:41 job status is Downloading please wait.
2021/11/25 22:10:51 job status is Downloading please wait.
2021/11/25 22:11:01 job status is Downloading please wait.
2021/11/25 22:11:11 job status is Downloading please wait.
2021/11/25 22:11:21 job status is Running please wait.
[ModelArts Service Log][INFO][2021/11/25 22:10:54]: cache the content of [data_url] inputs successfully
[ModelArts Service Log][INFO][2021/11/25 22:10:54]: it can be accessed at local dir [/home/ma-user/modelarts/inputs/data_url_0]
[ModelArts Service Log][INFO][2021/11/25 22:10:55,786]: mkdir for local output dir
[ModelArts Service Log][INFO][2021/11/25 22:10:55,786]: output-handler finalized
[ModelArts Service Log][init] exiting at 2021-11-25-22:10:55
[ModelArts Service Log][init] upload_metrics_pid = 449
[ModelArts Service Log][init] stop toolkit_obs_upload_pid = 53 by signal SIGTERM
time="2021-11-25T22:10:56+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload 53 ret_code is 0
[ModelArts Service Log][init] exit with 0
time="2021-11-25T22:10:58+08:00" level=info msg="run command: mkdir -p ~/.pip; echo -e '[global]\ntrusted-host = repo.myhuaweicloud.com\nindex-url = http://repo.myhuaweicloud.com/repository/pypi/simple/' > ~/.pip/pip.conf; bash /home/ma-user/modelarts/run/run_train_v2.sh /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ ; ret=$?; echo $ret > /home/ma-user/modelarts/retCode; exit $ret" file="run_train.go:169" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log]user: uid=1000(HwHiAiUser) gid=1000(HwHiAiUser) groups=1000(HwHiAiUser)
[ModelArts Service Log]pwd: /home/work
[ModelArts Service Log]boot_file: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py
[ModelArts Service Log]command: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/
[ModelArts Service Log]local_code_dir: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src
[ModelArts Service Log]training start at 2021-11-25-22:10:59
[ModelArts Service Log]skip install modelarts training system python packages, due to it's customized image
[ModelArts Service Log]you may install them if necessary
/home/ma-user/modelarts/user-job-dir
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
time="2021-11-25T22:10:59+08:00" level=info msg="upload command: /home/ma-user/training/sidecar.sh" file="upload.go:38" Command=bootstrap/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload_job_pid = 32
[ModelArts Service Log][sidecar] toolkit_obs_upload_pid = 34
[ModelArts Service Log][sidecar] running at 2021-11-25-22:10:59
[ModelArts Service Log][sidecar] outputs_handler_job_pid = 60
[ModelArts Service Log][sidecar] outputs_handler_pid = 61
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_job_pid = 75
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 77
[ModelArts Service Log][sidecar] waiting for training complete
time="2021-11-25T22:10:59+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="obs dir = s3://modelarts-training-log-cn-north-4/7bde9cc4-6f48-4551-b298-afc066cc53c8/worker-0" file="upload.go:42" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:52" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="local dir = /home/ma-user/modelarts/outputs/train_url_0/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-11-25-22-07/output/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
[ModelArts Service Log]2021-11-25 22:10:59,983 - INFO - Ascend Driver: Version=21.0.2
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - Davinci training command
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - ['/home/ma-user/anaconda/bin/python', '/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py', '--data_url=/home/ma-user/modelarts/inputs/data_url_0/', '--train_url=/home/ma-user/modelarts/outputs/train_url_0/']
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - Wait for Rank table file ready
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - Rank table file (K8S generated) is ready for read
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO -
{
"status": "completed",
"group_count": "1",
"group_list": [
{
"group_name": "worker",
"device_count": "1",
"instance_count": "1",
"instance_list": [
{
"pod_name": "ma-job-7bde9cc4-6f48-4551-b298-afc066cc53c8-worker-0",
"server_id": "192.168.0.88",
"devices": [
{
"device_id": "4",
"device_ip": "192.1.210.98"
}
]
}
]
}
]
}
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO - Rank table file (V1)
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO -
{
"status": "completed",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "192.168.0.88",
"device": [
{
"device_id": "4",
"device_ip": "192.1.210.98",
"rank_id": "0"
}
]
}
]
}
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO - Rank table file (V1) is generated
[ModelArts Service Log]2021-11-25 22:10:59,986 - INFO - Current server
[ModelArts Service Log]2021-11-25 22:10:59,986 - INFO -
{
"server_id": "192.168.0.88",
"device": [
{
"device_id": "4",
"device_ip": "192.1.210.98",
"rank_id": "0"
}
]
}
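For reference, the "Current server" JSON printed above can be parsed with a few lines of stdlib Python to recover the logical device id and rank id that the job runs with (a minimal sketch; the dict literal simply mirrors the log output, and the helper function name is ours, not from the repo):

```python
import json

# "Current server" entry exactly as printed in the log above
current_server = json.loads("""
{
  "server_id": "192.168.0.88",
  "device": [
    {"device_id": "4", "device_ip": "192.1.210.98", "rank_id": "0"}
  ]
}
""")

def device_info(server):
    """Return (device_id, rank_id) of the first device entry as ints."""
    dev = server["device"][0]
    return int(dev["device_id"]), int(dev["rank_id"])

dev_id, rank_id = device_info(current_server)
print(dev_id, rank_id)  # 4 0
```

Note that the rank table lists logical device id 4, while the later AICPU error reports `device_id=0`; as the log itself advises, the logical id should be read from the `ASCEND_DEVICE_ID` environment variable rather than assumed.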
[ModelArts Service Log]2021-11-25 22:10:59,987 - ERROR - Route plan so files not found. Please check files in /usr/local/route
[ModelArts Service Log]2021-11-25 22:10:59,988 - INFO - bootstrap proc-rank-0-device-0
[ModelArts Service Log]2021-11-25 22:10:59,997 - INFO - proc-rank-0-device-0 (pid: 81)
time="2021-11-25T22:11:00+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-11-25T22:11:00+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-11-25-22-07/log/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-11-25T22:11:00+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
[ModelArts Service Log][INFO][2021/11/25 22:11:01,011]: registered signal handler
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.


Path /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search does not exist. Creating

Logging to /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/stdout
Controller has 41344 params
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:150: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
W1125 22:11:15.285624 281473482445168 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:150: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.

Building LM
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/input.py:198: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
W1125 22:11:16.354206 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/input.py:198: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.

Building model params
All children have 16560000 params
Each child has 2880000 params

Building train graph
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1125 22:11:17.449046 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Building valid graph
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:43: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
W1125 22:11:19.832743 281473482445168 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:43: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=10; total_num_replicas=1
I1125 22:11:19.833111 281473482445168 sync_replicas_optimizer.py:188] SyncReplicasV2: replicas_to_aggregate=10; total_num_replicas=1
INFO:tensorflow:Create CheckpointSaverHook.
I1125 22:11:22.893970 281473482445168 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1125 22:11:35.191197 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2021-11-25 22:11:35.464592: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-11-25 22:11:35.474593: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xaaaae73479b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-25 22:11:35.474649: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-11-25 22:11:44.270727: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:1
INFO:tensorflow:Running local_init_op.
I1125 22:11:58.227085 281473482445168 session_manager.py:500] Running local_init_op.
2021-11-25 22:11:58.399832: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:21
INFO:tensorflow:Done running local_init_op.
I1125 22:11:58.457854 281473482445168 session_manager.py:502] Done running local_init_op.
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py:882: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
W1125 22:11:58.648907 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py:882: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
2021-11-25 22:11:58.919788: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:41
2021-11-25 22:11:59.097355: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:51
I1125 22:12:21.824146 281473482445168 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/model.ckpt.
2021-11-25 22:12:22.223409: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:61
2021-11-25 22:12:32.528275: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:81
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
EE3001: The process has lost connection between the host and device. This might be caused by execution timeout of particular operators or unstable connection. Check the error message detail and try again.
Aicpu kernel execute failed, device_id=0, stream_id=908, task_id=3, fault so_name=, fault kernel_name=, fault op_name=input_producer/input_producer_EnqueueMany, extend_info=(info_type:4, info_len:41, msg_info:input_producer/input_producer_EnqueueMany)[FUNC:ProcessDrvErr][FILE:stream.cc][LINE:680]
Stream synchronize failed, stream = 0xfffe2c28a2f0[FUNC:StreamSynchronize][FILE:logger.cc][LINE:285]

 [[{{node GeOp9_0}}]]

I1125 22:12:38.887955 281466294485472 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
EE3001: The process has lost connection between the host and device. This might be caused by execution timeout of particular operators or unstable connection. Check the error message detail and try again.
Aicpu kernel execute failed, device_id=0, stream_id=908, task_id=3, fault so_name=, fault kernel_name=, fault op_name=input_producer/input_producer_EnqueueMany, extend_info=(info_type:4, info_len:41, msg_info:input_producer/input_producer_EnqueueMany)[FUNC:ProcessDrvErr][FILE:stream.cc][LINE:680]
Stream synchronize failed, stream = 0xfffe2c28a2f0[FUNC:StreamSynchronize][FILE:logger.cc][LINE:285]

 [[{{node GeOp9_0}}]]

2021-11-25 22:12:39.073107: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:91

2. Software versions:
-- CANN version: 5.03
-- TensorFlow version: 1.15
-- Python version: 3.7
-- Hardware: Ascend 910 (CPU: vCPUs / 96GB memory)
-- Local OS: Windows 10
-- Run via the PyCharm ModelArts plugin

3. Test steps:
Upload the dataset to an OBS bucket, then run and debug locally with the PyCharm ModelArts plugin.
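As the boot command in the log shows, ModelArts launches `bash.py` with `--data_url` and `--train_url` pointing at the locally cached input and output directories. A minimal stdlib sketch of how an entry script can pick these up (the argument names and paths are taken from the log; the parsing code itself is illustrative, not the repo's actual implementation):

```python
import argparse

def parse_ma_args(argv=None):
    """Parse the data/output paths that ModelArts passes to the boot file."""
    parser = argparse.ArgumentParser()
    # local cache of the OBS input, mounted by the platform
    parser.add_argument("--data_url", type=str)
    # local output dir, periodically synced back to OBS by the sidecar
    parser.add_argument("--train_url", type=str)
    return parser.parse_args(argv)

# the exact arguments seen in the log's "Davinci training command"
args = parse_ma_args([
    "--data_url=/home/ma-user/modelarts/inputs/data_url_0/",
    "--train_url=/home/ma-user/modelarts/outputs/train_url_0/",
])
print(args.data_url)
```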

The PyCharm Toolkit plugin configuration is shown in the attached screenshot (image not included in this export).

【Code link】
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=hZCF5qtUlknpNASBV1aRGSyuRUG0yvx6pMpYtyP78bftjwvZCS8MHlBbrDMk3CxVpXNdBmfIM/xZczyBIAXOjkUqnnr2f/5PSjClWRW6eecA1c4LVPBTCL5Jkoak5KqRifnxgbo/MuWySWK9Y3Ds7mAVGSWgqzvAiHuXVxYk4rhWA3WAgQyOhHnltfIZSab5bmo3NGm0d5W/hHqpoURQb2s9pZtBV2lirycnCizb4NqaSDuLzfWO3imvJO7Dngw8gTi+O9PYJulXIs+rMyE0PzOnMKN3XtHr3TMAZttuej2SLFk42GYUylsQcwXmh8fd0uGt/ZtodLgirCDIxQtuqipINwygHZJWBllk/5mDouEexTC0yecQqexqcZgex4a/jwRlOWrz3i9Au5DwLAD+9FMoP/sMd9GDKYunvp8xaCoxFdbYYiF/MRKU1bhMOdXUwxsXd6YZny5o0Zhig1ba2p2q8aElRPiV1nTW0A5dMVbSS+qixizv6WVGu6vSAgq3fzBPtId8naXm6v7GlSBmbCegjRI+DoxfvRmUqlq4FKlWpHU2AQ/H6fQTqnBhFlVFGtC3EFGA4C/1tDI7i+5tYYIF1t5qmzhXEO0VNVM8x0U=

Extraction code:
123456

  • Valid until:
    2021/11/26 03:21:13

【Dataset】
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=JqBZP9wXgNR8FX5POqxHbFuHReOMzWnQKxAZABshuUhlIq2KM9oxqsYW3J9BbGO2fxydl/cFzTUrilnEsrVUFA1/bc5Iwo9tx3sm/ZDqWGBy6fpVAeQqLf3rDpZZVYs/x8+dYB9KWQhuQQeTnGtHFSfpSpQUNU/wZfaBzsIWAkwhwnAzjjB8DHPgBnk+PquUM1DYuUmggVxZox5PwUuXtf1+Yx7Z3ZkgG7tWyTdaqQPdB5ktP2mnykoh3q1APJ/aLrolrfQ1vZ7q7GxELWgEjzmTlcboZtIFYXYruZc6nSZD4T3DqB1ZAz4AWOCxBlDVCkschz9PgIekuyAiz/t+alyznql3ld1MHYUX73hehYWNih9iCOcT9Eo0Cp6dk9gM+yU+N8Gylo/uzzJlN/cKlC93+msF2JYJ8V19uk097v7G39bVVrFb3gtU89traiTfPEIxgLBykhuYDwZYHexMtM/0gGkBUUyJ0s0D3MsIbzgQNrw/qLaKkUQGCKHkA8JKjclDUWtGqO833jy5xNqG2lIczyoLM8+odn5oS5kdbMU=

Extraction code:
132456

  • Valid until:
    2021/11/26 16:22:04

4. Log information:
【log】
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=hZCF5qtUlknpNASBV1aRGSyuRUG0yvx6pMpYtyP78bftjwvZCS8MHlBbrDMk3CxVpXNdBmfIM/xZczyBIAXOjkUqnnr2f/5PSjClWRW6eecA1c4LVPBTCL5Jkoak5KqRifnxgbo/MuWySWK9Y3Ds7mAVGSWgqzvAiHuXVxYk4rhWA3WAgQyOhHnltfIZSab5bmo3NGm0d5W/hHqpoURQb65lX3mviFKSDu9n8dKEZK3MY1RkiFSDF1MISJfKxNvAfa2ohKB93evcQtvNl2uuWTjkwck9TAii327PpiFEd88pVvRC4mYD19UuRISiWb60INNEV+N9Jfgw+HLl+qs10rrTDAQwwy9+k+rDBYzavDObDm0BGrT3/XEhTA+LOVaHy+/YDO8zAYYY4j7JDDiELO9jBAUNcqSWYpJjSzsOPNuxVBjb4FGW9ULDonYz04dK/9NE1G44U2ZvUHvwc+I0j+HyooafC1Mhptxvewszb0b4UzC0Mo5JC0rLhukth2rutyIjgFY4O8qymHTPr+My88KRQWO+L0bSfN73Ue349RhHf31Hcg5fs8YRr3i2XQO+GM/LsURNkDK1HXT+NcSlzXU9oTkuiM2EMxgrduA1dcE=

Extraction code:
123456

  • Valid until:
    2021/11/26 16:22:48

Comments (4)

王培亮 created this Bug-Report
zhujianpeng set the assignee to 张晓龙
zhujianpeng changed the task status from TODO to Analysing

Hello, after making the changes I get an error like this:
2021/12/01 19:54:54 job status is Creating please wait.
2021/12/01 19:55:04 job status is Queuing please wait.
2021/12/01 19:55:14 job status is Queuing please wait.
2021/12/01 19:55:25 job status is Queuing please wait.
2021/12/01 19:55:35 job status is Queuing please wait.
2021/12/01 19:55:45 job status is Queuing please wait.
2021/12/01 19:55:56 job status is Downloading please wait.
2021/12/01 19:56:06 job status is Downloading please wait.
2021/12/01 19:56:16 job status is Downloading please wait.
2021/12/01 19:56:26 job status is Downloading please wait.
2021/12/01 19:56:36 job status is Downloading please wait.
2021/12/01 19:56:47 job status is Downloading please wait.
2021/12/01 19:56:57 job status is Downloading please wait.
2021/12/01 19:57:07 job status is Downloading please wait.
2021/12/01 19:57:17 job status is Running please wait.
[ModelArts Service Log][INFO][2021/12/01 19:57:09]: cache the content of [data_url] inputs successfully
[ModelArts Service Log][INFO][2021/12/01 19:57:09]: it can be accessed at local dir [/home/ma-user/modelarts/inputs/data_url_0]
[ModelArts Service Log][INFO][2021/12/01 19:57:10,506]: mkdir for local output dir
[ModelArts Service Log][INFO][2021/12/01 19:57:10,506]: output-handler finalized
[ModelArts Service Log][init] exiting at 2021-12-01-19:57:10
[ModelArts Service Log][init] upload_metrics_pid = 445
[ModelArts Service Log][init] stop toolkit_obs_upload_pid = 51 by signal SIGTERM
time="2021-12-01T19:57:11+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload 51 ret_code is 0
[ModelArts Service Log][init] exit with 0
time="2021-12-01T19:57:13+08:00" level=info msg="run command: mkdir -p ~/.pip; echo -e '[global]\ntrusted-host = repo.myhuaweicloud.com\nindex-url = http://repo.myhuaweicloud.com/repository/pypi/simple/' > ~/.pip/pip.conf; bash /home/ma-user/modelarts/run/run_train_v2.sh /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ ; ret=$?; echo $ret > /home/ma-user/modelarts/retCode; exit $ret" file="run_train.go:169" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log]user: uid=1000(HwHiAiUser) gid=1000(HwHiAiUser) groups=1000(HwHiAiUser)
[ModelArts Service Log]pwd: /home/work
[ModelArts Service Log]boot_file: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py
[ModelArts Service Log]command: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/
[ModelArts Service Log]local_code_dir: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src
[ModelArts Service Log]training start at 2021-12-01-19:57:13
[ModelArts Service Log]skip install modelarts training system python packages, due to it's customized image
[ModelArts Service Log]you may install them if necessary
/home/ma-user/modelarts/user-job-dir
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
time="2021-12-01T19:57:14+08:00" level=info msg="upload command: /home/ma-user/training/sidecar.sh" file="upload.go:38" Command=bootstrap/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload_job_pid = 32
[ModelArts Service Log][sidecar] toolkit_obs_upload_pid = 33
[ModelArts Service Log][sidecar] running at 2021-12-01-19:57:14
[ModelArts Service Log][sidecar] outputs_handler_job_pid = 60
[ModelArts Service Log][sidecar] outputs_handler_pid = 61
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_job_pid = 75
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 77
[ModelArts Service Log][sidecar] waiting for training complete
time="2021-12-01T19:57:14+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="obs dir = s3://modelarts-training-log-cn-north-4/5c998f1b-43e8-4fec-99e8-9ffbb3674001/worker-0" file="upload.go:42" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:52" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="local dir = /home/ma-user/modelarts/outputs/train_url_0/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-12-01-19-54/output/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-12-01-19-54/log/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
[ModelArts Service Log]2021-12-01 19:57:14,581 - INFO - Ascend Driver: Version=21.0.2
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - Davinci training command
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - ['/home/ma-user/anaconda/bin/python', '/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py', '--data_url=/home/ma-user/modelarts/inputs/data_url_0/', '--train_url=/home/ma-user/modelarts/outputs/train_url_0/']
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - Wait for Rank table file ready
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - Rank table file (K8S generated) is ready for read
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO -
{
"status": "completed",
"group_count": "1",
"group_list": [
{
"group_name": "worker",
"device_count": "1",
"instance_count": "1",
"instance_list": [
{
"pod_name": "ma-job-5c998f1b-43e8-4fec-99e8-9ffbb3674001-worker-0",
"server_id": "192.168.0.91",
"devices": [
{
"device_id": "0",
"device_ip": "192.1.216.96"
}
]
}
]
}
]
}
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO - Rank table file (V1)
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO -
{
"status": "completed",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "192.168.0.91",
"device": [
{
"device_id": "0",
"device_ip": "192.1.216.96",
"rank_id": "0"
}
]
}
]
}
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO - Rank table file (V1) is generated
[ModelArts Service Log]2021-12-01 19:57:14,584 - INFO - Current server
[ModelArts Service Log]2021-12-01 19:57:14,584 - INFO -
{
"server_id": "192.168.0.91",
"device": [
{
"device_id": "0",
"device_ip": "192.1.216.96",
"rank_id": "0"
}
]
}
[ModelArts Service Log]2021-12-01 19:57:14,584 - ERROR - Route plan so files not found. Please check files in /usr/local/route
[ModelArts Service Log]2021-12-01 19:57:14,585 - INFO - bootstrap proc-rank-0-device-0
[ModelArts Service Log]2021-12-01 19:57:14,591 - INFO - proc-rank-0-device-0 (pid: 81)
[ModelArts Service Log][INFO][2021/12/01 19:57:15,499]: registered signal handler
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

runType=search!

WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.


Path /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search does not exist. Creating

Logging to /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/stdout

train_size: 929589
valid_size: 73760

Create a controller
Controller has 41344 params
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:147: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W1201 19:57:27.636366 281473056653680 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:147: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:150: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
W1201 19:57:27.643595 281473056653680 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:150: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:151: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W1201 19:57:27.645992 281473056653680 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:151: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.

Building LM
raw_data_size:929589
raw_data:Tensor("raw_data:0", shape=(929589,), dtype=int32)
data/reshape:Tensor("Reshape_36:0", shape=(64, 14524), dtype=int32)
raw_data_size:73760
raw_data:Tensor("raw_data_1:0", shape=(73760,), dtype=int32)
data/reshape:Tensor("Reshape_37:0", shape=(64, 1152), dtype=int32)
data_x:Tensor("transpose:0", shape=(1152, 64), dtype=int32)
data_y:Tensor("transpose_1:0", shape=(1152, 64), dtype=int32)
b_x:Tensor("IteratorGetNext:0", shape=(?, 64), dtype=int32)
x:Tensor("Reshape_38:0", shape=(64, 20), dtype=int32)
b_y:Tensor("IteratorGetNext_1:0", shape=(?, 64), dtype=int32)
y:Tensor("Reshape_39:0", shape=(64, 20), dtype=int32)
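As a quick sanity check, the reshape dimensions in the log follow directly from the dataset sizes and the batch size of 64 visible in every tensor shape (a minimal sketch; the variable names are ours, not from the ENAS source):

```python
# The logged reshapes split each token stream into batch_size rows,
# keeping tokens // batch_size columns per row (remainder truncated).
batch_size = 64          # visible in every logged tensor shape
train_size = 929589      # "train_size" from the log
valid_size = 73760       # "valid_size" from the log

print(train_size // batch_size)  # 14524 -> Reshape_36 shape (64, 14524)
print(valid_size // batch_size)  # 1152  -> Reshape_37 shape (64, 1152)
```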

Building model params
All children have 16560000 params
Each child has 2880000 params

Building train graph
inp:Tensor("while/strided_slice:0", shape=(64, 200), dtype=float32)
w_prev:Tensor("child/rnn_cell/mul:0", shape=(400, 400), dtype=float32)
layer:[<tf.Tensor 'while/add_1:0' shape=(64, 200) dtype=float32>]
layers_id:0
s before set_shape:Tensor("while/add_4:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_4:0", shape=(64, 200), dtype=float32)
layers_id:1
s before set_shape:Tensor("while/add_7:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_7:0", shape=(64, 200), dtype=float32)
layers_id:2
s before set_shape:Tensor("while/add_10:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_10:0", shape=(64, 200), dtype=float32)
layers_id:3
s before set_shape:Tensor("while/add_13:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_13:0", shape=(64, 200), dtype=float32)
layers_id:4
s before set_shape:Tensor("while/add_16:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_16:0", shape=(64, 200), dtype=float32)
layers_id:5
s before set_shape:Tensor("while/add_19:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_19:0", shape=(64, 200), dtype=float32)
layers_id:6
s before set_shape:Tensor("while/add_22:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_22:0", shape=(64, 200), dtype=float32)
layers_id:7
s before set_shape:Tensor("while/add_25:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_25:0", shape=(64, 200), dtype=float32)
layers_id:8
s before set_shape:Tensor("while/add_28:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while/add_28:0", shape=(64, 200), dtype=float32)
layers:[<tf.Tensor 'while/add_1:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_4:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_7:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_10:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_13:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_16:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_19:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_22:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_25:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while/add_28:0' shape=(64, 200) dtype=float32>]
s:Tensor("while/add_28:0", shape=(64, 200), dtype=float32)
step:Tensor("while/Identity:0", shape=(), dtype=int32)
next_s:Tensor("while/truediv:0", shape=(64, 200), dtype=float32)
stack_all_s:Tensor("TensorArrayStack/TensorArrayGatherV3:0", shape=(?, 64, 200), dtype=float32)
all_s:Tensor("transpose_4:0", shape=(64, ?, 200), dtype=float32)
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1201 19:57:29.936891 281473056653680 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Building valid graph
inp:Tensor("while_1/strided_slice:0", shape=(64, 200), dtype=float32)
w_prev:<tf.Variable 'child/rnn_cell/w_prev:0' shape=(400, 400) dtype=float32_ref>
layer:[<tf.Tensor 'while_1/add_1:0' shape=(64, 200) dtype=float32>]
layers_id:0
s before set_shape:Tensor("while_1/add_4:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_4:0", shape=(64, 200), dtype=float32)
layers_id:1
s before set_shape:Tensor("while_1/add_7:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_7:0", shape=(64, 200), dtype=float32)
layers_id:2
s before set_shape:Tensor("while_1/add_10:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_10:0", shape=(64, 200), dtype=float32)
layers_id:3
s before set_shape:Tensor("while_1/add_13:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_13:0", shape=(64, 200), dtype=float32)
layers_id:4
s before set_shape:Tensor("while_1/add_16:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_16:0", shape=(64, 200), dtype=float32)
layers_id:5
s before set_shape:Tensor("while_1/add_19:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_19:0", shape=(64, 200), dtype=float32)
layers_id:6
s before set_shape:Tensor("while_1/add_22:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_22:0", shape=(64, 200), dtype=float32)
layers_id:7
s before set_shape:Tensor("while_1/add_25:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_25:0", shape=(64, 200), dtype=float32)
layers_id:8
s before set_shape:Tensor("while_1/add_28:0", shape=(64, 200), dtype=float32)
s after set_shape:Tensor("while_1/add_28:0", shape=(64, 200), dtype=float32)
layers:[<tf.Tensor 'while_1/add_1:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_4:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_7:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_10:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_13:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_16:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_19:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_22:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_25:0' shape=(64, 200) dtype=float32>, <tf.Tensor 'while_1/add_28:0' shape=(64, 200) dtype=float32>]
s:Tensor("while_1/add_28:0", shape=(64, 200), dtype=float32)
step:Tensor("while_1/Identity:0", shape=(), dtype=int32)
next_s:Tensor("while_1/truediv:0", shape=(64, 200), dtype=float32)
stack_all_s:Tensor("TensorArrayStack_1/TensorArrayGatherV3:0", shape=(?, 64, 200), dtype=float32)
all_s:Tensor("transpose_5:0", shape=(64, ?, 200), dtype=float32)
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:43: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
W1201 19:57:32.262350 281473056653680 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:43: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=10; total_num_replicas=1
I1201 19:57:32.262695 281473056653680 sync_replicas_optimizer.py:188] SyncReplicasV2: replicas_to_aggregate=10; total_num_replicas=1
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/sync_replicas_optimizer.py:351: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
W1201 19:57:35.083702 281473056653680 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/sync_replicas_optimizer.py:351: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.

HParams:
{
"alpha": 0.0,
"batch_size": 64,
"best_valid_ppl_threshold": 5,
"beta": 1.0,
"bptt_steps": 20,
"controller_baseline_dec": 0.999,
"controller_entropy_weight": 1e-05,
"controller_hidden_size": 64,
"controller_learning_rate": 5e-05,
"controller_num_aggregate": 10,
"controller_num_functions": 4,
"controller_num_layers": 9,
"controller_num_train_steps": 25,
"controller_tanh_constant": 2.25,
"controller_temperature": 5.0,
"data_path": "/home/ma-user/modelarts/inputs/data_url_0/ptb/ptb.pkl",
"drop_e": 0.1,
"drop_i": 0.2,
"drop_l": 0.25,
"drop_o": 0.75,
"drop_w": 0.0,
"drop_x": 0.75,
"grad_bound": 0.1,
"hidden_size": 200,
"init_range": 0.04,
"learning_rate": 20.0,
"log_every": 200,
"num_train_batches": 726,
"num_train_epochs": 600,
"num_train_steps": 435600,
"output_dir": "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search",
"vocab_size": 10000,
"weight_decay": 8e-07
}
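The HParams dump is internally consistent: `num_train_steps` is simply `num_train_batches * num_train_epochs` (a one-line check using the values printed above):

```python
# num_train_steps in the HParams dump equals batches-per-epoch * epochs.
num_train_batches = 726
num_train_epochs = 600
print(num_train_batches * num_train_epochs)  # 435600, as in the dump
```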
INFO:tensorflow:Create CheckpointSaverHook.
I1201 19:57:35.173776 281473056653680 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1201 19:57:47.176337 281473056653680 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Graph was finalized.
I1201 19:57:47.395758 281473056653680 monitored_session.py:240] Graph was finalized.
2021-12-01 19:57:47.418316: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-12-01 19:57:47.426044: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xaaaae124e110 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-12-01 19:57:47.426090: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-12-01 19:57:52.445980: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:1
INFO:tensorflow:Running local_init_op.
I1201 19:58:04.936841 281473056653680 session_manager.py:500] Running local_init_op.
2021-12-01 19:58:05.116458: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:21
INFO:tensorflow:Done running local_init_op.
I1201 19:58:05.172280 281473056653680 session_manager.py:502] Done running local_init_op.
2021-12-01 19:58:05.551652: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:41
INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/model.ckpt.
I1201 19:58:24.926260 281473056653680 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/model.ckpt.
2021-12-01 19:58:25.179278: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:51
2021-12-01 19:58:33.356756: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:71
2021-12-01 19:58:33.641136: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:81
2021-12-01 19:58:34.214017: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:91
2021-12-01 19:58:36.420151: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:101
[WARNING] TBE:2021-12-01-20:00:28 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:29 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*3328)) + 4096) - 1), 4096)*4096) + 16826368), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:29 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:29 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 15), 16)*3328)) + 4096) - 1), 4096)*4096) + 16826368), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*53248)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 3), 4)*3328)) + 4096) - 1), 4096)*4096) + 17039360), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 63), 64)*53248)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*53248)) + 4096) - 1), 4096)*4096) + 16719872), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 18743296, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((40960000 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 17039360, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:32 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20971520 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:32 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 255), 256)*425984)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:32 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 34131968, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:46 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*((floordiv((m(1, 2147483647) + 15), 16)*32)*(floordiv((k(1, 2147483647) + 31), 32)*512))) + 4096) - 1), 4096)*4096) + (floordiv((((16*(floordiv((k(1, 2147483647) + 31), 32)*5120000)) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:46 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16*(floordiv((k(1, 2147483647) + 31), 32)*5120000)) + 4096) - 1), 4096)*4096) + (floordiv((((16*(floordiv((k(1, 2147483647) + 31), 32)*8192)) + 4096) - 1), 4096)*4096)) + (floordiv((((16*(floordiv((k(1, 2147483647) + 31), 32)*8192)) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:00 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*160000)) + 4096) - 1), 4096)*4096) + 40960000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:00 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:00 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16*(m(1, 2147483647)*160000)) + 4096) - 1), 4096)*4096) + 5120000) + 5120000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:01 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (34078720 + (floordiv(((max(max(max(65536, 65536), 65536), 65536) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:01 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:01 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(32768, 32768), 32768), 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:02 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:02 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16*(floordiv((m(1, 2147483647) + 7), 8)*655360)) + 4096) - 1), 4096)*4096) + 65536) + 65536), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 15), 16)*643072)) + 4096) - 1), 4096)*4096) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(32768, 32768), 32768), 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*2560000)) + 4096) - 1), 4096)*4096) + 10240000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(32768, 32768), 32768), 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 63), 64)*2560000)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((10354688 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 21233664, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16*(floordiv((m(1, 2147483647) + 63), 64)*641024)) + 4096) - 1), 4096)*4096) + 1048576) + 1048576), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*163840)) + 4096) - 1), 4096)*4096) + 34078720), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16*(floordiv((m(1, 2147483647) + 63), 64)*640000)) + 4096) - 1), 4096)*4096) + 1048576) + 1048576), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:05 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (10354688 + (floordiv(((max(max(max(262144, 262144), 262144), 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 21757952, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(131072, 131072), 131072), 131072) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((10240000 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16*(floordiv((m(1, 2147483647) + 255), 256)*1294336)) + 4096) - 1), 4096)*4096) + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)) + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20709376 + (floordiv(((max(1048576, 1048576) + 4096) - 1), 4096)*4096)) + (floordiv(((max(1048576, 1048576) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20709376 + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)) + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20578304 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(262144, 262144), 262144), 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
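All of the TBE warnings above share one shape: a kernel's requested allocation is compared against the fixed local.L1 capacity of 8388608 bits quoted in each line. A rough illustration with the constant-size requests taken from the log (the helper function is ours, purely for arithmetic):

```python
# Each TBE warning reports a kernel whose L1 request exceeds the
# 8388608-bit local.L1 capacity quoted in the log.
L1_CAPACITY_BITS = 8388608

def l1_overflow_bits(used_bits):
    """Bits by which a request exceeds L1 capacity (0 if it fits)."""
    return max(0, used_bits - L1_CAPACITY_BITS)

# Constant-size requests taken verbatim from the warnings above:
for used in (18743296, 17039360, 34131968, 21233664, 21757952):
    print(used, "->", l1_overflow_bits(used), "bits over L1")
```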
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp19_0GEOP::::DoRunAsync Failed
Error Message is :

 [[{{node GeOp19_0}}]]

I1201 20:02:01.978578 281465065296352 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp19_0GEOP::::DoRunAsync Failed
Error Message is :

 [[{{node GeOp19_0}}]]

2021-12-01 20:02:02.178498: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:111
time="2021-12-01T20:04:51+08:00" level=info msg="stop upload_command_pid" file="upload.go:54" Command=bootstrap/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T20:04:51+08:00" level=info msg="stop run_command_pid" file="run_train.go:185" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T20:04:51+08:00" level=error msg="signal: terminated, details info: run command err" file="run_train.go:62" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] exiting at 2021-12-01-20:04:52
[ModelArts Service Log][sidecar] exit with
[ModelArts Service Log][sidecar] stop toolkit_obs_upload_pid = 33 by signal SIGTERM
time="2021-12-01T20:04:52+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload 33 ret_code is 0

张晓龙 changed the task status from Analysing to DONE

The inputproducer issue has been resolved; separate issues will be filed for the remaining problems.

吴定远 changed the linked repository from Ascend/modelzoo-his to Ascend/modelzoo
