1. Problem symptoms (with error log context):
The log shows a problem with the connection to the device:
2021/11/25 22:08:47 job status is Creating please wait.
2021/11/25 22:08:57 job status is Queuing please wait.
2021/11/25 22:09:08 job status is Downloading please wait.
2021/11/25 22:09:19 job status is Downloading please wait.
2021/11/25 22:09:29 job status is Downloading please wait.
2021/11/25 22:09:39 job status is Downloading please wait.
2021/11/25 22:09:49 job status is Downloading please wait.
2021/11/25 22:10:00 job status is Downloading please wait.
2021/11/25 22:10:10 job status is Downloading please wait.
2021/11/25 22:10:20 job status is Downloading please wait.
2021/11/25 22:10:30 job status is Downloading please wait.
2021/11/25 22:10:41 job status is Downloading please wait.
2021/11/25 22:10:51 job status is Downloading please wait.
2021/11/25 22:11:01 job status is Downloading please wait.
2021/11/25 22:11:11 job status is Downloading please wait.
2021/11/25 22:11:21 job status is Running please wait.
[ModelArts Service Log][INFO][2021/11/25 22:10:54]: cache the content of [data_url] inputs successfully
[ModelArts Service Log][INFO][2021/11/25 22:10:54]: it can be accessed at local dir [/home/ma-user/modelarts/inputs/data_url_0]
[ModelArts Service Log][INFO][2021/11/25 22:10:55,786]: mkdir for local output dir
[ModelArts Service Log][INFO][2021/11/25 22:10:55,786]: output-handler finalized
[ModelArts Service Log][init] exiting at 2021-11-25-22:10:55
[ModelArts Service Log][init] upload_metrics_pid = 449
[ModelArts Service Log][init] stop toolkit_obs_upload_pid = 53 by signal SIGTERM
time="2021-11-25T22:10:56+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload 53 ret_code is 0
[ModelArts Service Log][init] exit with 0
time="2021-11-25T22:10:58+08:00" level=info msg="run command: mkdir -p ~/.pip; echo -e '[global]\ntrusted-host = repo.myhuaweicloud.com\nindex-url = http://repo.myhuaweicloud.com/repository/pypi/simple/' > ~/.pip/pip.conf; bash /home/ma-user/modelarts/run/run_train_v2.sh /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ ; ret=$?; echo $ret > /home/ma-user/modelarts/retCode; exit $ret" file="run_train.go:169" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log]user: uid=1000(HwHiAiUser) gid=1000(HwHiAiUser) groups=1000(HwHiAiUser)
[ModelArts Service Log]pwd: /home/work
[ModelArts Service Log]boot_file: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py
[ModelArts Service Log]command: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/
[ModelArts Service Log]local_code_dir: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src
[ModelArts Service Log]training start at 2021-11-25-22:10:59
[ModelArts Service Log]skip install modelarts training system python packages, due to it's customized image
[ModelArts Service Log]you may install them if necessary
/home/ma-user/modelarts/user-job-dir
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
time="2021-11-25T22:10:59+08:00" level=info msg="upload command: /home/ma-user/training/sidecar.sh" file="upload.go:38" Command=bootstrap/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload_job_pid = 32
[ModelArts Service Log][sidecar] toolkit_obs_upload_pid = 34
[ModelArts Service Log][sidecar] running at 2021-11-25-22:10:59
[ModelArts Service Log][sidecar] outputs_handler_job_pid = 60
[ModelArts Service Log][sidecar] outputs_handler_pid = 61
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_job_pid = 75
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 77
[ModelArts Service Log][sidecar] waiting for training complete
time="2021-11-25T22:10:59+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="obs dir = s3://modelarts-training-log-cn-north-4/7bde9cc4-6f48-4551-b298-afc066cc53c8/worker-0" file="upload.go:42" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:52" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="local dir = /home/ma-user/modelarts/outputs/train_url_0/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-11-25-22-07/output/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-25T22:10:59+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
[ModelArts Service Log]2021-11-25 22:10:59,983 - INFO - Ascend Driver: Version=21.0.2
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - Davinci training command
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - ['/home/ma-user/anaconda/bin/python', '/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py', '--data_url=/home/ma-user/modelarts/inputs/data_url_0/', '--train_url=/home/ma-user/modelarts/outputs/train_url_0/']
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - Wait for Rank table file ready
[ModelArts Service Log]2021-11-25 22:10:59,984 - INFO - Rank table file (K8S generated) is ready for read
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO -
{
"status": "completed",
"group_count": "1",
"group_list": [
{
"group_name": "worker",
"device_count": "1",
"instance_count": "1",
"instance_list": [
{
"pod_name": "ma-job-7bde9cc4-6f48-4551-b298-afc066cc53c8-worker-0",
"server_id": "192.168.0.88",
"devices": [
{
"device_id": "4",
"device_ip": "192.1.210.98"
}
]
}
]
}
]
}
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO - Rank table file (V1)
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO -
{
"status": "completed",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "192.168.0.88",
"device": [
{
"device_id": "4",
"device_ip": "192.1.210.98",
"rank_id": "0"
}
]
}
]
}
[ModelArts Service Log]2021-11-25 22:10:59,985 - INFO - Rank table file (V1) is generated
[ModelArts Service Log]2021-11-25 22:10:59,986 - INFO - Current server
[ModelArts Service Log]2021-11-25 22:10:59,986 - INFO -
{
"server_id": "192.168.0.88",
"device": [
{
"device_id": "4",
"device_ip": "192.1.210.98",
"rank_id": "0"
}
]
}
[ModelArts Service Log]2021-11-25 22:10:59,987 - ERROR - Route plan so files not found. Please check files in /usr/local/route
[ModelArts Service Log]2021-11-25 22:10:59,988 - INFO - bootstrap proc-rank-0-device-0
[ModelArts Service Log]2021-11-25 22:10:59,997 - INFO - proc-rank-0-device-0 (pid: 81)
time="2021-11-25T22:11:00+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-11-25T22:11:00+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-11-25-22-07/log/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-11-25T22:11:00+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
[ModelArts Service Log][INFO][2021/11/25 22:11:01,011]: registered signal handler
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
tf.random.categorical instead.
tf.random.categorical instead.
tf.data module.
tf.data module.
Building train graph
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1125 22:11:17.449046 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Building valid graph
WARNING:tensorflow:From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:43: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
W1125 22:11:19.832743 281473482445168 deprecation.py:323] From /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/controller.py:43: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=10; total_num_replicas=1
I1125 22:11:19.833111 281473482445168 sync_replicas_optimizer.py:188] SyncReplicasV2: replicas_to_aggregate=10; total_num_replicas=1
INFO:tensorflow:Create CheckpointSaverHook.
I1125 22:11:22.893970 281473482445168 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1125 22:11:35.191197 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2021-11-25 22:11:35.464592: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-11-25 22:11:35.474593: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xaaaae73479b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-25 22:11:35.474649: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-11-25 22:11:44.270727: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:1
INFO:tensorflow:Running local_init_op.
I1125 22:11:58.227085 281473482445168 session_manager.py:500] Running local_init_op.
2021-11-25 22:11:58.399832: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:21
INFO:tensorflow:Done running local_init_op.
I1125 22:11:58.457854 281473482445168 session_manager.py:502] Done running local_init_op.
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py:882: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
W1125 22:11:58.648907 281473482445168 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py:882: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
2021-11-25 22:11:58.919788: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:41
2021-11-25 22:11:59.097355: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:51
I1125 22:12:21.824146 281473482445168 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/model.ckpt.
2021-11-25 22:12:22.223409: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:61
2021-11-25 22:12:32.528275: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:81
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
EE3001: The process has lost connection between the host and device. This might be caused by execution timeout of particular operators or unstable connection. Check the error message detail and try again.
Aicpu kernel execute failed, device_id=0, stream_id=908, task_id=3, fault so_name=, fault kernel_name=, fault op_name=input_producer/input_producer_EnqueueMany, extend_info=(info_type:4, info_len:41, msg_info:input_producer/input_producer_EnqueueMany)[FUNC:ProcessDrvErr][FILE:stream.cc][LINE:680]
Stream synchronize failed, stream = 0xfffe2c28a2f0[FUNC:StreamSynchronize][FILE:logger.cc][LINE:285]
[[{{node GeOp9_0}}]]
I1125 22:12:38.887955 281466294485472 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp9_0GEOP::::DoRunAsync Failed
Error Message is :
EE3001: The process has lost connection between the host and device. This might be caused by execution timeout of particular operators or unstable connection. Check the error message detail and try again.
Aicpu kernel execute failed, device_id=0, stream_id=908, task_id=3, fault so_name=, fault kernel_name=, fault op_name=input_producer/input_producer_EnqueueMany, extend_info=(info_type:4, info_len:41, msg_info:input_producer/input_producer_EnqueueMany)[FUNC:ProcessDrvErr][FILE:stream.cc][LINE:680]
Stream synchronize failed, stream = 0xfffe2c28a2f0[FUNC:StreamSynchronize][FILE:logger.cc][LINE:285]
[[{{node GeOp9_0}}]]
2021-11-25 22:12:39.073107: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:91
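
The failure above is easy to miss among the deprecation warnings. As a minimal sketch (not part of the original run), the following Python snippet pulls the error code and the failing operator name out of the two key lines. The excerpt is copied from the log above; the regular expressions and the helper name `summarize_failure` are illustrative assumptions, not an official Ascend log format definition.

```python
import re

# Two lines copied verbatim from the log above.
LOG_EXCERPT = """\
EE3001: The process has lost connection between the host and device. This might be caused by execution timeout of particular operators or unstable connection. Check the error message detail and try again.
Aicpu kernel execute failed, device_id=0, stream_id=908, task_id=3, fault so_name=, fault kernel_name=, fault op_name=input_producer/input_producer_EnqueueMany, extend_info=(info_type:4, info_len:41, msg_info:input_producer/input_producer_EnqueueMany)[FUNC:ProcessDrvErr][FILE:stream.cc][LINE:680]
"""

# Assumed patterns: an error code is two capital letters plus four digits
# followed by a colon; the failing operator follows "fault op_name=".
ERROR_CODE_RE = re.compile(r"\b([A-Z]{2}\d{4}):")
FAULT_OP_RE = re.compile(r"fault op_name=([^,\s]+)")


def summarize_failure(log_text):
    """Return (error_codes, fault_ops) found in log_text, de-duplicated, in order."""
    codes = list(dict.fromkeys(ERROR_CODE_RE.findall(log_text)))
    ops = list(dict.fromkeys(FAULT_OP_RE.findall(log_text)))
    return codes, ops


codes, ops = summarize_failure(LOG_EXCERPT)
print("error codes:", codes)
print("fault ops:", ops)
```

In this log the operator reported is `input_producer/input_producer_EnqueueMany`, which suggests the failure occurs in the queue-based input pipeline rather than in the model computation itself.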
2. Software versions:
-- CANN version: 5.03
-- TensorFlow version: 1.15
-- Python version: 3.7
-- Hardware: Ascend 910, CPU: vCPUs 96GB
-- Operating system: Windows 10
-- Run through the PyCharm ModelArts plugin
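3. Test steps:
Upload the dataset to an OBS bucket, then run and debug from the local machine with the PyCharm ModelArts plugin.
The PyCharm Toolkit plugin configuration is shown in the figure below: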
Extraction code: 123456
Extraction code: 132456
Extraction code: 123456
Hello, after making the change I now get the following error:
2021/12/01 19:54:54 job status is Creating please wait.
2021/12/01 19:55:04 job status is Queuing please wait.
2021/12/01 19:55:14 job status is Queuing please wait.
2021/12/01 19:55:25 job status is Queuing please wait.
2021/12/01 19:55:35 job status is Queuing please wait.
2021/12/01 19:55:45 job status is Queuing please wait.
2021/12/01 19:55:56 job status is Downloading please wait.
2021/12/01 19:56:06 job status is Downloading please wait.
2021/12/01 19:56:16 job status is Downloading please wait.
2021/12/01 19:56:26 job status is Downloading please wait.
2021/12/01 19:56:36 job status is Downloading please wait.
2021/12/01 19:56:47 job status is Downloading please wait.
2021/12/01 19:56:57 job status is Downloading please wait.
2021/12/01 19:57:07 job status is Downloading please wait.
2021/12/01 19:57:17 job status is Running please wait.
[ModelArts Service Log][INFO][2021/12/01 19:57:09]: cache the content of [data_url] inputs successfully
[ModelArts Service Log][INFO][2021/12/01 19:57:09]: it can be accessed at local dir [/home/ma-user/modelarts/inputs/data_url_0]
[ModelArts Service Log][INFO][2021/12/01 19:57:10,506]: mkdir for local output dir
[ModelArts Service Log][INFO][2021/12/01 19:57:10,506]: output-handler finalized
[ModelArts Service Log][init] exiting at 2021-12-01-19:57:10
[ModelArts Service Log][init] upload_metrics_pid = 445
[ModelArts Service Log][init] stop toolkit_obs_upload_pid = 51 by signal SIGTERM
time="2021-12-01T19:57:11+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload 51 ret_code is 0
[ModelArts Service Log][init] exit with 0
time="2021-12-01T19:57:13+08:00" level=info msg="run command: mkdir -p ~/.pip; echo -e '[global]\ntrusted-host = repo.myhuaweicloud.com\nindex-url = http://repo.myhuaweicloud.com/repository/pypi/simple/' > ~/.pip/pip.conf; bash /home/ma-user/modelarts/run/run_train_v2.sh /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/ ; ret=$?; echo $ret > /home/ma-user/modelarts/retCode; exit $ret" file="run_train.go:169" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log]user: uid=1000(HwHiAiUser) gid=1000(HwHiAiUser) groups=1000(HwHiAiUser)
[ModelArts Service Log]pwd: /home/work
[ModelArts Service Log]boot_file: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py
[ModelArts Service Log]command: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py --data_url=/home/ma-user/modelarts/inputs/data_url_0/ --train_url=/home/ma-user/modelarts/outputs/train_url_0/
[ModelArts Service Log]local_code_dir: /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src
[ModelArts Service Log]training start at 2021-12-01-19:57:13
[ModelArts Service Log]skip install modelarts training system python packages, due to it's customized image
[ModelArts Service Log]you may install them if necessary
/home/ma-user/modelarts/user-job-dir
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
time="2021-12-01T19:57:14+08:00" level=info msg="upload command: /home/ma-user/training/sidecar.sh" file="upload.go:38" Command=bootstrap/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload_job_pid = 32
[ModelArts Service Log][sidecar] toolkit_obs_upload_pid = 33
[ModelArts Service Log][sidecar] running at 2021-12-01-19:57:14
[ModelArts Service Log][sidecar] outputs_handler_job_pid = 60
[ModelArts Service Log][sidecar] outputs_handler_pid = 61
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_job_pid = 75
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 77
[ModelArts Service Log][sidecar] waiting for training complete
time="2021-12-01T19:57:14+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="obs dir = s3://modelarts-training-log-cn-north-4/5c998f1b-43e8-4fec-99e8-9ffbb3674001/worker-0" file="upload.go:42" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:52" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="local dir = /home/ma-user/modelarts/outputs/train_url_0/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-12-01-19-54/output/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="obs dir = s3://rstg/MA-new-enas-12-01-19-54/log/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-12-01T19:57:14+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
[ModelArts Service Log]2021-12-01 19:57:14,581 - INFO - Ascend Driver: Version=21.0.2
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - Davinci training command
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - ['/home/ma-user/anaconda/bin/python', '/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/bash.py', '--data_url=/home/ma-user/modelarts/inputs/data_url_0/', '--train_url=/home/ma-user/modelarts/outputs/train_url_0/']
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - Wait for Rank table file ready
[ModelArts Service Log]2021-12-01 19:57:14,582 - INFO - Rank table file (K8S generated) is ready for read
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO -
{
"status": "completed",
"group_count": "1",
"group_list": [
{
"group_name": "worker",
"device_count": "1",
"instance_count": "1",
"instance_list": [
{
"pod_name": "ma-job-5c998f1b-43e8-4fec-99e8-9ffbb3674001-worker-0",
"server_id": "192.168.0.91",
"devices": [
{
"device_id": "0",
"device_ip": "192.1.216.96"
}
]
}
]
}
]
}
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO - Rank table file (V1)
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO -
{
"status": "completed",
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "192.168.0.91",
"device": [
{
"device_id": "0",
"device_ip": "192.1.216.96",
"rank_id": "0"
}
]
}
]
}
[ModelArts Service Log]2021-12-01 19:57:14,583 - INFO - Rank table file (V1) is generated
[ModelArts Service Log]2021-12-01 19:57:14,584 - INFO - Current server
[ModelArts Service Log]2021-12-01 19:57:14,584 - INFO -
{
"server_id": "192.168.0.91",
"device": [
{
"device_id": "0",
"device_ip": "192.1.216.96",
"rank_id": "0"
}
]
}
[ModelArts Service Log]2021-12-01 19:57:14,584 - ERROR - Route plan so files not found. Please check files in /usr/local/route
[ModelArts Service Log]2021-12-01 19:57:14,585 - INFO - bootstrap proc-rank-0-device-0
[ModelArts Service Log]2021-12-01 19:57:14,591 - INFO - proc-rank-0-device-0 (pid: 81)
[ModelArts Service Log][INFO][2021/12/01 19:57:15,499]: registered signal handler
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
runType=search!
WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:279: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
tf.cast instead.
tf.cast instead.
tf.random.categorical instead.
tf.random.categorical instead.
tf.cast instead.
tf.cast instead.
SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
tf.data module.
tf.data module.
HParams:
{
"alpha": 0.0,
"batch_size": 64,
"best_valid_ppl_threshold": 5,
"beta": 1.0,
"bptt_steps": 20,
"controller_baseline_dec": 0.999,
"controller_entropy_weight": 1e-05,
"controller_hidden_size": 64,
"controller_learning_rate": 5e-05,
"controller_num_aggregate": 10,
"controller_num_functions": 4,
"controller_num_layers": 9,
"controller_num_train_steps": 25,
"controller_tanh_constant": 2.25,
"controller_temperature": 5.0,
"data_path": "/home/ma-user/modelarts/inputs/data_url_0/ptb/ptb.pkl",
"drop_e": 0.1,
"drop_i": 0.2,
"drop_l": 0.25,
"drop_o": 0.75,
"drop_w": 0.0,
"drop_x": 0.75,
"grad_bound": 0.1,
"hidden_size": 200,
"init_range": 0.04,
"learning_rate": 20.0,
"log_every": 200,
"num_train_batches": 726,
"num_train_epochs": 600,
"num_train_steps": 435600,
"output_dir": "/home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search",
"vocab_size": 10000,
"weight_decay": 8e-07
}
INFO:tensorflow:Create CheckpointSaverHook.
I1201 19:57:35.173776 281473056653680 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
WARNING:tensorflow:From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1201 19:57:47.176337 281473056653680 deprecation.py:323] From /home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Graph was finalized.
I1201 19:57:47.395758 281473056653680 monitored_session.py:240] Graph was finalized.
2021-12-01 19:57:47.418316: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-12-01 19:57:47.426044: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xaaaae124e110 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-12-01 19:57:47.426090: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-12-01 19:57:52.445980: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:1
INFO:tensorflow:Running local_init_op.
I1201 19:58:04.936841 281473056653680 session_manager.py:500] Running local_init_op.
2021-12-01 19:58:05.116458: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:21
INFO:tensorflow:Done running local_init_op.
I1201 19:58:05.172280 281473056653680 session_manager.py:502] Done running local_init_op.
2021-12-01 19:58:05.551652: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:41
INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/model.ckpt.
I1201 19:58:24.926260 281473056653680 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ma-user/modelarts/user-job-dir/enas_lm_npu_for_TensorFlow/enas_lm_npu_20211114162907/src/output/search/model.ckpt.
2021-12-01 19:58:25.179278: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:51
2021-12-01 19:58:33.356756: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:71
2021-12-01 19:58:33.641136: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:81
2021-12-01 19:58:34.214017: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:91
2021-12-01 19:58:36.420151: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:101
[WARNING] TBE:2021-12-01-20:00:28 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:29 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*3328)) + 4096) - 1), 4096)*4096) + 16826368), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:29 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(m(1, 2147483647)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:29 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 15), 16)*3328)) + 4096) - 1), 4096)*4096) + 16826368), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*53248)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 3), 4)*3328)) + 4096) - 1), 4096)*4096) + 17039360), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 63), 64)*53248)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:30 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16*(floordiv((m(1, 2147483647) + 31), 32)*3328)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(floordiv((m(1, 2147483647) + 31), 32)*53248)) + 4096) - 1), 4096)*4096) + 16719872), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 18743296, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((40960000 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:31 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 17039360, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:32 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20971520 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:32 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(floordiv((m(1, 2147483647) + 255), 256)*425984)) + 4096) - 1), 4096)4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:32 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 34131968, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:46 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16((floordiv((m(1, 2147483647) + 15), 16)32)(floordiv((k(1, 2147483647) + 31), 32)*512))) + 4096) - 1), 4096)4096) + (floordiv((((16(floordiv((k(1, 2147483647) + 31), 32)*5120000)) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:00:46 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16(floordiv((k(1, 2147483647) + 31), 32)*5120000)) + 4096) - 1), 4096)4096) + (floordiv((((16(floordiv((k(1, 2147483647) + 31), 32)*8192)) + 4096) - 1), 4096)4096)) + (floordiv((((16(floordiv((k(1, 2147483647) + 31), 32)*8192)) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:00 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(m(1, 2147483647)*160000)) + 4096) - 1), 4096)*4096) + 40960000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:00 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:00 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16(m(1, 2147483647)*160000)) + 4096) - 1), 4096)*4096) + 5120000) + 5120000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:01 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (34078720 + (floordiv(((max(max(max(65536, 65536), 65536), 65536) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:01 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:01 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(32768, 32768), 32768), 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:02 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:02 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16(floordiv((m(1, 2147483647) + 7), 8)*655360)) + 4096) - 1), 4096)4096) + 65536) + 65536), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(floordiv((m(1, 2147483647) + 15), 16)*643072)) + 4096) - 1), 4096)*4096) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(32768, 32768), 32768), 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:03 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(floordiv((m(1, 2147483647) + 31), 32)*2560000)) + 4096) - 1), 4096)*4096) + 10240000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(32768, 32768), 32768), 32768) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(floordiv((m(1, 2147483647) + 63), 64)*2560000)) + 4096) - 1), 4096)*4096) + 33280000), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((10354688 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 21233664, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(32768, 32768) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16(floordiv((m(1, 2147483647) + 63), 64)*641024)) + 4096) - 1), 4096)4096) + 1048576) + 1048576), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((floordiv((((16(floordiv((m(1, 2147483647) + 31), 32)*163840)) + 4096) - 1), 4096)4096) + 34078720), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:04 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16(floordiv((m(1, 2147483647) + 63), 64)*640000)) + 4096) - 1), 4096)*4096) + 1048576) + 1048576), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:05 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (10354688 + (floordiv(((max(max(max(262144, 262144), 262144), 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is 21757952, while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(131072, 131072), 131072), 131072) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:06 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((10240000 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (((floordiv((((16(floordiv((m(1, 2147483647) + 255), 256)*1294336)) + 4096) - 1), 4096)*4096) + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)) + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20709376 + (floordiv(((max(1048576, 1048576) + 4096) - 1), 4096)*4096)) + (floordiv(((max(1048576, 1048576) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20709376 + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)) + (floordiv(((max(524288, 524288) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is ((20578304 + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)) + (floordiv(((max(262144, 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
[WARNING] TBE:2021-12-01-20:01:07 [storage_rewrite_cce.cc:554] Allocation exceed bound of memory, tag local.L1. It may cause read/write out of range.Current kernel used size is (33652736 + (floordiv(((max(max(max(262144, 262144), 262144), 262144) + 4096) - 1), 4096)*4096)), while memory size is 8388608 (unit is bit)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp19_0GEOP::::DoRunAsync Failed
Error Message is :
[[{{node GeOp19_0}}]]
I1201 20:02:01.978578 281465065296352 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, GeOp19_0GEOP::::DoRunAsync Failed
Error Message is :
[[{{node GeOp19_0}}]]
2021-12-01 20:02:02.178498: I /home/jenkins/agent/workspace/Compile_GraphEngine_Centos_ARM/tensorflow/tf_adapter/kernels/geop_npu.cc:765] The model has been compiled on the Ascend AI processor, current graph id is:111
time="2021-12-01T20:04:51+08:00" level=info msg="stop upload_command_pid" file="upload.go:54" Command=bootstrap/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T20:04:51+08:00" level=info msg="stop run_command_pid" file="run_train.go:185" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-12-01T20:04:51+08:00" level=error msg="signal: terminated, details info: run command err" file="run_train.go:62" Command=bootstrap/run Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] exiting at 2021-12-01-20:04:52
[ModelArts Service Log][sidecar] exit with
[ModelArts Service Log][sidecar] stop toolkit_obs_upload_pid = 33 by signal SIGTERM
time="2021-12-01T20:04:52+08:00" level=info msg="the periodic upload task exiting..." file="upload.go:59" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
[ModelArts Service Log][sidecar] toolkit_obs_upload 33 ret_code is 0
The inputproducer issue has been resolved; please open separate tickets for the other issues.