Ascend / modelzoo
[BUPT (北邮)]-[Seq2Seq]-[Training error: Fatal Python error: Aborted]
CLOSED
#I3T0TS
Training issue
codingth
Created on 2021-05-26 14:53
Symptom (screenshot attached):
Preliminary analysis: run on the PyCharm ToolKit
ModelArts: { "status": "completed", "group_count": "1", "group_list": [ { "group_name": "job-ma-seq2seq-npu-05-25", "device_count": "1", "instance_count": "1", "instance_list": [ { "pod_name": "job63526518-job-ma-seq2seq-npu-05-25-0", "server_id": "192.168.0.244", "devices": [ { "device_id": "1", "device_ip": "192.2.112.155" } ] } ] } ] }
Software versions:
-- TensorFlow/PyTorch/MindSpore version (source or binary): TF1.15
-- Python version (e.g., Python 3.7.5): python 3.7
-- OS version (e.g., Ubuntu 18.04): Ascend: 1 * Ascend 910 CPU: 24 cores 96 GiB
Log information: MA-seq2seq-npu-05-25-19-03_V0003_job-ma-seq2seq-npu-05-25.0.log
Collect log information as described below, according to your runtime environment.
do nothing
[Modelarts Service Log]user: uid=1101(work) gid=1101(work) groups=1101(work),1000(HwHiAiUser)
[Modelarts Service Log]pwd: /home/work
[Modelarts Service Log]app_url: s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/code/
[Modelarts Service Log]boot_file: code/copy_run_copy.py
[Modelarts Service Log]log_url: /tmp/log/MA-seq2seq-npu-05-25-19-03.log
[Modelarts Service Log]command: code/copy_run_copy.py --data_url=s3://th-base-bucket/seq2seq/data/ --train_url=s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/output/V0003/
[Modelarts Service Log]local_code_dir:
[Modelarts Service Log]Training start at 2021-05-25-19:33:57
[Modelarts Service Log][modelarts_create_log] modelarts-pipe found
[ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log
[Modelarts Service Log]handle inputs of training job
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log][INFO][2021/05/25 19:33:58]: env MA_INPUTS is not found, skip the inputs handler
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log]2021-05-25 19:33:59,047 - modelarts-downloader.py[line:620] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=False,
src='s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/code/', trace=False, type='common', verbose=False) [Modelarts Service Log][modelarts_logger] modelarts-pipe found [ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log [ModelArts Service Log]modelarts-pipe: will write log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log /home/work/user-job-dir [ModelArts Service Log]modelarts-pipe: param for max log length: 1073741824 [ModelArts Service Log]modelarts-pipe: param for whether exit on overflow: 0 [ModelArts Service Log]modelarts-pipe: total length: 24 [Modelarts Service Log][modelarts_logger] modelarts-pipe found [ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log [ModelArts Service Log]modelarts-pipe: will write log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log [ModelArts Service Log]modelarts-pipe: param for max log length: 1073741824 [ModelArts Service Log]modelarts-pipe: param for whether exit on overflow: 0 INFO:root:Using MoXing-v1.17.3-8aa951bc INFO:root:Using OBS-Python-SDK-3.20.7 [Modelarts Service Log]2021-05-25 19:34:00,116 - WARNING - stdout log /var/log/batch-task/job63526518/job-ma-seq2seq-npu-05-25/stdout.log is not found [Modelarts Service Log]2021-05-25 19:34:00,126 - INFO - Ascend Driver: Version=20.2.0 [Modelarts Service Log]2021-05-25 19:34:00,126 - INFO - you are advised to use ASCEND_DEVICE_ID env instead of DEVICE_ID, as the DEVICE_ID env will be discarded in later versions [Modelarts Service Log]2021-05-25 19:34:00,126 - INFO - particularly, ${ASCEND_DEVICE_ID} == ${DEVICE_ID}, it's the logical device id [Modelarts Service Log]2021-05-25 19:34:00,126 - INFO - Davinci training command [Modelarts Service Log]2021-05-25 19:34:00,126 - INFO - ['/usr/bin/python', '/home/work/user-job-dir/code/copy_run_copy.py', '--data_url=s3://th-base-bucket/seq2seq/data/', '--train_url=s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/output/V0003/'] 
[Modelarts Service Log]2021-05-25 19:34:00,127 - INFO - Wait for Rank table file ready [Modelarts Service Log]2021-05-25 19:34:00,127 - INFO - Rank table file (K8S generated) is ready for read [Modelarts Service Log]2021-05-25 19:34:00,127 - INFO - { "status": "completed", "group_count": "1", "group_list": [ { "group_name": "job-ma-seq2seq-npu-05-25", "device_count": "1", "instance_count": "1", "instance_list": [ { "pod_name": "job63526518-job-ma-seq2seq-npu-05-25-0", "server_id": "192.168.0.244", "devices": [ { "device_id": "1", "device_ip": "192.2.112.155" } ] } ] } ] } [Modelarts Service Log]2021-05-25 19:34:00,127 - INFO - Rank table file (C7x) [Modelarts Service Log]2021-05-25 19:34:00,128 - INFO - { "status": "completed", "version": "1.0", "server_count": "1", "server_list": [ { "server_id": "192.168.0.244", "device": [ { "device_id": "1", "device_ip": "192.2.112.155", "rank_id": "0" } ] } ] } [Modelarts Service Log]2021-05-25 19:34:00,128 - INFO - Rank table file (C7x) is generated [Modelarts Service Log]2021-05-25 19:34:00,128 - INFO - Current server [Modelarts Service Log]2021-05-25 19:34:00,129 - INFO - { "server_id": "192.168.0.244", "device": [ { "device_id": "1", "device_ip": "192.2.112.155", "rank_id": "0" } ] } [Modelarts Service Log]2021-05-25 19:34:00,129 - INFO - bootstrap proc-rank-0-device-0 [Modelarts Service Log]2021-05-25 19:34:00,136 - INFO - proc-rank-0-device-0 (pid: 94) INFO:root:Using MoXing-v1.17.3-8aa951bc INFO:root:Using OBS-Python-SDK-3.20.7 ===>>>/cache/user-job-dir/workspace/device0 PYTHONUNBUFFERED=1 LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/home/work/anaconda/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/usr/local/openmpi/lib:/usr/lib/aarch64-linux-gnu/hdf5/serial:/usr/local/Ascend/nnae/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver: BATCH_TASK_CURRENT_HOST_IP=192.168.0.244 BATCH_GROUP_NAME=job-ma-seq2seq-npu-05-25 
DLS_USE_DOWNLOADER=1 VK_TASK_INDEX=0 TF_PLUGIN_PKG=/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages DLS_UPLOAD_LOG_OBS_DIR=s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/log/ _=/usr/bin/env DLS_KEY_PROJECT_ID=07d9825b9b8025e52f17c002007ffb7b DLS_KEY_ENDPOINT=modelarts-job-manager-internal.cn-north-4.myhuaweicloud.com:50000/v1 HOSTNAME=job63526518-job-ma-seq2seq-npu-05-25-0 OLDPWD=/home/work BATCH_JOB_ID=job63526518 BATCH_TASK_INDEX=0 NPU-VISIBLE-DEVICES=1 DLS_KEY_USE_HTTPS=1 JAVA_HOME=/home/work/jdk1.8.0_212 DLS_KEY_VERIFY_SSL=0 CLASS_PATH=.:/home/work/jdk1.8.0_212/lib/dt.jar:/home/work/jdk1.8.0_212/lib/tools.jar:/home/work/jdk1.8.0_212/jre/lib ASCEND_AICPU_PATH=/usr/local/Ascend/nnae/latest/ MA_MOUNT_SERVICE_ACCOUNT_TOKEN=false MA_ENGINE_TYPE=dengine TBE_IMPL_PATH=/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe BATCH_TASK_LOG_PATH=/var/log/batch-task/job63526518/job-ma-seq2seq-npu-05-25 FWK_PYTHON_PATH=/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages PWD=/cache/user-job-dir/workspace/device0 ASCEND_DEVICE_ID=0 RANK_ID=0 HOME=/home/work LC_CTYPE=C.UTF-8 DLS_USE_UPLOADER=1 _STDBUF_E=L DEVICE_ID=0 _STDBUF_O=L S3_USE_HTTPS=1 RANK_SIZE=1 EXPERIMENTAL_DYNAMIC_PARTITION=1 RANK_TABLE_FILE=/home/work/rank_table/jobstart_hccl.json BATCH_TASK_CURRENT_INSTANCE=job63526518-job-ma-seq2seq-npu-05-25-0 S3_ENDPOINT=obs.cn-north-4.myhuaweicloud.com BATCH_CURRENT_SERVICE=job63526518-job-ma-seq2seq-npu-05-25-0.job63526518 GLOG_v=2 FMK_WORKSPACE=/home/work/user-job-dir/workspace S3_VERIFY_SSL=0 BATCH_TASK_NAME=job-ma-seq2seq-npu-05-25.0 PAAS_POD_ID=bc57c527-d65e-4dd5-863e-2d4f46d20f9a S3_REGION=cn-north-4 BATCH_OUTPUT_PATH=shm volume /dev/shm VC_TASK_INDEX=0 BATCH_CLUSTER_ID= ASCEND_OPP_PATH=/usr/local/Ascend/nnae/latest/opp HCCL_CONNECT_TIMEOUT=1800 JOB_ID=job63526518 VC_JOB-MA-SEQ2SEQ-NPU-05-25_HOSTS=job63526518-job-ma-seq2seq-npu-05-25-0.job63526518 AWS_SHARED_CREDENTIALS_FILE=/run/secrets/kubernetes.io/batch/config SHLVL=3 
PYTHONPATH=/home/work/user-job-dir:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/auto_tune.egg:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/schedule_search.egg:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe: BATCH_TASK_REPLICAS=1 BATCH_CURRENT_PORT=6666 DLS_KEY_JOB_ID=fakeJobId MA_ENABLE_SERVICE_LINK=false project_id=0b5c20a06c000fa92f6dc004a5e7fdb0 JRE_HOME=/home/work/jdk1.8.0_212/jre PATH=/home/work/anaconda/bin:/home/work/jdk1.8.0_212/bin:/home/work/jdk1.8.0_212/jre/bin:/home/work/ddk/bin/x86_64-linux-gcc5.4:/usr/local/openmpi/bin:/usr/local/ma/python3.7/bin/:/usr/local/Ascend/nnae/latest/fwkacllib/ccec_compiler/bin:/usr/local/Ascend/nnae/latest/fwkacllib/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin BATCH_CURRENT_HOST=job63526518-job-ma-seq2seq-npu-05-25-0.job63526518:6666 DATASET_ENABLE_NUMA=True LD_PRELOAD=/usr/libexec/coreutils/libstdbuf.so DLS_LOCAL_CACHE_PATH=/cache DLS_KEY_VERIFY_CODE=bGi7bt0tdxsQy9z+TowU7Q== BATCH_JOB-MA-SEQ2SEQ-NPU-05-25_HOSTS=job63526518-job-ma-seq2seq-npu-05-25-0.job63526518:6666 VC_JOB-MA-SEQ2SEQ-NPU-05-25_NUM=1 0 ===>>>Copy files from obs:s3://th-base-bucket/seq2seq/data/ to local dir:/cache/data ===>>>Copy from obs to local, time use:27(s) ===>>>Files number: 14 ===>>>Begin training: WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:225: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead. 
Looking in indexes: http://repo.myhuaweicloud.com/repository/pypi/simple Collecting nltk Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB) Requirement already satisfied: click in /usr/local/ma/python3.7/lib/python3.7/site-packages (from nltk) (7.1.2) Requirement already satisfied: joblib in /usr/local/ma/python3.7/lib/python3.7/site-packages (from nltk) (1.0.1) Collecting regex (from nltk) Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/38/3f/4c42a98c9ad7d08c16e7d23b2194a0e4f3b2914662da8bc88986e4e6de1f/regex-2021.4.4.tar.gz (693kB) Requirement already satisfied: tqdm in /usr/local/ma/python3.7/lib/python3.7/site-packages (from nltk) (4.46.1) Building wheels for collected packages: regex Building wheel for regex (setup.py): started Building wheel for regex (setup.py): finished with status 'done' Created wheel for regex: filename=regex-2021.4.4-cp37-cp37m-linux_aarch64.whl size=592903 sha256=5698e14a8828c9e7e59c4cae2ecf85c0e46944b4dc07749c781eaeba65b98a75 Stored in directory: /home/work/.cache/pip/wheels/a2/19/67/0ba8880c01cf30e0e1d07bd03131a269df498acb731609de6c Successfully built regex Installing collected packages: regex, nltk Successfully installed nltk-3.6.2 regex-2021.4.4 WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:99: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead. WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:474: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead. ----------------train----------------- Preparing WMT data in /cache/data The English training set is: /cache/data/giga-fren.release2.fixed.en WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:200: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead. 
W0525 19:34:55.396721 281473811476496 module_wrapper.py:139] From /home/work/user-job-dir/code/translate.py:200: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead. 2021-05-25 19:34:55.436602: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency 2021-05-25 19:34:55.445052: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x39d18b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-05-25 19:34:55.445103: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version Creating 4 layers of 1000 units. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:170: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead. W0525 19:34:55.469472 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:170: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:131: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0525 19:34:55.569444 281473811476496 deprecation.py:506] From /cache/user-job-dir/code/seq2seq_model.py:131: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:132: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead. 
W0525 19:34:55.569717 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:132: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:132: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. W0525 19:34:55.569888 281473811476496 deprecation.py:323] From /cache/user-job-dir/code/seq2seq_model.py:132: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:140: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0. W0525 19:34:55.572441 281473811476496 deprecation.py:323] From /cache/user-job-dir/code/seq2seq_model.py:140: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0. 2021-05-25 19:34:55.576603: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them. WARNING:tensorflow:From /cache/user-job-dir/code/override_contrib/override_seq2seq.py:380: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version. 
Instructions for updating: Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API W0525 19:34:55.578600 281473811476496 deprecation.py:323] From /cache/user-job-dir/code/override_contrib/override_seq2seq.py:380: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version. Instructions for updating: Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:958: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.add_weight` method instead. W0525 19:34:56.095069 281473811476496 deprecation.py:323] From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:958: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.add_weight` method instead. WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:962: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0525 19:34:56.106187 281473811476496 deprecation.py:506] From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:962: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /cache/user-job-dir/code/override_contrib/core_rnn_cell.py:105: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0525 19:34:56.611823 281473811476496 deprecation.py:506] From /cache/user-job-dir/code/override_contrib/core_rnn_cell.py:105: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:212: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead. W0525 19:35:15.063675 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:212: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:216: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead. W0525 19:35:15.064075 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:216: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead. WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. 
Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0525 19:35:20.560996 281473811476496 deprecation.py:323] From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead. W0525 19:36:15.462924 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead. W0525 19:36:15.463409 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead. WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:168: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead. W0525 19:36:15.519186 281473811476496 module_wrapper.py:139] From /home/work/user-job-dir/code/translate.py:168: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead. 2021-05-25 19:36:23.259319: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node _SOURCE is null. 2021-05-25 19:36:23.259387: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node _SINK is null. 2021-05-25 19:36:23.260393: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node init is null. 
2021-05-25 19:36:30.049844: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: failed
2021-05-25 19:36:33.050093: F tf_adapter/kernels/geop_npu.cc:702] GeOp1_0GEOP::::DoRunAsync Failed
Fatal Python error: Aborted

Thread 0x0000fffe237fe1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3d9791e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3e17a1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe44aba1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe452bb1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3aff51e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3a7f41e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe39ff31e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe397f21e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000ffffba8bf010 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/home/work/user-job-dir/code/translate.py", line 168 in create_model
  File "/home/work/user-job-dir/code/translate.py", line 203 in train
  File "/home/work/user-job-dir/code/translate.py", line 470 in main
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/absl/app.py", line 251 in _run_main
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/absl/app.py", line 303 in run
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "/home/work/user-job-dir/code/translate.py", line 474 in <module>
===>>>Training finished:
===>>>Copy files from local dir:/cache/model to obs:s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/output/V0003/result
INFO:root:No files to copy.
===>>>Copy from local to obs, time use:0(s)
===>>>Files number: 0
2021-05-25 19:36:45,599 735 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:45,652 734 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:45,706 739 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:45,753 737 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:45,784 736 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:45,835 732 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:45,905 733 PCOMPILE Master process dead. worker process quiting..
2021-05-25 19:36:46,008 738 PCOMPILE Master process dead. worker process quiting..
/usr/local/ma/python3.7/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
[Modelarts Service Log]2021-05-25 19:36:47,305 - INFO - Begin destroy training processes
[Modelarts Service Log]2021-05-25 19:36:47,306 - INFO - proc-rank-0-device-0 (pid: 94) has exited
[Modelarts Service Log]2021-05-25 19:36:47,306 - INFO - End destroy training processes
[ModelArts Service Log]modelarts-pipe: total length: 25844
[Modelarts Service Log]Training end with return code: 0
[Modelarts Service Log]upload ascend-log to s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/log/ascend-log/ at 2021-05-25-19:36:47
upload_tail_log.py -l 2048 -o s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/log/ascend-log/
INFO:root:No files to copy.
[Modelarts Service Log]upload ascend-log end at 2021-05-25-19:36:48
[Modelarts Service Log]handle outputs of training job
[Modelarts Service Log][modelarts_logger] modelarts-pipe found
[ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log
[ModelArts Service Log]modelarts-pipe: will write log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log
[ModelArts Service Log]modelarts-pipe: param for max log length: 1073741824
[ModelArts Service Log]modelarts-pipe: param for whether exit on overflow: 0
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[Modelarts Service Log]Training end at 2021-05-25-19:36:48
[Modelarts Service Log]Training completed.
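
Note on reading this log: the job is reported as "Training end with return code: 0" and "Training completed." even though the training process hit "Fatal Python error: Aborted", so the job status alone does not show the failure. Below is a minimal Python sketch for triaging a log like this one. It assumes the log is available as plain text with one entry per line, and it only looks for the two failure markers that appear in this issue. The names `triage`, `FATAL_PATTERNS`, `RETURN_CODE` and `SAMPLE` are hypothetical helpers made up for illustration; none of them is a ModelArts, CANN or TensorFlow API.

```python
import re

# Failure markers copied from the log in this issue (assumption: other
# failure modes will need additional patterns).
FATAL_PATTERNS = [
    re.compile(r"Fatal Python error"),
    re.compile(r"DoRunAsync Failed"),
]

# Line written by the ModelArts service wrapper at the end of the job.
RETURN_CODE = re.compile(r"Training end with return code:\s*(-?\d+)")


def triage(log_text):
    """Return (fatal_lines, return_code, masked) for a job log.

    masked is True when fatal markers are present but the reported
    return code is 0, which is the situation seen in this issue.
    """
    fatal_lines = [
        line.strip()
        for line in log_text.splitlines()
        if any(p.search(line) for p in FATAL_PATTERNS)
    ]
    match = RETURN_CODE.search(log_text)
    return_code = int(match.group(1)) if match else None
    masked = bool(fatal_lines) and return_code == 0
    return fatal_lines, return_code, masked


# Three lines taken verbatim from the log above, used as sample input.
SAMPLE = """\
2021-05-25 19:36:33.050093: F tf_adapter/kernels/geop_npu.cc:702] GeOp1_0GEOP::::DoRunAsync Failed
Fatal Python error: Aborted
[Modelarts Service Log]Training end with return code: 0
"""

if __name__ == "__main__":
    fatal, code, masked = triage(SAMPLE)
    for line in fatal:
        print("FATAL:", line)
    print("return code:", code, "| failure masked:", masked)
```

For locating the failure itself, the last thread in the dump (0x0000ffffba8bf010) is the one to read: it was inside `session.run` called from `translate.py` line 168 in `create_model`, the same line the earlier deprecation warning attributes to `tf.global_variables_initializer`. Why the wrapper still reported return code 0 cannot be determined from the log alone.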
PYTHONPATH=/home/work/user-job-dir:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/auto_tune.egg:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/schedule_search.egg:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe: BATCH_TASK_REPLICAS=1 BATCH_CURRENT_PORT=6666 DLS_KEY_JOB_ID=fakeJobId MA_ENABLE_SERVICE_LINK=false project_id=0b5c20a06c000fa92f6dc004a5e7fdb0 JRE_HOME=/home/work/jdk1.8.0_212/jre PATH=/home/work/anaconda/bin:/home/work/jdk1.8.0_212/bin:/home/work/jdk1.8.0_212/jre/bin:/home/work/ddk/bin/x86_64-linux-gcc5.4:/usr/local/openmpi/bin:/usr/local/ma/python3.7/bin/:/usr/local/Ascend/nnae/latest/fwkacllib/ccec_compiler/bin:/usr/local/Ascend/nnae/latest/fwkacllib/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin BATCH_CURRENT_HOST=job63526518-job-ma-seq2seq-npu-05-25-0.job63526518:6666 DATASET_ENABLE_NUMA=True LD_PRELOAD=/usr/libexec/coreutils/libstdbuf.so DLS_LOCAL_CACHE_PATH=/cache DLS_KEY_VERIFY_CODE=bGi7bt0tdxsQy9z+TowU7Q== BATCH_JOB-MA-SEQ2SEQ-NPU-05-25_HOSTS=job63526518-job-ma-seq2seq-npu-05-25-0.job63526518:6666 VC_JOB-MA-SEQ2SEQ-NPU-05-25_NUM=1 0 ===>>>Copy files from obs:s3://th-base-bucket/seq2seq/data/ to local dir:/cache/data ===>>>Copy from obs to local, time use:27(s) ===>>>Files number: 14 ===>>>Begin training: WARNING:tensorflow:From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_optimizer.py:225: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead. 
Looking in indexes: http://repo.myhuaweicloud.com/repository/pypi/simple Collecting nltk Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB) Requirement already satisfied: click in /usr/local/ma/python3.7/lib/python3.7/site-packages (from nltk) (7.1.2) Requirement already satisfied: joblib in /usr/local/ma/python3.7/lib/python3.7/site-packages (from nltk) (1.0.1) Collecting regex (from nltk) Downloading http://repo.myhuaweicloud.com/repository/pypi/packages/38/3f/4c42a98c9ad7d08c16e7d23b2194a0e4f3b2914662da8bc88986e4e6de1f/regex-2021.4.4.tar.gz (693kB) Requirement already satisfied: tqdm in /usr/local/ma/python3.7/lib/python3.7/site-packages (from nltk) (4.46.1) Building wheels for collected packages: regex Building wheel for regex (setup.py): started Building wheel for regex (setup.py): finished with status 'done' Created wheel for regex: filename=regex-2021.4.4-cp37-cp37m-linux_aarch64.whl size=592903 sha256=5698e14a8828c9e7e59c4cae2ecf85c0e46944b4dc07749c781eaeba65b98a75 Stored in directory: /home/work/.cache/pip/wheels/a2/19/67/0ba8880c01cf30e0e1d07bd03131a269df498acb731609de6c Successfully built regex Installing collected packages: regex, nltk Successfully installed nltk-3.6.2 regex-2021.4.4 WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:99: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead. WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:474: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead. ----------------train----------------- Preparing WMT data in /cache/data The English training set is: /cache/data/giga-fren.release2.fixed.en WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:200: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead. 
W0525 19:34:55.396721 281473811476496 module_wrapper.py:139] From /home/work/user-job-dir/code/translate.py:200: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead. 2021-05-25 19:34:55.436602: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency 2021-05-25 19:34:55.445052: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x39d18b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-05-25 19:34:55.445103: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version Creating 4 layers of 1000 units. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:170: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead. W0525 19:34:55.469472 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:170: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:131: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0525 19:34:55.569444 281473811476496 deprecation.py:506] From /cache/user-job-dir/code/seq2seq_model.py:131: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:132: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead. 
W0525 19:34:55.569717 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:132: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:132: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. W0525 19:34:55.569888 281473811476496 deprecation.py:323] From /cache/user-job-dir/code/seq2seq_model.py:132: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:140: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0. W0525 19:34:55.572441 281473811476496 deprecation.py:323] From /cache/user-job-dir/code/seq2seq_model.py:140: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0. 2021-05-25 19:34:55.576603: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them. WARNING:tensorflow:From /cache/user-job-dir/code/override_contrib/override_seq2seq.py:380: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version. 
Instructions for updating: Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API W0525 19:34:55.578600 281473811476496 deprecation.py:323] From /cache/user-job-dir/code/override_contrib/override_seq2seq.py:380: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version. Instructions for updating: Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:958: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.add_weight` method instead. W0525 19:34:56.095069 281473811476496 deprecation.py:323] From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:958: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use `layer.add_weight` method instead. WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:962: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0525 19:34:56.106187 281473811476496 deprecation.py:506] From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:962: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /cache/user-job-dir/code/override_contrib/core_rnn_cell.py:105: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0525 19:34:56.611823 281473811476496 deprecation.py:506] From /cache/user-job-dir/code/override_contrib/core_rnn_cell.py:105: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:212: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead. W0525 19:35:15.063675 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:212: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:216: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead. W0525 19:35:15.064075 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:216: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead. WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. 
Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0525 19:35:20.560996 281473811476496 deprecation.py:323] From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead. W0525 19:36:15.462924 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead. WARNING:tensorflow:From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead. W0525 19:36:15.463409 281473811476496 module_wrapper.py:139] From /cache/user-job-dir/code/seq2seq_model.py:225: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead. WARNING:tensorflow:From /home/work/user-job-dir/code/translate.py:168: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead. W0525 19:36:15.519186 281473811476496 module_wrapper.py:139] From /home/work/user-job-dir/code/translate.py:168: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead. 2021-05-25 19:36:23.259319: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node _SOURCE is null. 2021-05-25 19:36:23.259387: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node _SINK is null. 2021-05-25 19:36:23.260393: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node init is null. 
2021-05-25 19:36:30.049844: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: failed
2021-05-25 19:36:33.050093: F tf_adapter/kernels/geop_npu.cc:702] GeOp1_0GEOP::::DoRunAsync Failed
Fatal Python error: Aborted

Thread 0x0000fffe237fe1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3d9791e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3e17a1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe44aba1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe452bb1e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3aff51e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe3a7f41e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe39ff31e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000fffe397f21e0 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/ma/python3.7/lib/python3.7/multiprocessing/queues.py", line 224 in _feed
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 870 in run
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/ma/python3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000ffffba8bf010 (most recent call first):
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/home/work/user-job-dir/code/translate.py", line 168 in create_model
  File "/home/work/user-job-dir/code/translate.py", line 203 in train
  File "/home/work/user-job-dir/code/translate.py", line 470 in main
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/absl/app.py", line 251 in _run_main
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/absl/app.py", line 303 in run
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "/home/work/user-job-dir/code/translate.py", line 474 in <module>

===>>>Training finished:
===>>>Copy files from local dir:/cache/model to obs:s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/output/V0003/result
INFO:root:No files to copy.
===>>>Copy from local to obs, time use:0(s) ===>>>Files number: 0 2021-05-25 19:36:45,599 735 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:45,652 734 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:45,706 739 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:45,753 737 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:45,784 736 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:45,835 732 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:45,905 733 PCOMPILE Master process dead. worker process quiting.. 2021-05-25 19:36:46,008 738 PCOMPILE Master process dead. worker process quiting.. /usr/local/ma/python3.7/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown len(cache)) [Modelarts Service Log]2021-05-25 19:36:47,305 - INFO - Begin destroy training processes [Modelarts Service Log]2021-05-25 19:36:47,306 - INFO - proc-rank-0-device-0 (pid: 94) has exited [Modelarts Service Log]2021-05-25 19:36:47,306 - INFO - End destroy training processes [ModelArts Service Log]modelarts-pipe: total length: 25844 [Modelarts Service Log]Training end with return code: 0 [Modelarts Service Log]upload ascend-log to s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/log/ascend-log/ at 2021-05-25-19:36:47 upload_tail_log.py -l 2048 -o s3://th-base-bucket/seq2seq/model/MA-seq2seq-npu-05-25-19-03/log/ascend-log/ INFO:root:No files to copy. 
[Modelarts Service Log]upload ascend-log end at 2021-05-25-19:36:48 [Modelarts Service Log]handle outputs of training job [Modelarts Service Log][modelarts_logger] modelarts-pipe found [ModelArts Service Log]modelarts-pipe: will create log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log [ModelArts Service Log]modelarts-pipe: will write log file /tmp/log/MA-seq2seq-npu-05-25-19-03.log [ModelArts Service Log]modelarts-pipe: param for max log length: 1073741824 [ModelArts Service Log]modelarts-pipe: param for whether exit on overflow: 0 INFO:root:Using MoXing-v1.17.3-8aa951bc INFO:root:Using OBS-Python-SDK-3.20.7 [Modelarts Service Log]Training end at 2021-05-25-19:36:48 [Modelarts Service Log]Training completed.
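Note that the log above ends with "Training end with return code: 0" and "Training completed." even though the Python process aborted inside `session.run`. The exit status alone therefore says nothing about whether the run succeeded. Below is a minimal sketch of a log triage helper that flags this situation. The function name `triage_log` and the pattern list `FATAL_PATTERNS` are hypothetical, made up for illustration. The patterns are derived only from the messages visible in this log, so they are an assumption about what a failure looks like and not an official ModelArts or Ascend interface.

```python
import re

# Hypothetical patterns, taken from the messages seen in the log above.
FATAL_PATTERNS = [
    re.compile(r"Fatal Python error: \w+"),   # emitted by CPython on abort
    re.compile(r": F \S+\.cc:\d+\]"),         # TensorFlow-style log line with severity "F"
    re.compile(r"DoRunAsync Failed"),         # message printed by tf_adapter in this log
]

# Matches the return code line that the ModelArts service log prints here.
RETURN_CODE = re.compile(r"Training end with return code: (-?\d+)")


def triage_log(log_text):
    """Collect fatal markers and the reported return code from raw log text.

    Searches the whole text instead of going line by line, because pasted
    logs are often flattened into a few very long lines.
    """
    fatal_markers = [
        match.group(0)
        for pattern in FATAL_PATTERNS
        for match in pattern.finditer(log_text)
    ]
    rc_match = RETURN_CODE.search(log_text)
    return_code = int(rc_match.group(1)) if rc_match else None
    return {
        "fatal_markers": fatal_markers,
        "return_code": return_code,
        # Suspicious: the job claims success but the log contains fatal markers.
        "suspicious": bool(fatal_markers) and return_code == 0,
    }


if __name__ == "__main__":
    sample = (
        "2021-05-25 19:36:33.050093: F tf_adapter/kernels/geop_npu.cc:702] "
        "GeOp1_0GEOP::::DoRunAsync Failed Fatal Python error: Aborted "
        "[Modelarts Service Log]Training end with return code: 0"
    )
    print(triage_log(sample))
```

Running such a check on the downloaded job log before trusting the "completed" status would have surfaced the abort at `translate.py` line 168 (`create_model`) right away.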