
Ascend / modelzoo


[East China Normal University][Model Training Challenge][DIEN Model] GeOp5_0GEOP::::DoRunAsync Failed

DONE
Requirement
Created on 2020-12-16 21:32

Environment

/device ascend
The local CPU version of the code runs successfully after porting it to TensorFlow 1.15.
Training launched from PyCharm on the ModelArts platform fails.
Job information reported by PyCharm:

job id: 540517, version id: 1065482
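For context on the failing command in the log below: ModelArts passes the OBS input/output locations to the boot script as `--data_url` and `--train_url` flags. A minimal sketch of how a boot script like train.py typically parses them (the bucket paths here are placeholders, not the real job's paths):

```python
import argparse

# ModelArts appends --data_url / --train_url to the boot command;
# the training script is expected to parse them before training starts.
parser = argparse.ArgumentParser()
parser.add_argument("--data_url", type=str, default="",
                    help="OBS path of the training data")
parser.add_argument("--train_url", type=str, default="",
                    help="OBS path where outputs/checkpoints are written")

# parse_known_args tolerates extra flags the platform may inject.
args, _ = parser.parse_known_args(
    ["--data_url=s3://bucket/data/", "--train_url=s3://bucket/output/"])
```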

The full log is as follows:

do nothing
[Modelarts Service Log]user: uid=1101(work) gid=1101(work) groups=1101(work),1000(HwHiAiUser)
[Modelarts Service Log]pwd: /home/work
[Modelarts Service Log]app_url: s3://model-trainning/DIEN/DIEN-Pycharm/MA-script_local-12-13-21/code/
[Modelarts Service Log]boot_file: code/train.py
[Modelarts Service Log]log_url: /tmp/log/MA-script_local-12-13-21.log
[Modelarts Service Log]command: code/train.py --data_url=s3://model-trainning/DIEN/DIEN-Pycharm/ --train_url=s3://model-trainning/DIEN/DIEN-Pycharm/MA-script_local-12-13-21/output/V0011/
[Modelarts Service Log]local_code_dir: 
[Modelarts Service Log][modelarts_create_log] modelarts-pipe found
[Modelarts Service Log]handle inputs of training job
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log]INFO: env MA_INPUTS is not found, skip the inputs handler
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log]2020-12-16 12:43:10,494 - modelarts-downloader.py[line:612] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=False, src='s3://model-trainning/DIEN/DIEN-Pycharm/MA-script_local-12-13-21/code/', trace=False, type='common', verbose=False)
[Modelarts Service Log][modelarts_logger] modelarts-pipe found
/home/work/user-job-dir
[Modelarts Service Log][modelarts_logger] modelarts-pipe found
[Modelarts Service Log]2020-12-16 12:43:26,598 - INFO - Davinci training command
[Modelarts Service Log]2020-12-16 12:43:26,598 - INFO - ['/usr/bin/python', '/home/work/user-job-dir/code/train.py', '--data_url=s3://model-trainning/DIEN/DIEN-Pycharm/', '--train_url=s3://model-trainning/DIEN/DIEN-Pycharm/MA-script_local-12-13-21/output/V0011/']
[Modelarts Service Log]2020-12-16 12:43:26,599 - INFO - Wait for Rank table file ready
[Modelarts Service Log]2020-12-16 12:43:26,599 - INFO - Rank table file (K8S generated) is ready for read
[Modelarts Service Log]2020-12-16 12:43:26,599 - INFO - 
{
    "status": "completed",
    "group_count": "1",
    "group_list": [
        {
            "group_name": "job-ma-script-local-12-1",
            "device_count": "1",
            "instance_count": "1",
            "instance_list": [
                {
                    "pod_name": "job6fd94b84-job-ma-script-local-12-1-0",
                    "server_id": "192.168.0.187",
                    "devices": [
                        {
                            "device_id": "1",
                            "device_ip": "192.2.216.232"
                        }
                    ]
                }
            ]
        }
    ]
}
[Modelarts Service Log]2020-12-16 12:43:26,600 - INFO - Rank table file (C7x)
[Modelarts Service Log]2020-12-16 12:43:26,600 - INFO - 
{
    "status": "completed",
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "192.168.0.187",
            "device": [
                {
                    "device_id": "1",
                    "device_ip": "192.2.216.232",
                    "rank_id": "0"
                }
            ]
        }
    ]
}
[Modelarts Service Log]2020-12-16 12:43:26,600 - INFO - Rank table file (C7x) is generated
[Modelarts Service Log]2020-12-16 12:43:26,600 - INFO - Slogd startup
[Modelarts Service Log]2020-12-16 12:43:26,603 - INFO - Current server
[Modelarts Service Log]2020-12-16 12:43:26,604 - INFO - 
{
    "server_id": "192.168.0.187",
    "device": [
        {
            "device_id": "1",
            "device_ip": "192.2.216.232",
            "rank_id": "0"
        }
    ]
}
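The C7x rank table printed above is what GE later reads via the `RANK_TABLE_FILE` environment variable (see the GePlugin lines further down). A stdlib-only sketch of extracting the device/rank assignment from that JSON, the way a launcher might before exporting `DEVICE_ID`/`RANK_ID` (this helper is illustrative, not part of the actual job):

```python
import json

# The C7x-format rank table exactly as printed in the log above.
rank_table = json.loads("""
{
    "status": "completed",
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "192.168.0.187",
            "device": [
                {"device_id": "1", "device_ip": "192.2.216.232", "rank_id": "0"}
            ]
        }
    ]
}
""")

# Collect (device_id, rank_id) pairs across all servers in the table.
devices = [(d["device_id"], d["rank_id"])
           for server in rank_table["server_list"]
           for d in server["device"]]
```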
[Modelarts Service Log]2020-12-16 12:43:26,604 - INFO - FMK of device1 startup
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:233: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:122: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:127: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-12-16 12:43:31.520430: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-12-16 12:43:31.531185: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d7c77a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-16 12:43:31.531226: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /cache/user-job-dir/code/model.py:13: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:29: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:30: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:344: GRUCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.GRUCell, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From /cache/user-job-dir/code/rnn.py:567: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:559: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:565: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/rnn_cell_impl.py:575: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /cache/user-job-dir/code/rnn.py:201: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /cache/user-job-dir/code/model.py:103: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:103: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).
WARNING:tensorflow:From /usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/normalization.py:327: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /cache/user-job-dir/code/model.py:104: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /cache/user-job-dir/code/model.py:97: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /cache/user-job-dir/code/utils.py:146: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:82: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:83: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /cache/user-job-dir/code/model.py:89: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:153: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

2020-12-16 12:44:19.812852: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-16 12:44:19.813515: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_1 success. [0 ms]
2020-12-16 12:44:19.813547: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-16 12:44:19.813748: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is False
2020-12-16 12:44:19.813764: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-16 12:44:19.813998: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_1 begin.
2020-12-16 12:44:19.814012: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is False
2020-12-16 12:44:19.814019: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-16 12:44:19.814806: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 0, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-16 12:44:19.814886: I tf_adapter/util/npu_ops_identifier.cc:67] [ALL] Parsing json from /usr/local/Ascend/nnae/latest/opp/framework/built-in/tensorflow/npu_supported_ops.json
2020-12-16 12:44:19.817382: I tf_adapter/util/npu_ops_identifier.cc:69] 690 ops parsed
2020-12-16 12:44:19.817415: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:228] node: Const is not in white list, so currently not support
2020-12-16 12:44:19.817745: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [3 ms]
2020-12-16 12:44:19.823687: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2020-12-16 12:44:19.823715: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 392, max nodes count: 390 in subgraph: GeOp1_0 minGroupSize: 1
2020-12-16 12:44:19.823814: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_1 markForPartition success.
2020-12-16 12:44:19.827709: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1494] subgraphNum: 1
2020-12-16 12:44:19.858973: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1943] OMPartition subgraph_1 SubgraphsInFunctions success.
2020-12-16 12:44:19.859507: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1954] OMPartition subgraph_1 success. [45 ms]
2020-12-16 12:44:19.859616: I tf_adapter/optimizers/dp_tf_ge_conversion_pass.cc:917] DpTfToGEConversionPassImpl::RunPass, enable data preproc is false
2020-12-16 12:44:19.881992: I tf_adapter/optimizers/add_input_pass.cc:96] job is localhost Skip the optimizer : AddInputPass.
2020-12-16 12:44:19.882557: I tf_adapter/kernels/geop_npu.cc:175] [GEOP] Begin GeOp initialize.
2020-12-16 12:44:19.883194: I tf_adapter/util/ge_plugin.cc:52] [GePlugin] new constructor
2020-12-16 12:44:19.883233: I tf_adapter/util/ge_plugin.cc:94] [GePlugin] graph run mode : 1
2020-12-16 12:44:19.897079: I tf_adapter/util/ge_plugin.cc:99] [GePlugin] device id : 0
2020-12-16 12:44:19.897159: I tf_adapter/util/ge_plugin.cc:119] [GePlugin] env RANK_TABLE_FILE:/home/work/rank_table/jobstart_hccl.json
2020-12-16 12:44:19.897249: I tf_adapter/util/ge_plugin.cc:129] [GePlugin] env RANK_ID:0
2020-12-16 12:44:19.897258: I tf_adapter/util/ge_plugin.cc:142] [GePlugin] is_tailing_optimization : 0
2020-12-16 12:44:19.897270: I tf_adapter/util/ge_plugin.cc:145] [GePlugin] profiling_mode : 0, profiling_options:training_trace, fp_point: , bp_point: 
2020-12-16 12:44:19.897312: I tf_adapter/util/ge_plugin.cc:151] [GePlugin] precision_mode : allow_fp32_to_fp16
2020-12-16 12:44:19.897319: I tf_adapter/util/ge_plugin.cc:154] [GePlugin] auto_tune_mode : 
2020-12-16 12:44:19.897324: I tf_adapter/util/ge_plugin.cc:157] [GePlugin] op_debug_level : 0
2020-12-16 12:44:19.897330: I tf_adapter/util/ge_plugin.cc:160] [GePlugin] enable_scope_fusion_passes : 
2020-12-16 12:44:19.897336: I tf_adapter/util/ge_plugin.cc:163] [GePlugin] enable_exception_dump : 0
2020-12-16 12:44:19.897342: I tf_adapter/util/ge_plugin.cc:166] [GePlugin] Start Init tdt host.
2020-12-16 12:44:19.900594: I tf_adapter/util/ge_plugin.cc:172] [GePlugin] Tdt host init succeed.
2020-12-16 12:44:22.878037: I tf_adapter/util/ge_plugin.cc:179] [GePlugin] Initialize ge success.
2020-12-16 12:44:22.885561: I tf_adapter/util/ge_plugin.cc:187] [GePlugin] Initialize parser success.
2020-12-16 12:44:22.885601: I tf_adapter/kernels/geop_npu.cc:203] [GEOP] GePlugin init success
2020-12-16 12:44:22.885649: I tf_adapter/kernels/geop_npu.cc:209] [GEOP] GeOp Initialize success, cost: [3003 ms]
2020-12-16 12:44:22.886086: I tf_adapter/kernels/geop_npu.cc:364] [GEOP] get tf session directfd8a41491659c190 from session handle.
2020-12-16 12:44:22.886163: I tf_adapter/kernels/geop_npu.cc:375] [GEOP] Node name: GeOp1_0 , tf session: directfd8a41491659c190
2020-12-16 12:44:22.886182: I tf_adapter/util/session_manager.cc:100] [GEOP] variable_acceleration :1
2020-12-16 12:44:22.886243: I tf_adapter/util/session_manager.cc:102] [GEOP] hcom_parallel :0
2020-12-16 12:44:22.886249: I tf_adapter/util/session_manager.cc:105] [GEOP] stream_max_parallel_num :
2020-12-16 12:44:22.886261: I tf_adapter/util/session_manager.cc:121] [GEOP] op_select_implmode : high_performance
2020-12-16 12:44:22.886267: I tf_adapter/util/session_manager.cc:123] [GEOP] optypelist_for_implmode : 
2020-12-16 12:44:22.886272: W tf_adapter/util/session_manager.cc:129] [GEOP] can not get DISABLE_REUSE_MEMORY in env, set to default 0
2020-12-16 12:44:22.886282: I tf_adapter/util/session_manager.cc:135] [GEOP] enable_dump :0, dump_path :, dump_step :NA, dump_mode :output, enable_dump_debug :0, dump_debug_mode :all
2020-12-16 12:44:22.886288: I tf_adapter/util/session_manager.cc:82] [GEOP] hcom_parallel :0
2020-12-16 12:44:22.886293: I tf_adapter/util/session_manager.cc:85] [GEOP] stream_max_parallel_num :
2020-12-16 12:44:22.890360: I tf_adapter/kernels/geop_npu.cc:382] [GEOP] tf session: directfd8a41491659c190 get ge session success.
2020-12-16 12:44:22.890393: I tf_adapter/kernels/geop_npu.cc:388] [GEOP] Begin GeOp::ComputeAsync, kernel_name:GeOp1_0, num_inputs:0, num_outputs:0
2020-12-16 12:44:22.890403: I tf_adapter/kernels/geop_npu.cc:251] [GEOP] tf session directfd8a41491659c190, graph id: 1 does not build yet, no need to check rebuild
2020-12-16 12:44:22.890473: I tf_adapter/util/infershape_util.cc:346] InferShapeUtil::InferShape
2020-12-16 12:44:22.890488: I tf_adapter/util/infershape_util.cc:84] The signature name of FunctionDef is GeOp1_0.
2020-12-16 12:44:22.897713: I tf_adapter/util/infershape_util.cc:96] InstantiateFunction GeOp1_0 success.
2020-12-16 12:44:22.900523: I tf_adapter/util/infershape_util.cc:101] ConvertNodeDefsToGraph GeOp1_0 success.
2020-12-16 12:44:22.902472: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SOURCE is null.
2020-12-16 12:44:22.902498: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SINK is null.
2020-12-16 12:44:22.904771: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node init is null.
2020-12-16 12:44:22.904793: I tf_adapter/util/infershape_util.cc:395] InferShapeUtil::InferShape success
2020-12-16 12:44:22.915283: I tf_adapter/kernels/geop_npu.cc:440] [GEOP] In GEOP computeAsync, kernel_name:GeOp1_0 ,TFadapter cost time: [24 ms]
2020-12-16 12:44:22.915314: I tf_adapter/kernels/geop_npu.cc:442] [GEOP] TFadpter process graph success, GE parser begin, kernel_name:GeOp1_0 ,tf session: directfd8a41491659c190 ,graph id :1
2020-12-16 12:44:23.283448: I tf_adapter/kernels/geop_npu.cc:508] [GEOP] Tensorflow graph parse to ge graph success, kernel_name:GeOp1_0 ,tf session: directfd8a41491659c190 ,graph id: 1
2020-12-16 12:44:23.284079: I tf_adapter/kernels/geop_npu.cc:539] [GEOP] Add graph to ge session success, kernel_name:GeOp1_0 ,tf session: directfd8a41491659c190 ,graph id:1
2020-12-16 12:44:23.285938: I tf_adapter/kernels/geop_npu.cc:580] [GEOP] Call ge session RunGraphAsync, kernel_name:GeOp1_0 ,tf session: directfd8a41491659c190 ,graph id: 1
2020-12-16 12:44:23.286041: I tf_adapter/kernels/geop_npu.cc:593] [GEOP] End GeOp::ComputeAsync, kernel_name:GeOp1_0, ret_status:success ,tf session: directfd8a41491659c190 ,graph id: 1 [395 ms]
2020-12-16 12:44:32.950045: I tf_adapter/kernels/geop_npu.cc:76] BuildOutputTensorInfo, num_outputs:0
2020-12-16 12:44:32.950141: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp1_0[ 9664203us]
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:154: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

2020-12-16 12:44:33.131234: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-16 12:44:33.131603: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_2 success. [0 ms]
2020-12-16 12:44:33.131628: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-16 12:44:33.131660: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is False
2020-12-16 12:44:33.131708: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-16 12:44:33.131764: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_3 begin.
2020-12-16 12:44:33.131772: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is False
2020-12-16 12:44:33.131778: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-16 12:44:33.131812: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 0, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-16 12:44:33.131830: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]
2020-12-16 12:44:33.131923: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2020-12-16 12:44:33.131933: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 3, max nodes count: 1 in subgraph: GeOp3_0 minGroupSize: 1
2020-12-16 12:44:33.131943: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:897] Clear isolated NoOp from GeOp3_0
2020-12-16 12:44:33.131961: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_3 markForPartition success.
2020-12-16 12:44:33.131967: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1853] subgraphNum is 0
2020-12-16 12:44:33.132016: I tf_adapter/optimizers/dp_tf_ge_conversion_pass.cc:917] DpTfToGEConversionPassImpl::RunPass, enable data preproc is false
2020-12-16 12:44:33.133758: I tf_adapter/optimizers/add_input_pass.cc:96] job is localhost Skip the optimizer : AddInputPass.
2020-12-16 12:44:34.468043: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-16 12:44:34.468729: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_3 success. [0 ms]
2020-12-16 12:44:34.468762: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-16 12:44:34.468944: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is False
2020-12-16 12:44:34.468959: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-16 12:44:34.469149: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_5 begin.
2020-12-16 12:44:34.469163: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is False
2020-12-16 12:44:34.469170: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-16 12:44:34.470144: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 0, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-16 12:44:34.470539: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:228] node: _Arg is not in white list, so currently not support
2020-12-16 12:44:34.470566: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:228] node: _Retval is not in white list, so currently not support
2020-12-16 12:44:34.470596: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]
2020-12-16 12:44:34.479779: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2020-12-16 12:44:34.479809: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 516, max nodes count: 500 in subgraph: GeOp5_0 minGroupSize: 1
2020-12-16 12:44:34.479992: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_5 markForPartition success.
2020-12-16 12:44:34.484663: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1494] subgraphNum: 1
2020-12-16 12:44:34.492103: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1943] OMPartition subgraph_5 SubgraphsInFunctions success.
2020-12-16 12:44:34.492133: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1954] OMPartition subgraph_5 success. [23 ms]
2020-12-16 12:44:34.492206: I tf_adapter/optimizers/dp_tf_ge_conversion_pass.cc:917] DpTfToGEConversionPassImpl::RunPass, enable data preproc is false
2020-12-16 12:44:34.494428: I tf_adapter/optimizers/add_input_pass.cc:96] job is localhost Skip the optimizer : AddInputPass.
2020-12-16 12:44:34.494828: I tf_adapter/kernels/geop_npu.cc:175] [GEOP] Begin GeOp initialize.
2020-12-16 12:44:34.494894: I tf_adapter/util/ge_plugin.cc:69] [GePlugin] Ge has already initialized
2020-12-16 12:44:34.494920: I tf_adapter/kernels/geop_npu.cc:203] [GEOP] GePlugin init success
2020-12-16 12:44:34.494940: I tf_adapter/kernels/geop_npu.cc:209] [GEOP] GeOp Initialize success, cost: [0 ms]
2020-12-16 12:44:34.495176: I tf_adapter/kernels/geop_npu.cc:364] [GEOP] get tf session directfd8a41491659c190 from session handle.
2020-12-16 12:44:34.495277: I tf_adapter/kernels/geop_npu.cc:375] [GEOP] Node name: GeOp5_0 , tf session: directfd8a41491659c190
2020-12-16 12:44:34.495306: I tf_adapter/util/session_manager.cc:50] tf session directfd8a41491659c190 get ge session success.
2020-12-16 12:44:34.495312: I tf_adapter/kernels/geop_npu.cc:382] [GEOP] tf session: directfd8a41491659c190 get ge session success.
2020-12-16 12:44:34.495320: I tf_adapter/kernels/geop_npu.cc:388] [GEOP] Begin GeOp::ComputeAsync, kernel_name:GeOp5_0, num_inputs:10, num_outputs:4
2020-12-16 12:44:34.495337: I tf_adapter/kernels/geop_npu.cc:251] [GEOP] tf session directfd8a41491659c190, graph id: 11 does not build yet, no need to check rebuild
2020-12-16 12:44:34.496488: I tf_adapter/util/infershape_util.cc:346] InferShapeUtil::InferShape
2020-12-16 12:44:34.496516: I tf_adapter/util/infershape_util.cc:84] The signature name of FunctionDef is GeOp5_0.
2020-12-16 12:44:34.501970: I tf_adapter/util/infershape_util.cc:96] InstantiateFunction GeOp5_0 success.
2020-12-16 12:44:34.504737: I tf_adapter/util/infershape_util.cc:101] ConvertNodeDefsToGraph GeOp5_0 success.
2020-12-16 12:44:34.505320: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_1/gru1/while/Enter --> rnn_1/gru1/while/Merge
2020-12-16 12:44:34.505343: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_1/gru1/while/NextIteration --> rnn_1/gru1/while/Merge
2020-12-16 12:44:34.505357: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_2/gru2/while/Enter --> rnn_2/gru2/while/Merge
2020-12-16 12:44:34.505364: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_2/gru2/while/NextIteration --> rnn_2/gru2/while/Merge
2020-12-16 12:44:34.505377: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_1/gru1/while/Enter_1 --> rnn_1/gru1/while/Merge_1
2020-12-16 12:44:34.505382: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_1/gru1/while/NextIteration_1 --> rnn_1/gru1/while/Merge_1
2020-12-16 12:44:34.505389: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_1/gru1/while/Enter_2 --> rnn_1/gru1/while/Merge_2
2020-12-16 12:44:34.505394: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_1/gru1/while/NextIteration_2 --> rnn_1/gru1/while/Merge_2
2020-12-16 12:44:34.505405: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_2/gru2/while/Enter_2 --> rnn_2/gru2/while/Merge_2
2020-12-16 12:44:34.505411: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_2/gru2/while/NextIteration_2 --> rnn_2/gru2/while/Merge_2
2020-12-16 12:44:34.505431: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_2/gru2/while/Enter_3 --> rnn_2/gru2/while/Merge_3
2020-12-16 12:44:34.505436: I tf_adapter/util/infershape_util.cc:369] in_edges: rnn_2/gru2/while/NextIteration_3 --> rnn_2/gru2/while/Merge_3
2020-12-16 12:44:34.507529: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SOURCE is null.
2020-12-16 12:44:34.507560: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SINK is null.
2020-12-16 12:44:34.508602: W tf_adapter/util/infershape_util.cc:326] The shape of node Reshape output 0 is [?,?,36], unknown shape.
2020-12-16 12:44:34.508678: W tf_adapter/util/infershape_util.cc:326] The shape of node strided_slice_5 output 0 is [?,?,36], unknown shape.
2020-12-16 12:44:34.508757: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/TensorArrayUnstack/range output 0 is [?], unknown shape.
2020-12-16 12:44:34.508848: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/GRUCellZeroState/zeros output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.508891: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Enter_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.508936: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Select/Enter output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509018: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Merge_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509086: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Switch_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509100: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Switch_2 output 1 is [?,36], unknown shape.
2020-12-16 12:44:34.509131: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/TensorArrayReadV3 output 0 is ?, unknown shape.
2020-12-16 12:44:34.509151: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/TensorArrayStack/range output 0 is [?], unknown shape.
2020-12-16 12:44:34.509171: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/concat output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.509189: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/TensorArrayStack/TensorArrayGatherV3 output 0 is ?, unknown shape.
2020-12-16 12:44:34.509210: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/MatMul output 0 is [?,72], unknown shape.
2020-12-16 12:44:34.509228: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/transpose output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509244: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/Shape output 0 is [?], unknown shape.
2020-12-16 12:44:34.509261: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/BiasAdd output 0 is [?,72], unknown shape.
2020-12-16 12:44:34.509283: W tf_adapter/util/infershape_util.cc:326] The shape of node strided_slice_3 output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509313: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/Sigmoid output 0 is [?,72], unknown shape.
2020-12-16 12:44:34.509336: W tf_adapter/util/infershape_util.cc:326] The shape of node concat_4 output 0 is [128,582,?], unknown shape.
2020-12-16 12:44:34.509352: W tf_adapter/util/infershape_util.cc:326] The shape of node concat_5 output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509387: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/TensorArrayUnstack/range output 0 is [?], unknown shape.
2020-12-16 12:44:34.509402: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/split output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509412: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/split output 1 is [?,36], unknown shape.
2020-12-16 12:44:34.509446: W tf_adapter/util/infershape_util.cc:326] The shape of node bn1gru_1/batchnorm/mul_1 output 0 is [?,?,72], unknown shape.
2020-12-16 12:44:34.509485: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/mul output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509500: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/mul_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509512: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/sub output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509526: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/Tile output 0 is [128,?], unknown shape.
2020-12-16 12:44:34.509546: W tf_adapter/util/infershape_util.cc:326] The shape of node bn1gru_1/batchnorm/add_1 output 0 is [?,?,72], unknown shape.
2020-12-16 12:44:34.509564: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/VecAttGRUCellZeroState/zeros output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509591: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/concat_1 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.509607: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/Reshape output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509630: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Enter_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509654: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/MatMul_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509670: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/sub output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509686: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/mul_1 output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509723: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Merge_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509743: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/BiasAdd_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509759: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/concat output 0 is [?,?,?], unknown shape.
2020-12-16 12:44:34.509801: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Switch_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509812: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Switch_2 output 1 is [?,36], unknown shape.
2020-12-16 12:44:34.509846: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/TensorArrayReadV3 output 0 is ?, unknown shape.
2020-12-16 12:44:34.509860: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/Tanh output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509888: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Exit_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509915: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/gates/gates/concat output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.509931: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/mul_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.509954: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.509969: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru_1/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.509997: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/gates/gates/MatMul output 0 is [?,72], unknown shape.
2020-12-16 12:44:34.510013: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/gru_cell/add output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510041: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru/Tensordot/MatMul output 0 is [?,100], unknown shape.
2020-12-16 12:44:34.510058: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru_1/Tensordot/MatMul output 0 is [?,100], unknown shape.
2020-12-16 12:44:34.510075: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/gates/gates/BiasAdd output 0 is [?,72], unknown shape.
2020-12-16 12:44:34.510090: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Select output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510105: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/Select_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510124: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru/Tensordot output 0 is [?,?,100], unknown shape.
2020-12-16 12:44:34.510140: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru_1/Tensordot output 0 is [?,?,100], unknown shape.
2020-12-16 12:44:34.510157: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Sigmoid output 0 is [?,72], unknown shape.
2020-12-16 12:44:34.510177: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_1/gru1/while/NextIteration_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510194: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f1_att1_1/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510209: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru/BiasAdd output 0 is [?,?,100], unknown shape.
2020-12-16 12:44:34.510223: W tf_adapter/util/infershape_util.cc:326] The shape of node f1gru_1/BiasAdd output 0 is [?,?,100], unknown shape.
2020-12-16 12:44:34.510244: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/split output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510255: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/split output 1 is [?,36], unknown shape.
2020-12-16 12:44:34.510272: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f1_att1_1/Tensordot/MatMul output 0 is [?,80], unknown shape.
2020-12-16 12:44:34.510287: W tf_adapter/util/infershape_util.cc:326] The shape of node Sigmoid output 0 is [?,?,100], unknown shape.
2020-12-16 12:44:34.510300: W tf_adapter/util/infershape_util.cc:326] The shape of node Sigmoid_2 output 0 is [?,?,100], unknown shape.
2020-12-16 12:44:34.510319: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/mul output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510334: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f1_att1_1/Tensordot output 0 is [?,?,80], unknown shape.
2020-12-16 12:44:34.510362: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/candidate/candidate/concat output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510378: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f1_att1_1/BiasAdd output 0 is [?,?,80], unknown shape.
2020-12-16 12:44:34.510417: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/candidate/candidate/MatMul output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510433: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f1_att1_1/Sigmoid output 0 is [?,?,80], unknown shape.
2020-12-16 12:44:34.510476: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/candidate/candidate/BiasAdd output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510511: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Tanh output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.510537: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510551: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru_1/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510582: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru/Tensordot/MatMul output 0 is [?,50], unknown shape.
2020-12-16 12:44:34.510597: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru_1/Tensordot/MatMul output 0 is [?,50], unknown shape.
2020-12-16 12:44:34.510619: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru/Tensordot output 0 is [?,?,50], unknown shape.
2020-12-16 12:44:34.510634: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru_1/Tensordot output 0 is [?,?,50], unknown shape.
2020-12-16 12:44:34.510651: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f2_att1_1/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510665: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru/BiasAdd output 0 is [?,?,50], unknown shape.
2020-12-16 12:44:34.510679: W tf_adapter/util/infershape_util.cc:326] The shape of node f2gru_1/BiasAdd output 0 is [?,?,50], unknown shape.
2020-12-16 12:44:34.510698: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f2_att1_1/Tensordot/MatMul output 0 is [?,40], unknown shape.
2020-12-16 12:44:34.510714: W tf_adapter/util/infershape_util.cc:326] The shape of node Sigmoid_1 output 0 is [?,?,50], unknown shape.
2020-12-16 12:44:34.510727: W tf_adapter/util/infershape_util.cc:326] The shape of node Sigmoid_3 output 0 is [?,?,50], unknown shape.
2020-12-16 12:44:34.510746: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f2_att1_1/Tensordot output 0 is [?,?,40], unknown shape.
2020-12-16 12:44:34.510773: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f2_att1_1/BiasAdd output 0 is [?,?,40], unknown shape.
2020-12-16 12:44:34.510813: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f2_att1_1/Sigmoid output 0 is [?,?,40], unknown shape.
2020-12-16 12:44:34.510884: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510899: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru_1/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.510928: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru/Tensordot/MatMul output 0 is [?,2], unknown shape.
2020-12-16 12:44:34.510943: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru_1/Tensordot/MatMul output 0 is [?,2], unknown shape.
2020-12-16 12:44:34.510965: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru/Tensordot output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.510980: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru_1/Tensordot output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.510998: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f3_att1_1/Tensordot/Reshape output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511012: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru/BiasAdd output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.511025: W tf_adapter/util/infershape_util.cc:326] The shape of node f3gru_1/BiasAdd output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.511043: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f3_att1_1/Tensordot/MatMul output 0 is [?,1], unknown shape.
2020-12-16 12:44:34.511058: W tf_adapter/util/infershape_util.cc:326] The shape of node Softmax output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.511072: W tf_adapter/util/infershape_util.cc:326] The shape of node Softmax_1 output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.511091: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f3_att1_1/Tensordot output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511106: W tf_adapter/util/infershape_util.cc:326] The shape of node add output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.511119: W tf_adapter/util/infershape_util.cc:326] The shape of node add_1 output 0 is [?,?,2], unknown shape.
2020-12-16 12:44:34.511138: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/f3_att1_1/BiasAdd output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511152: W tf_adapter/util/infershape_util.cc:326] The shape of node strided_slice_7 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511168: W tf_adapter/util/infershape_util.cc:326] The shape of node strided_slice_8 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511190: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/Reshape_1 output 0 is [?,1,?], unknown shape.
2020-12-16 12:44:34.511204: W tf_adapter/util/infershape_util.cc:326] The shape of node Log output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511219: W tf_adapter/util/infershape_util.cc:326] The shape of node sub output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511242: W tf_adapter/util/infershape_util.cc:326] The shape of node Reshape_1 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511257: W tf_adapter/util/infershape_util.cc:326] The shape of node Log_1 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511274: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/ones_like_1 output 0 is [?,1,?], unknown shape.
2020-12-16 12:44:34.511289: W tf_adapter/util/infershape_util.cc:326] The shape of node Neg output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511302: W tf_adapter/util/infershape_util.cc:326] The shape of node Reshape_2 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511322: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/mul_2 output 0 is [?,1,?], unknown shape.
2020-12-16 12:44:34.511336: W tf_adapter/util/infershape_util.cc:326] The shape of node ArithmeticOptimizer/HoistCommonFactor_Add_add_2 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511382: W tf_adapter/util/infershape_util.cc:326] The shape of node Attention_layer_1/Reshape_2 output 0 is [?,?], unknown shape.
2020-12-16 12:44:34.511402: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/ExpandDims output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511423: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Enter_3 output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511444: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Merge_3 output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511466: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Switch_3 output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511477: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Switch_3 output 1 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511495: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/NextIteration_3 output 0 is [?,?,1], unknown shape.
2020-12-16 12:44:34.511510: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/strided_slice output 0 is [?,1], unknown shape.
2020-12-16 12:44:34.511530: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/sub output 0 is [?,1], unknown shape.
2020-12-16 12:44:34.511550: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/mul_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511568: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/mul_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511582: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/sub_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511599: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/mul_3 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511616: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/add_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511641: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/Select_1 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511665: W tf_adapter/util/infershape_util.cc:326] The shape of node rnn_2/gru2/while/NextIteration_2 output 0 is [?,36], unknown shape.
2020-12-16 12:44:34.511696: I tf_adapter/util/infershape_util.cc:395] InferShapeUtil::InferShape success
2020-12-16 12:44:34.513871: I tf_adapter/kernels/geop_npu.cc:705] [GEOP] no OUTPUT_DESC: rnn_1/gru1/while/Merge <-- rnn_1/gru1/while/NextIteration
2020-12-16 12:44:34.513927: I tf_adapter/kernels/geop_npu.cc:705] [GEOP] no OUTPUT_DESC: rnn_2/gru2/while/Merge <-- rnn_2/gru2/while/NextIteration
2020-12-16 12:44:34.515539: I tf_adapter/kernels/geop_npu.cc:705] [GEOP] no OUTPUT_DESC: rnn_1/gru1/while/Merge_1 <-- rnn_1/gru1/while/NextIteration_1
2020-12-16 12:44:34.515739: I tf_adapter/kernels/geop_npu.cc:705] [GEOP] no OUTPUT_DESC: rnn_1/gru1/while/Merge_2 <-- rnn_1/gru1/while/NextIteration_2
2020-12-16 12:44:34.517236: I tf_adapter/kernels/geop_npu.cc:705] [GEOP] no OUTPUT_DESC: rnn_2/gru2/while/Merge_2 <-- rnn_2/gru2/while/NextIteration_2
2020-12-16 12:44:34.521413: I tf_adapter/kernels/geop_npu.cc:705] [GEOP] no OUTPUT_DESC: rnn_2/gru2/while/Merge_3 <-- rnn_2/gru2/while/NextIteration_3
2020-12-16 12:44:34.528089: I tf_adapter/kernels/geop_npu.cc:440] [GEOP] In GEOP computeAsync, kernel_name:GeOp5_0 ,TFadapter cost time: [32 ms]
2020-12-16 12:44:34.528238: I tf_adapter/kernels/geop_npu.cc:442] [GEOP] TFadpter process graph success, GE parser begin, kernel_name:GeOp5_0 ,tf session: directfd8a41491659c190 ,graph id :11
2020-12-16 12:44:34.644291: I tf_adapter/kernels/geop_npu.cc:508] [GEOP] Tensorflow graph parse to ge graph success, kernel_name:GeOp5_0 ,tf session: directfd8a41491659c190 ,graph id: 11
2020-12-16 12:44:34.644962: I tf_adapter/kernels/geop_npu.cc:539] [GEOP] Add graph to ge session success, kernel_name:GeOp5_0 ,tf session: directfd8a41491659c190 ,graph id:11
2020-12-16 12:44:34.647298: I tf_adapter/kernels/geop_npu.cc:580] [GEOP] Call ge session RunGraphAsync, kernel_name:GeOp5_0 ,tf session: directfd8a41491659c190 ,graph id: 11
2020-12-16 12:44:34.647393: I tf_adapter/kernels/geop_npu.cc:593] [GEOP] End GeOp::ComputeAsync, kernel_name:GeOp5_0, ret_status:success ,tf session: directfd8a41491659c190 ,graph id: 11 [152 ms]
2020-12-16 12:44:34.771236: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: Prepare Graph infershape failed
2020-12-16 12:44:37.771368: F tf_adapter/kernels/geop_npu.cc:570] GeOp5_0GEOP::::DoRunAsync Failed
[Modelarts Service Log]2020-12-16 12:44:54,697 - ERROR - FMK of device1 (pid: [216]) has exited with non-zero code: -6
[Modelarts Service Log]2020-12-16 12:44:54,698 - INFO - Begin destroy FMK processes
[Modelarts Service Log]2020-12-16 12:44:54,698 - INFO - FMK of device1 (pid: [216]) has exited
[Modelarts Service Log]2020-12-16 12:44:54,698 - INFO - End destroy FMK processes
=== begin proc exit    ===
=== begin stop slogd   ===
===   end pro exit     ===
2020-12-16 12:44:55,218 825 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,225 824 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,273 826 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,331 821 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,391 823 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,406 820 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,454 827 PCOMPILE Master process dead. worker process quiting..
2020-12-16 12:44:55,589 822 PCOMPILE Master process dead. worker process quiting..
/usr/local/ma/python3.7/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 35 leaked semaphores to clean up at shutdown
  len(cache))
[Modelarts Service Log]Training end with return code: 250
[Modelarts Service Log]training end at 2020-12-16-12:44:55
[Modelarts Service Log]Training completed.

Comments (33)

dingjiepin created this Bug-Report
dingjiepin set the associated repository to Ascend/modelzoo

Preliminary diagnosis: some operator is failing in the InferShape stage. We will determine which operator after we retrieve the logs.

2020-12-16 12:44:34.771236: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: Prepare Graph infershape failed
2020-12-16 12:44:37.771368: F tf_adapter/kernels/geop_npu.cc:570] GeOp5_0GEOP::::DoRunAsync Failed
zhengtao changed the task status from TODO to Analysing

@zhengtao OK

@dingjiepin We have retrieved the backend logs. The failure is indeed in the shape-inference stage of TensorArrayScatterV3; we will resolve your issue as soon as possible.
Key log lines:

[ERROR] TBE(216,python):2020-12-16-12:44:34.747.102 [ops/built-in/op_proto/util/common_shape_fns.cpp:179][OP_PROTO] Merge:179 OpName:[rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3] "Dimension number of two shapes are not equal: 1 != 2."
[ERROR] TBE(216,python):2020-12-16-12:44:34.747.247 [ops/built-in/op_proto/data_flow_ops.cpp:835][OP_PROTO] TensorArrayScatterInfer:835 OpName:[rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3] "Merge tensorArray element shape[-1] and value subshape[128, 36] failed."
[ERROR] GE(216,python):2020-12-16-12:44:34.747.288 [common/graph/./op_desc.cc:1314]836 CallInferFunc: ErrorNo: -1(failed) rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3 call infer func. ret: 4294967295
[ERROR] GE(216,python):2020-12-16-12:44:34.747.330 [common/graph/./shape_refiner.cc:676]836 InferShapeAndType: ErrorNo: -1(failed) rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3 call infer function failed.
[ERROR] GE(216,python):2020-12-16-12:44:34.747.687 [framework/domi/graph/passes/infershape_pass.cc:36]836 Run: ErrorNo: 1343242270(Prepare Graph infershape failed) infershape failed. node: rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3
[ERROR] GE(216,python):2020-12-16-12:44:34.747.726 [framework/domi/graph/passes/base_pass.cc:88]836 RunPasses: ErrorNo: 1343225860(Internal errors) Failed to process pass InferShapePass on node rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3, result 1343242270, the passes will be terminated immediately.
[ERROR] GE(216,python):2020-12-16-12:44:34.747.772 [framework/domi/graph/passes/base_pass.cc:216]836 RunPassesOneGraph: ErrorNo: 1343242270(Prepare Graph infershape failed) Failed to process passes on node rnn_1/gru1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3 type TensorArrayScatter, error code: 1343242270
[ERROR] GE(216,python):2020-12-16-12:44:34.748.092 [framework/domi/graph/preprocess/graph_preprocess.cc:1609]836 InferShapeForPreprocess: ErrorNo: 1343242270(Prepare Graph infershape failed) Run ge_passes infershape for preprocess failed, ret:1343242270.
[ERROR] GE(216,python):2020-12-16-12:44:34.748.135 [framework/domi/graph/preprocess/graph_preprocess.cc:1241]836 FormatAndShapeProcess: ErrorNo: 1343242270(Prepare Graph infershape failed) Prepare Graph infershape failed
[ERROR] GE(216,python):2020-12-16-12:44:34.748.167 [framework/domi/graph/preprocess/graph_preprocess.cc:1375]836 PrepareDynShape: ErrorNo: 1343242270(Prepare Graph infershape failed) Failed to process Prepare_FormatAndShapeProcess
[EVENT] GE(216,python):2020-12-16-12:44:34.748.199 [framework/domi/graph/manager/graph_manager.cc:570]836 PreRunOptimizeOriginalGraph:[GEPERFTRACE] The time cost of GraphManager::graph_preparer_.PrepareDynShape is [48800] micro second.
[ERROR] GE(216,python):2020-12-16-12:44:34.748.230 [framework/domi/graph/manager/graph_manager.cc:570]836 PreRunOptimizeOriginalGraph: ErrorNo: 1343242270(Prepare Graph infershape failed) Failed to process GraphManager_graph_preparer_.PrepareDynShape
[ERROR] GE(216,python):2020-12-16-12:44:34.748.262 [framework/domi/graph/manager/graph_manager.cc:674]836 PreRun: ErrorNo: 1343242270(Prepare Graph infershape failed) Run PreRunOptimizeOriginalGraph failed for graph:ge_default_20201216124434.
[ERROR] GE(216,python):2020-12-16-12:44:34.771.152 [framework/domi/graph/manager/graph_manager.cc:2542]836 ReturnError: ErrorNo: 1343242270(Prepare Graph infershape failed) PreRun Failed, thread exit...
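To make the "Dimension number of two shapes are not equal: 1 != 2" message above concrete: GE can only merge two shapes of the same rank, with -1 standing for an unknown dimension. The sketch below is an illustrative pure-Python model of that merge rule (`merge_shapes` is a hypothetical name, not the actual GE implementation):

```python
def merge_shapes(a, b):
    """Illustrative model of GE's shape merge: ranks must match,
    and -1 (unknown) is compatible with any concrete dimension."""
    if len(a) != len(b):
        raise ValueError(
            "Dimension number of two shapes are not equal: "
            "%d != %d" % (len(a), len(b)))
    merged = []
    for da, db in zip(a, b):
        if da == -1:
            merged.append(db)          # unknown adopts the known dim
        elif db == -1 or da == db:
            merged.append(da)
        else:
            raise ValueError("Incompatible dims %d vs %d" % (da, db))
    return merged

# A rank-1 element shape merges fine with another rank-1 shape:
print(merge_shapes([-1], [36]))        # [36]
# But the TensorArray's element shape [-1] (rank 1) cannot merge with the
# scattered value subshape [128, 36] (rank 2) -- the case in the log above:
try:
    merge_shapes([-1], [128, 36])
except ValueError as e:
    print(e)
```

In the failing graph the TensorArray carries an unknown rank-1 element shape while the unstacked GRU input carries a rank-2 per-step shape, so the merge aborts before any dimension-by-dimension comparison.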
zhengtao set the assignee to 朱晶晶

@dingjiepin We need the dumped computation graph for further analysis. For how to obtain it, see 老谭's tutorial:
https://gitee.com/ascend/modelzoo/wikis/Loss%E6%98%AF%E6%94%B6%E6%95%9B%E4%BA%86%EF%BC%8C%E7%B2%BE%E5%BA%A6%E4%B8%8D%E5%A4%9F%E6%80%8E%E4%B9%88%E5%8A%9E%EF%BC%9F?sort_id=3148793

@zhengtao Hi, my code uses tf.Session(config=config), with the config variable set according to the migration guide. I made the following attempt:

    # run on npu config
    config = tf.ConfigProto()
    custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
    custom_op.name = "NpuOptimizer"
    custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
    config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
    dump_config = DumpConfig(enable_dump=True, dump_path=TMP_CACHE_DIR, dump_step="0")
    npu_config = NPURunConfig(
        dump_config = dump_config,
        session_config = config
    )
    with tf.Session(config=npu_config) as sess:

which produced the following error:

WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:254: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:254: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:130: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:130: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:140: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /home/work/user-job-dir/code/train.py:140: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
Traceback (most recent call last):
  File "/home/work/user-job-dir/code/train.py", line 263, in <module>
    train(model_type="DIEN", seed=SEED)
  File "/home/work/user-job-dir/code/train.py", line 140, in train
    with tf.Session(config=npu_config) as sess:
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 678, in __init__
    type(config))
TypeError: config must be a tf.ConfigProto, but got <class 'npu_bridge.estimator.npu.npu_config.NPURunConfig'>
Exception ignored in: <function BaseSession.__del__ at 0xffffaac207a0>
Traceback (most recent call last):
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 761, in __del__
    if self._session is not None:
AttributeError: 'Session' object has no attribute '_session'
INFO:Current training job status: Running failed


@dingjiepin The TensorArrayScatter operator has not yet been adapted for dynamic-shape scenarios. The operator engineers will support this feature as soon as possible and help you resolve the issue through a synchronized release; please be patient.
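While waiting for the operator fix, one way to sidestep the dynamic-shape path (a workaround sketch, not an official fix from this thread) is to feed the model fully static shapes: pad every behavior sequence to the fixed maxlen and keep batch_size constant, so the TensorArray element shape is fully known. `pad_batch` below is a hypothetical helper:

```python
import numpy as np

def pad_batch(seqs, maxlen, feat_dim):
    """Pad variable-length sequences into a static [batch, maxlen, feat_dim]
    tensor plus a [batch, maxlen] mask marking the real steps."""
    batch = np.zeros((len(seqs), maxlen, feat_dim), dtype=np.float32)
    mask = np.zeros((len(seqs), maxlen), dtype=np.float32)
    for i, seq in enumerate(seqs):
        n = min(len(seq), maxlen)       # truncate overly long histories
        batch[i, :n] = seq[:n]
        mask[i, :n] = 1.0
    return batch, mask

# two user histories of different lengths, feature dim 36 (as in the logs)
seqs = [np.ones((3, 36)), np.ones((5, 36))]
padded, mask = pad_batch(seqs, maxlen=4, feat_dim=36)
print(padded.shape, mask.sum(axis=1))   # (2, 4, 36) [3. 4.]
```

With inputs shaped this way, the while-loop TensorArrays no longer see rank-ambiguous element shapes, which avoids the merge failure reported above.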

Hi, please provide the dumped graph so we can analyze the inputs of the TensorArrayScatterV3 operator.

@朱晶晶 Hi, our code is based on session.run with tf.Session(config=config). Is there a dump tutorial for that mode? I tried following the reference link in the earlier reply and it failed; the details are in my reply above.
Tutorial link:
https://gitee.com/ascend/modelzoo/wikis/Loss%E6%98%AF%E6%94%B6%E6%95%9B%E4%BA%86%EF%BC%8C%E7%B2%BE%E5%BA%A6%E4%B8%8D%E5%A4%9F%E6%80%8E%E4%B9%88%E5%8A%9E%EF%BC%9F?sort_id=3148793

@neoming Hi, judging from your usage you are in session.run mode; the configuration for that mode is as follows, please check:

/////////////////////////////////begin////////////////////////////////////////////

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True

# enable the dump feature
custom_op.parameter_map["enable_dump"].b = True

# dump output path; with the default system settings, dump data is written to
# /var/log/npu/ide_daemon/dump/{dump_path}
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/tmp")

# which steps to dump; unset or None dumps every step. Separate multiple steps
# with "|", e.g. 0|5|10; "-" specifies a range, e.g. 0|3-5|10
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")

# dump mode; by default only operator outputs are dumped. Operator inputs can
# also be dumped; valid values: input/output/all
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")

with tf.Session(config=config) as sess:
    print(sess.run(cost))
/////////////////////////////////end////////////////////////////////////////////

Hi, the dump files were exported successfully. Settings:

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
custom_op.parameter_map["enable_dump"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes(TMP_CACHE_DIR)
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")

The exported files are fairly large; how should I send them to you? For now I'll try uploading them as a Gitee attachment.

@dingjiepin @朱晶晶 The OBS path is:
obs://model-trainning/DIEN/DIEN-Pycharm/dump/
The files are also uploaded to Gitee:
https://gitee.com/dingjiepin/dien_-model-arts_dump.git

dingjiepin edited the title
dingjiepin edited the description

@dingjiepin Hi, what we need is the dumped computation graph, not the dumped operator data. See:
https://gitee.com/ascend/modelzoo/wikis/Loss%E6%98%AF%E6%94%B6%E6%95%9B%E4%BA%86%EF%BC%8C%E7%B2%BE%E5%BA%A6%E4%B8%8D%E5%A4%9F%E6%80%8E%E4%B9%88%E5%8A%9E%EF%BC%9F?sort_id=3148793

@yanqingshang Hi, is there detailed documentation for this? I tried your example, but the result is still only dumped operator data. Here is what I tried:

    TMP_CACHE_DIR = '/cache/data'
    os.makedirs(TMP_CACHE_DIR)
    # run on npu config
    config = tf.ConfigProto()
    custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
    custom_op.name = "NpuOptimizer"
    custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
    config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # turn off the remap pass
    custom_op.parameter_map["enable_dump"].b = True
    custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes(TMP_CACHE_DIR)
    custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
    custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
    print("[INFO]before session.........................................................")

    with tf.Session(config=config) as sess:
        print("[INFO] session begin.........................................................")
        train_data = DataIterator(train_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen, shuffle_each_epoch=False)
        test_data = DataIterator(test_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen)
        n_uid, n_mid, n_cat = train_data.get_n()
        if model_type == 'DNN':
            model = Model_DNN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'PNN':
            model = Model_PNN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'Wide':
            model = Model_WideDeep(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIN':
            model = Model_DIN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIN-V2-gru-att-gru':
            model = Model_DIN_V2_Gru_att_Gru(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIN-V2-gru-gru-att':
            model = Model_DIN_V2_Gru_Gru_att(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIN-V2-gru-qa-attGru':
            model = Model_DIN_V2_Gru_QA_attGru(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIN-V2-gru-vec-attGru':
            model = Model_DIN_V2_Gru_Vec_attGru(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIEN':
            model = Model_DIN_V2_Gru_Vec_attGru_Neg(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        else:
            print("Invalid model_type: %s" % model_type)
            return
        # model = Model_DNN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        print("[INFO] before copy.........................................................")
        mox.file.copy_parallel(TMP_CACHE_DIR, "obs://model-trainning/DIEN/DIEN-Pycharm/dump/")
        print("[INFO] after copy.........................................................")
        sys.stdout.flush()
        print('                                                                                      test_auc: %.4f ---- test_loss: %.4f ---- test_accuracy: %.4f ---- test_aux_loss: %.4f' % eval(sess, test_data, model, best_model_path))
        sys.stdout.flush()

        start_time = time.time()
        iter = 0
        lr = 0.001
        for itr in range(3):
            loss_sum = 0.0
            accuracy_sum = 0.
            aux_loss_sum = 0.
            for src, tgt in train_data:
                uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, maxlen, return_neg=True)
                print("*******************************************************")
                for i in prepare_data(src, tgt, maxlen, return_neg=True):
                    print(i.shape)
                print("*******************************************************")
                loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
                loss_sum += loss
                accuracy_sum += acc
                aux_loss_sum += aux_loss
                iter += 1
                sys.stdout.flush()
                if (iter % test_iter) == 0:
                    print('iter: %d ----> train_loss: %.4f ---- train_accuracy: %.4f ---- tran_aux_loss: %.4f' % \
                                          (iter, loss_sum / test_iter, accuracy_sum / test_iter, aux_loss_sum / test_iter))
                    print('                                                                                          test_auc: %.4f ----test_loss: %.4f ---- test_accuracy: %.4f ---- test_aux_loss: %.4f' % eval(sess, test_data, model, best_model_path))
                    loss_sum = 0.0
                    accuracy_sum = 0.0
                    aux_loss_sum = 0.0
                if (iter % save_iter) == 0:
                    print('save model iter: %d' %(iter))
                    model.save(sess, model_path+"--"+str(iter))
            lr *= 0.5
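
Incidentally, the long if/elif chain that selects the model class could be collapsed into a lookup table. A minimal sketch follows, with toy stand-in classes so it runs on its own (the real constructors such as Model_DNN and Model_DIN_V2_Gru_Vec_attGru_Neg live in the repo's model.py):

```python
# Toy stand-ins for illustration; the real script imports these from model.py.
class Model_DNN:
    def __init__(self, *ctor_args):
        self.ctor_args = ctor_args

class Model_DIN_V2_Gru_Vec_attGru_Neg(Model_DNN):
    pass

MODEL_BUILDERS = {
    'DNN': Model_DNN,
    'DIEN': Model_DIN_V2_Gru_Vec_attGru_Neg,
    # ... the remaining model types map the same way
}

def build_model(model_type, *ctor_args):
    """Look up the constructor for model_type; raise on unknown names."""
    if model_type not in MODEL_BUILDERS:
        raise ValueError("Invalid model_type: %s" % model_type)
    return MODEL_BUILDERS[model_type](*ctor_args)

model = build_model('DIEN', 18, 36, 36)  # placeholder hyperparameters
```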

@dingjiepin Hello, the wiki at the link below describes how to dump the computation graph. Open the link, search for "在NPU上dump出参数文件和计算图", and choose the graph-dump method for your platform. Dumping the graph alone is enough; your script only needs to adapt the code highlighted in the red box in the wiki.
https://gitee.com/ascend/modelzoo/wikis/Loss%E6%98%AF%E6%94%B6%E6%95%9B%E4%BA%86%EF%BC%8C%E7%B2%BE%E5%BA%A6%E4%B8%8D%E5%A4%9F%E6%80%8E%E4%B9%88%E5%8A%9E%EF%BC%9F?sort_id=3148793

@朱晶晶 Hello, as I said in my earlier replies, I already tried the document you sent, and it failed. I made this clear above: my code is based on session.run, while the tutorial is based on Estimator. What I am asking now is whether there is a detailed tutorial for session.run. I am not sure how to put this more clearly; you have now sent the same tutorial three times.

@yanqingshang Hello, is there a detailed tutorial? I tried again, and the dumped computation graph was still not produced. I would also like to ask at which point in the code the dump actually happens. Our code has not yet run successfully on the NPU, and in the examples the dump is added after train(), so I suspect that is why the graph cannot be dumped.

@dingjiepin Could you provide your network code to us?

@dingjiepin Does your training script contain a line like "os.environ['DUMP_GE_GRAPH'] = '2'"?

@朱晶晶 Hello, yes, it is added before model training, as follows:

if __name__ == '__main__':
    if len(sys.argv) == 4:
        SEED = int(sys.argv[3])
    else:
        SEED = 3
    os.environ['DUMP_GE_GRAPH'] = '2'
    tf.set_random_seed(SEED)
    numpy.random.seed(SEED)
    random.seed(SEED)
    train(model_type="DIEN", seed=SEED)
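
For reference: with DUMP_GE_GRAPH=2 set, the GE runtime writes the compiled graphs as ge_proto_*.txt files into the process working directory under default settings (per the linked wiki). On ModelArts those files are lost when the job ends unless they are copied out, so a small collector can help. A hypothetical stdlib-only sketch; the destination directory would then be uploaded with mox.file.copy_parallel as in the script above:

```python
import glob
import os
import shutil

def collect_ge_graphs(dst_dir, src_dir="."):
    """Copy GE graph dumps (ge_proto_*.txt) from src_dir into dst_dir.

    Hypothetical helper: assumes DUMP_GE_GRAPH=2 made the runtime write
    its graph files into the working directory, the default location.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, "ge_proto_*.txt"))):
        shutil.copy(path, dst_dir)
        copied.append(os.path.basename(path))
    return copied
```

After training (or after the failure), calling collect_ge_graphs('/cache/graphs') followed by a mox.file.copy_parallel of '/cache/graphs' to an OBS path would preserve the graphs.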

@dingjiepin OK. Could you provide us with your code?

@朱晶晶 Hello, the code has been uploaded, see:
https://gitee.com/dingjiepin/dien_-model-arts_dump/tree/master/code
The data exceeds the size limit, so it was uploaded to Baidu Cloud at the link below; place it in the same directory as the code:
https://pan.baidu.com/s/1rRpCt3AIfaaaehhtZPdP2g extraction code: 4kby

@dingjiepin OK, we will reproduce your problem locally as soon as possible.

@朱晶晶
Hello, our group's model has also run into this problem, and we have filed an issue: #I2BMN6:[华师大] 训练时遇到GeOp5_0GEOP::::DoRunAsync Failed

@dingjiepin Hello, after reproducing locally, the error turned out to be a bug in the TensorArray operator plugin, which we have fixed in the latest version.
After fixing that bug we hit a dynamic-shape adaptation issue with the TensorArrayScatterV3 operator. We are working on the adaptation and will notify you as soon as the fixed version is released.

朱晶晶 changed the task status from Analysing to WIP
朱晶晶 set the planned due date to 2021-02-28
朱晶晶 set the planned start date to 2021-01-01

@hxqc Hello, based on the current diagnostic information, your issue is probably not the same as this one.

@朱晶晶
OK, thank you very much.
However, earlier today, after I enabled mixed precision for the model, an unknown-shape problem similar to the one above appeared. Could you take a look at whether it is related to this issue? I saw that you replied to an unknown-shape problem in another issue and resolved it with an attachment, so I would like to trouble you once more.
(screenshot)
Reply link: #I2BMN6:[华师大] 训练时遇到GeOp5_0GEOP::::DoRunAsync Failed#note_3992097

@hxqc Hello, the problems are different. Please be patient; our committer will analyze your issue as soon as possible.

朱晶晶 changed the task type from Bug-Report to Requirement

Hello, the latest community version has been released; you can try the newest version:
https://ascend.huawei.com/#/software/cann/community

@朱晶晶 Hello, is the CANN software package only available for Linux? I run the training script from PyCharm on Windows, and the ModelArts configuration is tf1.15-python3.7-aarch64. Is there a CANN package suited to this environment?

@fyh-1 Hello, the link below explains how to build a custom ModelArts image:
https://gitee.com/ascend/modelzoo/wikis/ModelArts%E8%87%AA%E5%AE%9A%E4%B9%89NPU%E8%AE%AD%E7%BB%83%E7%8E%AF%E5%A2%83%E9%95%9C%E5%83%8F%E6%89%8B%E5%86%8C%E3%80%90%E5%9F%BA%E7%A1%80%E7%89%88%E3%80%91?sort_id=3205360

@朱晶晶 So it is either a Linux operating system or a custom image? Also, how can I tell whether a device is an Ascend AI device, and which class of Ascend AI hardware it belongs to?

朱晶晶 changed the task status from WIP to Feedback

@fyh-1 Hello, Ascend-related materials are available at https://developer.huaweicloud.com/techfield/ascend.html

王位 changed the task status from Feedback to DONE
吴定远 changed the associated repository from Ascend/modelzoo-his to Ascend/modelzoo
