Ascend / modelzoo

(华师大) struct-vrnn: the model fails when run on the NPU, and I can't tell what is causing the error

DONE
Bug-Report
Created 2020-12-24 20:44

2020-12-24 12:30:33.131444: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-24 12:30:33.131970: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_19 success. [0 ms]
2020-12-24 12:30:33.132101: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-24 12:30:33.132137: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is False
2020-12-24 12:30:33.132146: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-24 12:30:33.132157: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_finalize_fn_156]
2020-12-24 12:30:33.132165: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_finalize_fn_25]
2020-12-24 12:30:33.132172: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_generator_next_fn_151]
2020-12-24 12:30:33.132180: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_get_iterator_id_fn_140]
2020-12-24 12:30:33.132187: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_flat_map_flat_map_fn_159]
2020-12-24 12:30:33.132193: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_generator_next_fn_20]
2020-12-24 12:30:33.132202: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_get_iterator_id_fn_9]
2020-12-24 12:30:33.132209: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [_inference_Dataset_interleave<class 'functools.partial'>_112]
2020-12-24 12:30:33.132253: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [_inference_Dataset_interleave<class 'functools.partial'>_243]
2020-12-24 12:30:33.132295: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_flat_map_flat_map_fn_28]
2020-12-24 12:30:33.132359: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_37 begin.
2020-12-24 12:30:33.132366: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is False
2020-12-24 12:30:33.132371: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-24 12:30:33.132465: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 0, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-24 12:30:33.132514: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]
2020-12-24 12:30:33.132809: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 13
2020-12-24 12:30:33.132837: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 41, max nodes count: 2 in subgraph: GeOp37_0 minGroupSize: 1
2020-12-24 12:30:33.132951: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:641] TFadapter merge clusters cost: [0 ms]
2020-12-24 12:30:33.132984: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_37 markForPartition success.
2020-12-24 12:30:33.133422: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1494] subgraphNum: 1
2020-12-24 12:30:33.134950: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1943] OMPartition subgraph_37 SubgraphsInFunctions success.
2020-12-24 12:30:33.134989: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1954] OMPartition subgraph_37 success. [2 ms]
2020-12-24 12:30:33.135059: I tf_adapter/optimizers/dp_tf_ge_conversion_pass.cc:917] DpTfToGEConversionPassImpl::RunPass, enable data preproc is false
2020-12-24 12:30:33.141215: I tf_adapter/optimizers/add_input_pass.cc:96] job is localhost Skip the optimizer : AddInputPass.
2020-12-24 12:30:33.141959: I tf_adapter/kernels/geop_npu.cc:175] [GEOP] Begin GeOp initialize.
2020-12-24 12:30:33.142022: I tf_adapter/util/ge_plugin.cc:69] [GePlugin] Ge has already initialized
2020-12-24 12:30:33.142033: I tf_adapter/kernels/geop_npu.cc:203] [GEOP] GePlugin init success
2020-12-24 12:30:33.142053: I tf_adapter/kernels/geop_npu.cc:209] [GEOP] GeOp Initialize success, cost: [0 ms]
2020-12-24 12:30:33.142496: I tf_adapter/kernels/geop_npu.cc:364] [GEOP] get tf session direct07e72e85f31f06c1 from session handle.
2020-12-24 12:30:33.142572: I tf_adapter/kernels/geop_npu.cc:375] [GEOP] Node name: GeOp37_0 , tf session: direct07e72e85f31f06c1
2020-12-24 12:30:33.142586: I tf_adapter/util/session_manager.cc:50] tf session direct07e72e85f31f06c1 get ge session success.
2020-12-24 12:30:33.142593: I tf_adapter/kernels/geop_npu.cc:382] [GEOP] tf session: direct07e72e85f31f06c1 get ge session success.
2020-12-24 12:30:33.142602: I tf_adapter/kernels/geop_npu.cc:388] [GEOP] Begin GeOp::ComputeAsync, kernel_name:GeOp37_0, num_inputs:0, num_outputs:13
2020-12-24 12:30:33.142612: I tf_adapter/kernels/geop_npu.cc:251] [GEOP] tf session direct07e72e85f31f06c1, graph id: 41 does not build yet, no need to check rebuild
2020-12-24 12:30:33.142953: I tf_adapter/util/infershape_util.cc:346] InferShapeUtil::InferShape
2020-12-24 12:30:33.142967: I tf_adapter/util/infershape_util.cc:84] The signature name of FunctionDef is GeOp37_0.
2020-12-24 12:30:33.143368: I tf_adapter/util/infershape_util.cc:96] InstantiateFunction GeOp37_0 success.
2020-12-24 12:30:33.143613: I tf_adapter/util/infershape_util.cc:101] ConvertNodeDefsToGraph GeOp37_0 success.
2020-12-24 12:30:33.143785: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SOURCE is null.
2020-12-24 12:30:33.143796: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SINK is null.
2020-12-24 12:30:33.143936: I tf_adapter/util/infershape_util.cc:395] InferShapeUtil::InferShape success
2020-12-24 12:30:33.144625: I tf_adapter/kernels/geop_npu.cc:440] [GEOP] In GEOP computeAsync, kernel_name:GeOp37_0 ,TFadapter cost time: [2 ms]
2020-12-24 12:30:33.144640: I tf_adapter/kernels/geop_npu.cc:442] [GEOP] TFadpter process graph success, GE parser begin, kernel_name:GeOp37_0 ,tf session: direct07e72e85f31f06c1 ,graph id :41
2020-12-24 12:30:33.154132: I tf_adapter/kernels/geop_npu.cc:508] [GEOP] Tensorflow graph parse to ge graph success, kernel_name:GeOp37_0 ,tf session: direct07e72e85f31f06c1 ,graph id: 41
2020-12-24 12:30:33.154315: I tf_adapter/kernels/geop_npu.cc:539] [GEOP] Add graph to ge session success, kernel_name:GeOp37_0 ,tf session: direct07e72e85f31f06c1 ,graph id:41
2020-12-24 12:30:33.154448: I tf_adapter/kernels/geop_npu.cc:580] [GEOP] Call ge session RunGraphAsync, kernel_name:GeOp37_0 ,tf session: direct07e72e85f31f06c1 ,graph id: 41
2020-12-24 12:30:33.154531: I tf_adapter/kernels/geop_npu.cc:593] [GEOP] End GeOp::ComputeAsync, kernel_name:GeOp37_0, ret_status:success ,tf session: direct07e72e85f31f06c1 ,graph id: 41 [11 ms]
2020-12-24 12:30:33.281587: I tf_adapter/kernels/geop_npu.cc:76] BuildOutputTensorInfo, num_outputs:13
2020-12-24 12:30:33.281664: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:0, total_bytes:1, shape:, tensor_ptr:281464239413120, output281464239311184
2020-12-24 12:30:33.281675: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:1, total_bytes:1, shape:, tensor_ptr:281464239320640, output281464239445200
2020-12-24 12:30:33.281684: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:2, total_bytes:1, shape:, tensor_ptr:281464239460224, output281464239336032
2020-12-24 12:30:33.281691: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:3, total_bytes:1, shape:, tensor_ptr:281464239299968, output281464239415056
2020-12-24 12:30:33.281699: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:4, total_bytes:1, shape:, tensor_ptr:281464239362176, output281464239335472
2020-12-24 12:30:33.281705: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:5, total_bytes:1, shape:, tensor_ptr:281464239346240, output281464239461840
2020-12-24 12:30:33.281712: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:6, total_bytes:1, shape:, tensor_ptr:281464239345920, output281464239312272
2020-12-24 12:30:33.281718: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:7, total_bytes:1, shape:, tensor_ptr:281464239436288, output281464239320416
2020-12-24 12:30:33.281725: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:8, total_bytes:1, shape:, tensor_ptr:281464239311360, output281464239302368
2020-12-24 12:30:33.281733: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:9, total_bytes:1, shape:, tensor_ptr:281464239311424, output281464239306528
2020-12-24 12:30:33.281740: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:10, total_bytes:1, shape:, tensor_ptr:281464239383488, output281464239330192
2020-12-24 12:30:33.281747: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:11, total_bytes:1, shape:, tensor_ptr:281464239349696, output281464239298672
2020-12-24 12:30:33.281753: I tf_adapter/kernels/geop_npu.cc:103] BuildOutputTensorInfo, output index:12, total_bytes:1, shape:, tensor_ptr:281464239349888, output281464239307488
2020-12-24 12:30:33.281761: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp37_0[ 127313us]
2020-12-24 12:30:34.500281: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-24 12:30:34.500728: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_20 success. [0 ms]
2020-12-24 12:30:34.500754: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-24 12:30:34.500787: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is False
2020-12-24 12:30:34.500797: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-24 12:30:34.500811: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_finalize_fn_156]
2020-12-24 12:30:34.500819: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_finalize_fn_25]
2020-12-24 12:30:34.500826: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_generator_next_fn_151]
2020-12-24 12:30:34.500835: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_get_iterator_id_fn_140]
2020-12-24 12:30:34.500842: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_flat_map_flat_map_fn_159]
2020-12-24 12:30:34.500849: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_generator_next_fn_20]
2020-12-24 12:30:34.500857: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_from_generator_get_iterator_id_fn_9]
2020-12-24 12:30:34.500864: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [_inference_Dataset_interleave<class 'functools.partial'>_112]
2020-12-24 12:30:34.500911: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [_inference_Dataset_interleave<class 'functools.partial'>_243]
2020-12-24 12:30:34.500957: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:108] Mark function as no need optimize [__inference_Dataset_flat_map_flat_map_fn_28]
2020-12-24 12:30:34.501019: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_39 begin.
2020-12-24 12:30:34.501028: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is False
2020-12-24 12:30:34.501034: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-24 12:30:34.501145: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 0, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-24 12:30:34.501214: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]
2020-12-24 12:30:34.501741: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2020-12-24 12:30:34.501757: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 54, max nodes count: 52 in subgraph: GeOp39_0 minGroupSize: 1
2020-12-24 12:30:34.501783: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_39 markForPartition success.
2020-12-24 12:30:34.502339: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1494] subgraphNum: 1
2020-12-24 12:30:34.504182: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1943] OMPartition subgraph_39 SubgraphsInFunctions success.
2020-12-24 12:30:34.504213: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1954] OMPartition subgraph_39 success. [3 ms]
2020-12-24 12:30:34.504274: I tf_adapter/optimizers/dp_tf_ge_conversion_pass.cc:917] DpTfToGEConversionPassImpl::RunPass, enable data preproc is false
2020-12-24 12:30:34.508534: I tf_adapter/optimizers/add_input_pass.cc:96] job is localhost Skip the optimizer : AddInputPass.
2020-12-24 12:30:34.509143: I tf_adapter/kernels/geop_npu.cc:175] [GEOP] Begin GeOp initialize.
2020-12-24 12:30:34.509209: I tf_adapter/util/ge_plugin.cc:69] [GePlugin] Ge has already initialized
2020-12-24 12:30:34.509221: I tf_adapter/kernels/geop_npu.cc:203] [GEOP] GePlugin init success
2020-12-24 12:30:34.509242: I tf_adapter/kernels/geop_npu.cc:209] [GEOP] GeOp Initialize success, cost: [0 ms]
2020-12-24 12:30:34.509348: I tf_adapter/kernels/geop_npu.cc:364] [GEOP] get tf session direct07e72e85f31f06c1 from session handle.
2020-12-24 12:30:34.509408: I tf_adapter/kernels/geop_npu.cc:375] [GEOP] Node name: GeOp39_0 , tf session: direct07e72e85f31f06c1
2020-12-24 12:30:34.509418: I tf_adapter/util/session_manager.cc:50] tf session direct07e72e85f31f06c1 get ge session success.
2020-12-24 12:30:34.509424: I tf_adapter/kernels/geop_npu.cc:382] [GEOP] tf session: direct07e72e85f31f06c1 get ge session success.
2020-12-24 12:30:34.509432: I tf_adapter/kernels/geop_npu.cc:388] [GEOP] Begin GeOp::ComputeAsync, kernel_name:GeOp39_0, num_inputs:0, num_outputs:0
2020-12-24 12:30:34.509440: I tf_adapter/kernels/geop_npu.cc:251] [GEOP] tf session direct07e72e85f31f06c1, graph id: 51 does not build yet, no need to check rebuild
2020-12-24 12:30:34.509513: I tf_adapter/util/infershape_util.cc:346] InferShapeUtil::InferShape
2020-12-24 12:30:34.509524: I tf_adapter/util/infershape_util.cc:84] The signature name of FunctionDef is GeOp39_0.
2020-12-24 12:30:34.510176: I tf_adapter/util/infershape_util.cc:96] InstantiateFunction GeOp39_0 success.
2020-12-24 12:30:34.510538: I tf_adapter/util/infershape_util.cc:101] ConvertNodeDefsToGraph GeOp39_0 success.
2020-12-24 12:30:34.510848: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SOURCE is null.
2020-12-24 12:30:34.510864: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node _SINK is null.
2020-12-24 12:30:34.511117: W tf_adapter/util/infershape_util.cc:304] The InferenceContext of node init_2 is null.
2020-12-24 12:30:34.511128: I tf_adapter/util/infershape_util.cc:395] InferShapeUtil::InferShape success
2020-12-24 12:30:34.512295: I tf_adapter/kernels/geop_npu.cc:440] [GEOP] In GEOP computeAsync, kernel_name:GeOp39_0 ,TFadapter cost time: [2 ms]
2020-12-24 12:30:34.512318: I tf_adapter/kernels/geop_npu.cc:442] [GEOP] TFadpter process graph success, GE parser begin, kernel_name:GeOp39_0 ,tf session: direct07e72e85f31f06c1 ,graph id :51
2020-12-24 12:30:34.525710: I tf_adapter/kernels/geop_npu.cc:508] [GEOP] Tensorflow graph parse to ge graph success, kernel_name:GeOp39_0 ,tf session: direct07e72e85f31f06c1 ,graph id: 51
2020-12-24 12:30:34.525864: I tf_adapter/kernels/geop_npu.cc:539] [GEOP] Add graph to ge session success, kernel_name:GeOp39_0 ,tf session: direct07e72e85f31f06c1 ,graph id:51
2020-12-24 12:30:34.526073: I tf_adapter/kernels/geop_npu.cc:580] [GEOP] Call ge session RunGraphAsync, kernel_name:GeOp39_0 ,tf session: direct07e72e85f31f06c1 ,graph id: 51
2020-12-24 12:30:34.526140: I tf_adapter/kernels/geop_npu.cc:593] [GEOP] End GeOp::ComputeAsync, kernel_name:GeOp39_0, ret_status:success ,tf session: direct07e72e85f31f06c1 ,graph id: 51 [16 ms]
2020-12-24 12:30:35.055527: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: failed
2020-12-24 12:30:38.055686: F tf_adapter/kernels/geop_npu.cc:570] GeOp39_0GEOP::::DoRunAsync Failed
2020-12-24 12:30:48,736 1025 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:48,737 1024 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:48,780 1028 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:48,806 1026 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:48,806 1029 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:48,850 1030 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:48,888 1031 PCOMPILE Master process dead. worker process quiting..
2020-12-24 12:30:49,036 1027 PCOMPILE Master process dead. worker process quiting..
/usr/local/ma/python3.7/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 35 leaked semaphores to clean up at shutdown
len(cache))
[Modelarts Service Log]2020-12-24 12:30:49,367 - ERROR - FMK of device1 (pid: [413]) has exited with non-zero code: -6
[Modelarts Service Log]2020-12-24 12:30:49,368 - INFO - Begin destroy FMK processes
[Modelarts Service Log]2020-12-24 12:30:49,368 - INFO - FMK of device1 (pid: [413]) has exited
[Modelarts Service Log]2020-12-24 12:30:49,368 - INFO - End destroy FMK processes
=== begin proc exit ===
=== begin stop slogd ===
=== end pro exit ===
[Modelarts Service Log]Training end with return code: 250
[Modelarts Service Log]training end at 2020-12-24-12:30:50
[Modelarts Service Log]Training completed.
[ModelArts Service Log]modelarts-pipe: total length: 127431
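
Most of the log above is `I` (info) output, so the few warning and fatal lines are easy to miss. A minimal sketch for filtering them out, assuming the TensorFlow-style `<date> <time>: <severity> <file>:<line>] <message>` layout seen in this log; the helper name `non_info_lines` is hypothetical and the sample lines are copied from the log above:

```python
import re

# Matches lines such as:
# 2020-12-24 12:30:35.055527: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: failed
LOG_LINE = re.compile(r"^(\S+ \S+): ([IWEF]) ([^\]]+)\] (.*)$")


def non_info_lines(log_text):
    """Return (severity, source, message) for every non-info line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group(2) != "I":
            hits.append((match.group(2), match.group(3), match.group(4)))
    return hits


sample = """\
2020-12-24 12:30:34.526140: I tf_adapter/kernels/geop_npu.cc:593] [GEOP] End GeOp::ComputeAsync, kernel_name:GeOp39_0, ret_status:success ,tf session: direct07e72e85f31f06c1 ,graph id: 51 [16 ms]
2020-12-24 12:30:35.055527: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: failed
2020-12-24 12:30:38.055686: F tf_adapter/kernels/geop_npu.cc:570] GeOp39_0GEOP::::DoRunAsync Failed
"""

for severity, source, message in non_info_lines(sample):
    print(severity, source, message)
```

On this log the filter leaves the `Unavailable: failed` warning and the fatal `DoRunAsync Failed` line for graph `GeOp39_0`, which is where the run stops. The `_SOURCE`/`_SINK` warnings also pass the filter but appear in the earlier, successful graph as well.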

Attachments
LiuZhenyu 2020-12-30 21:07

Comments (42)

Allthingsone created the Bug-Report
Allthingsone set the linked repository to Ascend/modelzoo

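The issue has been logged; we will notify you as soon as there is progress.

This is a follow-up to
#I2AMBS: (华师大) Model: struct-vrnn. Training from the PyCharm plugin shows Unknown command line flag 'data_url'
So it still fails after setting DEVICE_ID to 1? Please ask the relevant project manager which DEVICE_ID was allocated to you, then try again.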

曹仁平 set the assignee to 曹仁平
曹仁平 changed the task status from TODO to Analysing

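We believe the earlier problem was that the code did not define a matching flag for the data_url command-line argument; adding one fixed it. I don't think this one is a DEVICE_ID problem.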
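
The fix described here can be sketched as follows. This is only an illustration: the original script presumably uses absl-style flags (the `Unknown command line flag` message suggests so), while the sketch uses the standard-library `argparse` as a stand-in, and the bucket paths are hypothetical. The point is that the two arguments ModelArts appends to the command, `--data_url` and `--train_url`, must be declared by the script.

```python
import argparse


def parse_platform_args(argv):
    """Accept the arguments the platform appends, ignoring any others."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_url", default="", help="input data location")
    parser.add_argument("--train_url", default="", help="output location")
    # parse_known_args tolerates extra flags instead of aborting on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Hypothetical values; a real job receives them from the platform.
args = parse_platform_args(
    ["--data_url", "s3://bucket/data/", "--train_url", "s3://bucket/output/"]
)
print(args.data_url, args.train_url)
```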

So is this problem definitely caused by device_id?

Please reply as soon as possible, thanks.

@曹仁平 Please reply as soon as possible, thanks.

@Allthingsone On the DEVICE_ID question: someone else's job worked after they changed it, but yours did not, so I'll check with someone else to confirm.

@曹仁平 OK, I'll try again on my side.

@曹仁平 But the log contains no error like "get logic device id by DEVICE_ID failed". Is it really a problem of that kind?

@Allthingsone The error message is:

ERROR - FMK of device1 (pid: [413]) has exited with non-zero code: -6
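
A note on reading that exit code, as a small sketch. It assumes the service log follows the common convention (used by Python's `subprocess`, for example) that a negative code means the process was terminated by that signal number; the helper name `describe_exit_code` is hypothetical. On Linux, signal 6 is SIGABRT, which fits a process that aborted after the fatal `F` log line rather than pointing at any particular device setting.

```python
import signal


def describe_exit_code(code):
    """Describe an exit code, treating a negative value as 'killed by signal -code'."""
    if code < 0:
        return "terminated by " + signal.Signals(-code).name
    return "exited with status " + str(code)


print(describe_exit_code(-6))
```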

OK, so what exactly should I change?

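@Allthingsone
export DEVICE_ID=n
where n is between 0 and 7; if one value doesn't work, try another. Doing it this way is a bit tedious, so I'll discuss a better strategy with colleagues.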

OK, thanks.

@Allthingsone
You're welcome. Give it a try first, and please let me know whether or not it works.

@曹仁平 I added os.system("export DEVICE_ID=n"), but none of the values 0-7 works; the error is still the same. [screenshot]
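
One thing worth checking in that attempt: `os.system("export ...")` runs the export in a short-lived child shell, so the variable never reaches the Python process that starts training. A minimal sketch of the difference, assuming a POSIX shell; the value "0" is a hypothetical placeholder, and whether DEVICE_ID is the actual cause of this failure is not established in the thread.

```python
import os
import subprocess
import sys

# Start from a clean state so the comparison is unambiguous.
os.environ.pop("DEVICE_ID", None)

# The export happens inside a child shell and is lost when that shell exits.
os.system("export DEVICE_ID=0")
after_system = os.environ.get("DEVICE_ID")

# Setting os.environ changes this process and every child it starts afterwards.
os.environ["DEVICE_ID"] = "0"  # hypothetical value, use the device allocated to the job
child_view = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('DEVICE_ID'))"],
    capture_output=True,
    text=True,
).stdout.strip()

print("after os.system:", after_system)
print("seen by a child process:", child_view)
```

If the variable is needed at all, it would have to be set through `os.environ` before the framework is initialized, or exported in the shell that launches the script.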

@Allthingsone OK. Do you have any other logs? I'll ask someone else to take a look.

@曹仁平 Log link: https://pan.baidu.com/s/1iIHG_ACyXBdR20wDPwcCZA
Extraction code: opgs
I ran a new training job:
trainjob-vrnn12-30 | jobf544b020

@Allthingsone

{
    "status": "completed",
    "group_count": "1",
    "group_list": [
        {
            "group_name": "job-trainjob-vrnn12-30",
            "device_count": "1",
            "instance_count": "1",
            "instance_list": [
                {
                    "pod_name": "jobf544b020-job-trainjob-vrnn12-30-0",
                    "server_id": "192.168.0.222",
                    "devices": [
                        {
                            "device_id": "2",
                            "device_ip": "192.3.166.96"
                        }
                    ]
                }
            ]
        }
    ]
}
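
For reference, the device actually allocated to the job can be read from a table shaped like the one above. A minimal sketch: the helper name `allocated_device_ids` is hypothetical, and the JSON is copied from this comment rather than loaded from a file, since the thread does not say where the table is stored.

```python
import json


def allocated_device_ids(table):
    """Collect every device_id from a device table shaped like the JSON above."""
    ids = []
    for group in table.get("group_list", []):
        for instance in group.get("instance_list", []):
            for device in instance.get("devices", []):
                ids.append(device["device_id"])
    return ids


table_text = """
{
    "status": "completed",
    "group_count": "1",
    "group_list": [
        {
            "group_name": "job-trainjob-vrnn12-30",
            "device_count": "1",
            "instance_count": "1",
            "instance_list": [
                {
                    "pod_name": "jobf544b020-job-trainjob-vrnn12-30-0",
                    "server_id": "192.168.0.222",
                    "devices": [
                        {"device_id": "2", "device_ip": "192.3.166.96"}
                    ]
                }
            ]
        }
    ]
}
"""

print(allocated_device_ids(json.loads(table_text)))
```

For this job the table lists a single device with `device_id` "2".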

The log itself doesn't show the problem; I'll pull the backend logs first and take a look.

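曹仁平 changed the task status from Analysing to TODO
zhengtao changed the assignee from 曹仁平 to unassigned
zhengtao set the assignee to 张韦全
张韦全 changed the task status from TODO to Analysing
张韦全 added the bug label
张韦全 added 张韦全 as a collaborator
张韦全 changed the assignee from 张韦全 to LiuZhenyu
张韦全 removed 张韦全 as a collaborator
LiuZhenyu uploaded the attachment libopsproto.so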

@Allthingsone This looks very similar to a shape-inference error in diag_part_fusion that we diagnosed earlier. Please replace the file below with the .so from the attachment and rerun the program; back up the original before replacing it. /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so

@LiuZhenyu By "replace", what exactly should the file in the attachment be swapped with?

@Allthingsone Use the .so from the attachment to replace the existing /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so on the system; back it up before replacing it.

@LiuZhenyu OK

@Allthingsone So I should replace the file on the server over ssh? And how do I back it up? (I haven't done this before and don't quite understand it, sorry.)

@Allthingsone
To back up: cp /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so.bak
To replace: transfer the attached file to the Linux server over sftp and put it at the path above. Using MobaXterm as an example:
[screenshot]
Please give it a try and leave a comment if you run into problems.
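
The backup-then-replace steps above can also be scripted once the attachment is on the server. A minimal sketch using only the standard library; the helper name `backup_and_replace` and the download location in the usage note are hypothetical, and the caller needs write permission on the target directory.

```python
import shutil
from pathlib import Path


def backup_and_replace(target, replacement):
    """Copy target to '<target>.bak', then overwrite target with replacement."""
    target = Path(target)
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)       # keep the original so it can be restored
    shutil.copy2(replacement, target)  # install the attached library
    return backup
```

A call would look like `backup_and_replace("/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so", "/tmp/libopsproto.so")`, where `/tmp/libopsproto.so` stands for wherever the attachment was uploaded. Copying the `.bak` file back over the target undoes the change.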

@LiuZhenyu Is the server IP the 183.129.171.130 shown in the screenshot? I wasn't allocated a server when I trained the model.

@LiuZhenyu I can't connect to the server at 183.129.171.130. [screenshot]

@Allthingsone Has the problem been resolved? Do you have any remaining questions about the fix?

@LiuZhenyu Not at the moment.

@Allthingsone A new community version has now been released, and you can debug with it. The new version fixes the problem described here.
https://ascend.huawei.com/#/software/cann

@LiuZhenyu The ECS server I applied for doesn't seem to have the path /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so either.

@LiuZhenyu Which path on the ECS server should the downloaded Dockerfile be uploaded to?

@Allthingsone Hi, you can try the images we have published.
The community version has been released: https://ascend.huawei.com/#/software/cann/community
For ModelArts, you can follow this guide:
https://gitee.com/ascend/modelzoo/wikis/%E5%8D%8E%E4%B8%BA%E4%BA%91ModelArts%E8%87%AA%E5%AE%9A%E4%B9%89%E9%95%9C%E5%83%8F%E8%AE%AD%E7%BB%83%E5%8A%9F%E8%83%BD%E6%93%8D%E4%BD%9C%E6%8C%87%E5%8D%97?sort_id=3154815
The latest images are listed here:
https://gitee.com/ascend/modelzoo/wikis/Modelarts%E5%85%AC%E5%BC%80%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8%E4%BF%A1%E6%81%AF?sort_id=3413562

@张韦全 After replacing the .so file, running from PyCharm shows:
2021/01/22 15:50:56 Current training job status: Initializing
2021/01/22 15:50:58 Current training job status: Deploying
2021/01/22 15:51:00 Current training job status: Initializing
2021/01/22 15:51:33 Current training job status: Running
do nothing
[Modelarts Service Log]user: uid=0(root) gid=0(root) groups=0(root)
[Modelarts Service Log]pwd: /home/work
[Modelarts Service Log]app_url: obs://model-train-fzk/struct-vrnn/video_structure/
[Modelarts Service Log]boot_file: video_structure/train_npu.py
[Modelarts Service Log]log_url: /tmp/log/demo.log
[Modelarts Service Log]command: video_structure/train_npu.py --data_url s3://model-train-fzk/struct-vrnn/video_structure/testdata/ --train_url s3://model-train-fzk/struct-vrnn/MA-video_structure-12-24-19-26/output/V0028/
[Modelarts Service Log]local_code_dir:
[Modelarts Service Log][modelarts_create_log] modelarts-pipe found
[Modelarts Service Log]handle inputs of training job
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log]INFO: env MA_INPUTS is not found, skip the inputs handler
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log]2021-01-22 07:51:19,997 - modelarts-downloader.py[line:612] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=False, src='obs://model-train-fzk/struct-vrnn/video_structure/', trace=False, type='common', verbose=False)
[ModelArts Service Log]2021-01-22 07:51:19,997 - modelarts-downloader.py[line:103] - ERROR: modelarts-downloader.py: Invalid input: (source URL | destination URL) is illegal
[Modelarts Service Log][modelarts_logger] modelarts-pipe found
[Modelarts Service Log]App download error:
INFO:root:Using MoXing-v1.17.3-8aa951bc
INFO:root:Using OBS-Python-SDK-3.20.7
[ModelArts Service Log]2021-01-22 07:51:19,997 - modelarts-downloader.py[line:612] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=False, src='obs://model-train-fzk/struct-vrnn/video_structure/', trace=False, type='common', verbose=False)
[ModelArts Service Log]2021-01-22 07:51:19,997 - modelarts-downloader.py[line:103] - ERROR: modelarts-downloader.py: Invalid input: (source URL | destination URL) is illegal
[Modelarts Service Log]training end at 2021-01-22-07:51:20
[Modelarts Service Log]Training completed.
What is causing this "Invalid input" error?

@Allthingsone I suggest opening a separate issue for this; someone will be assigned to answer it. I'll also ask around in parallel and let you know promptly if there is any progress.

@张韦全 I created one yesterday, but there has been no reply yet. Could you give it a nudge?

@Allthingsone OK. The response may be slow because it's the weekend.

@张韦全 The new issue still hasn't been logged. Was it missed?

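@Allthingsone OK, I'll ask again. In the meantime, please double-check the arguments passed to ModelArts: the error says a URL is illegal, so focus on whether the URLs you entered are correct.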
[ModelArts Service Log]2021-01-22 07:51:19,997 - modelarts-downloader.py[line:103] - ERROR: modelarts-downloader.py: Invalid input: (source URL | destination URL) is illegal
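
As an aid for that check, a small sketch that splits each storage URL into its scheme, bucket and key so typos or unexpected schemes stand out. The helper name `describe_storage_url` is hypothetical. The log does not say which URL the downloader rejected or what rule it applies, so this only makes the inputs easier to inspect.

```python
from urllib.parse import urlsplit


def describe_storage_url(url):
    """Split an object-storage URL into scheme, bucket and key for inspection."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "bucket": parts.netloc,
        "key": parts.path.lstrip("/"),
    }


# URLs copied from the job log in this thread.
for url in (
    "obs://model-train-fzk/struct-vrnn/video_structure/",
    "s3://model-train-fzk/struct-vrnn/video_structure/testdata/",
):
    print(describe_storage_url(url))
```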

张韦全 changed the task status from Analysing to Feedback

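@Allthingsone Hi, there are duplicate issues for this network. To make tracking and replies easier, we are closing this issue and consolidating everything under #I2EGDS: (华师大) Tracking errors when running code on ModelArts with a custom image from PyCharm.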

张韦全 changed the task status from Feedback to DONE
吴定远 changed the linked repository from Ascend/modelzoo-his to Ascend/modelzoo
