76 Star 222 Fork 168

Ascend / modelzoo

 / 详情

【众智】【武汉大学】【VSL】使用msame工具出现om模型加载失败

DONE
推理问题
创建于  
2021-11-17 13:34

一、问题现象(附报错日志上下文):
首先通过atc转换命令将pb模型转换为om模型后,然后使用msame工具对om模型进行推理验证时报出了om模型加载失败的错误。如下图所示:
输入图片说明
输入图片说明

二、软件版本:
-- CANN 版本 (e.g., CANN 3.0.x,5.x.x): cann5.0.4
--Tensorflow版本:1.15
--Python 版本 (e.g., Python 3.7.5):3.7.5
--操作系统版本 (e.g., Ubuntu 18.04):华为镜像服务器

四、日志信息:
使用msame推理是打印出error的输出如下:

[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.665 [logger.cc:1477] 11353 KernelTaskEventLogProc: device_id=0, stream_id=7, task_id=1, kernel_name=(none), task_type=AiCpuKernel, task_launched_num=13
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.672 [npu_driver.cc:376] 11353 CommandSend: Command send success, device_id=0, ts_id=0, sq_id=7, reportCount=1, command=0xffc0070040, cmdCount=1.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.682 [api_impl.cc:455] 11350 StreamSynchronize: stream_id=7
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.689 [osal.cc:262] 11350 CreateNotifier: Runtime_alloc_size 104
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.702 [npu_driver.cc:350] 11353 CommandOccupy: sqId=7, deviceId=0, tsId=0, command=0xffc0070080, cmdCount=1.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.708 [engine.cc:840] 11353 SendingRun: CommandOccupy. sqId=7, cqId=0, deviceId=0, retCode=0.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.713 [task.cc:1652] 11353 ToCommandBody: streamId=7, task_id=2, eventid=1031, flag=0xe, Task flag=0xe, timelineBase=0xffffffffffffffff, offset=0xffffffff, waitCqflag=0, waitCqId=0
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.720 [stream.cc:883] 11353 AddTaskToStream: recorded public task to stream, stream_id=7, task_id=2, task_type=2, head=1, tail=3
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.726 [logger.cc:1520] 11353 TaskLaunchedEx: device_id=0, stream_id=7, task_id=2, event_id=1031,task_type=EventRecord, task_launched_num=14
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.188.732 [npu_driver.cc:376] 11353 CommandSend: Command send success, device_id=0, ts_id=0, sq_id=7, reportCount=2, command=0xffc0070080, cmdCount=1.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.190.375 [engine.cc:651] 11354 ProcessTaskReport: RTS_DRIVER: report receive, stream_id=7, sq_id=7, task_id=1, sq_head=3
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.190.386 [task.cc:872] 11354 SetResult: Kernel payLoad=2416312362, highTaskId=0, deviceId=0, rtCode=0x2a, errorTaskId=576, errorStreamId=96
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.190.397 [engine.cc:964]11354 ReportExceptProc:Task exception! device_id=0, stream_id=7, task_id=1, type=1, retCode=0x2a.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.190.408 [device_error_proc.cc:625] 11354 ProcErrorInfo: Begin to process device error info.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.191.933 [device_error_proc.cc:573] 11354 ProcessOneElementInRingBuffer: it needs process 1 error messages, tschVer=2, headSize=24, elementSize=11350.
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.191.948 [device_error_proc.cc:554]11354 ProcessAicpuErrorInfo:An exception occurred during AICPU execution, stream_id:0, task_id:0, errcode:4294967303, msg:aicpu execute failed.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.191.955 [device_error_proc.cc:665] 11354 ProcErrorInfo: finished to process device error info, retCode=0.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.074 [engine.cc:973] 11354 ReportExceptProc: excptCallBack_ is null.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.086 [stream.cc:929] 11354 TryDelRecordedTask: del public task from stream, stream_id=7, tailTaskId=1, delTaskId=1, head=2, tail=3
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.094 [logger.cc:1583] 11354 TaskFinished: device_id=0, stream_id=7, task_id=1, task_type=1,task_finish_num=13
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.114 [task.cc:806]11354 PreCheckTaskErr:Kernel task happen error, retCode=0x2a, [aicpu exception].
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.134 [task.cc:679]11354 GetFirstExtendInfoForAicpuTask:Get extend info for tf kernel, device_id=0, stream_id=7, task_id=1, extInfoLen=0, extInfoAddr=0. extInfoLen is invalid, valid range is [8, 18446744073709551615].
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.184 [task.cc:615]11354 PrintAicpuErrorInfo:Aicpu kernel execute failed, device_id=0, stream_id=7, task_id=1, fault so_name=, fault kernel_name=, fault op_name=, extend_info=.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.192 [task.cc:108] 11354 TaskFailCallBack: rtCode=0x715002a,[aicpu exception], errorTaskId=1, errorStreamId=7
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.200 [engine.cc:577] 11354 ProcessTask: ProcessTask.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.207 [engine.cc:651] 11354 ProcessTaskReport: RTS_DRIVER: report receive, stream_id=7, sq_id=7, task_id=2, sq_head=3
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.216 [engine.cc:651] 11354 ProcessTaskReport: RTS_DRIVER: report receive, stream_id=7, sq_id=7, task_id=2, sq_head=3
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.222 [stream.cc:929] 11354 TryDelRecordedTask: del public task from stream, stream_id=7, tailTaskId=2, delTaskId=2, head=3, tail=3
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.228 [logger.cc:1583] 11354 TaskFinished: device_id=0, stream_id=7, task_id=2, task_type=2,task_finish_num=14
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.239 [engine.cc:577] 11354 ProcessTask: ProcessTask.
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.256 [stream.cc:704]11350 GetError:[LOAD][DEFAULT]Stream Synchronize failed, stream_id=7 retCode=0x2a, [aicpu exception].
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.273 [logger.cc:285]11350 StreamSynchronize:[LOAD][DEFAULT]Stream synchronize failed, stream = 0x55ab46e6ff10
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.299 [api_c.cc:597]11350 rtStreamSynchronize:[LOAD][DEFAULT]ErrCode=507018, desc=[aicpu exception], InnerCode=0x715002a
[ERROR] RUNTIME(11350,msame):2021-11-17-13:01:59.192.307 [error_message_manage.cc:41]11350 ReportFuncErrorReason:[LOAD][DEFAULT]rtStreamSynchronize execute failed, reason=[aicpu exception]
[ERROR] GE(11350,msame):2021-11-17-13:01:59.192.337 [model_manager.cc:163]11350 KernelLaunchEx: ErrorNo: 1343225859(Failed to call runtime API!) [LOAD][DEFAULT][Call][RtStreamSynchronize] failed. ret:0x7BC8A
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.351 [logger.cc:356] 11350 DevFree: Device memory free. mem=0x108080059000
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.362 [stream.cc:609] 11350 TearDown: stream_id=7, delay recycle task num=0.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.379 [npu_driver.cc:350] 11353 CommandOccupy: sqId=7, deviceId=0, tsId=0, command=0xffc00700c0, cmdCount=1.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.385 [engine.cc:840] 11353 SendingRun: CommandOccupy. sqId=7, cqId=0, deviceId=0, retCode=0.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.392 [stream.cc:883] 11353 AddTaskToStream: recorded public task to stream, stream_id=7, task_id=3, task_type=6, head=3, tail=4
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.398 [logger.cc:1572] 11353 TaskLaunchedEx: device_id=0, stream_id=7, task_id=3, task_type=6,task_launched_num=15
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.404 [npu_driver.cc:376] 11353 CommandSend: Command send success, device_id=0, ts_id=0, sq_id=7, reportCount=1, command=0xffc00700c0, cmdCount=1.
[ERROR] GE(11350,msame):2021-11-17-13:01:59.192.420 [kernel_ex_task_info.cc:173]11350 Init: ErrorNo: 507018() [LOAD][DEFAULT][Create][AicpuSession] error. session id:15056924531701601793
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.440 [engine.cc:651] 11354 ProcessTaskReport: RTS_DRIVER: report receive, stream_id=7, sq_id=7, task_id=3, sq_head=4
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.446 [stream.cc:929] 11354 TryDelRecordedTask: del public task from stream, stream_id=7, tailTaskId=3, delTaskId=3, head=4, tail=4
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.452 [logger.cc:1583] 11354 TaskFinished: device_id=0, stream_id=7, task_id=3, task_type=6,task_finish_num=15
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.461 [stream.cc:381] 11354 FreeStreamId: free stream_id=7,
[ERROR] GE(11350,msame):2021-11-17-13:01:59.192.483 [davinci_model.cc:3228]11350 InitTaskInfo: ErrorNo: 507018() [LOAD][DEFAULT][Init][Task] index:62 failed, ret:507018.
[ERROR] GE(11350,msame):2021-11-17-13:01:59.192.491 [davinci_model.cc:489]11350 DoTaskSink: ErrorNo: 4294967295(failed) [LOAD][DEFAULT][Init][TaskInfo] failed, model_id:1.
[ERROR] GE(11350,msame):2021-11-17-13:01:59.192.499 [davinci_model.cc:643]11350 Init: ErrorNo: 4294967295(failed) [LOAD][DEFAULT][Call][DoTaskSink] failed, model_id:1.
[ERROR] GE(11350,msame):2021-11-17-13:01:59.192.506 [model_manager.cc:1400]11350 LoadModelOffline: ErrorNo: 4294967295(failed) [LOAD][DEFAULT][Init][DavinciModel] failed, ret:507018.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.535 [npu_driver.cc:506] 11354 NormalSqCqFree: Normal sqcq free start, sqId=7, cqId=0, deviceId=0, tsId=0.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.545 [stream.cc:168] 11354 DeAllocStreamSqCq: [SqCqManage]success to release sq, sq_id=7, cq_id=0, stream_id=7, is_sq_need_release=1, is_cq_need_reuse=1.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.554 [npu_driver.cc:275] 11354 StreamIdFree: Stream id free success, stream_id=7, device_id=0, tsId=0
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.569 [engine.cc:577] 11354 ProcessTask: ProcessTask.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.587 [logger.cc:764] 11350 ModelUnbindStream: model unbind stream, model_id=0, stream_id=6,
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.595 [osal.cc:262] 11350 CreateNotifier: Runtime_alloc_size 104
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.612 [npu_driver.cc:350] 11353 CommandOccupy: sqId=6, deviceId=0, tsId=0, command=0xffc0060080, cmdCount=1.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.617 [engine.cc:840] 11353 SendingRun: CommandOccupy. sqId=6, cqId=0, deviceId=0, retCode=0.
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.623 [task.cc:1652] 11353 ToCommandBody: streamId=6, task_id=2, eventid=1030, flag=0xe, Task flag=0xe, timelineBase=0xffffffffffffffff, offset=0xffffffff, waitCqflag=0, waitCqId=0
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.192.633 [stream.cc:874] 11353 AddTaskToStream: recorded persistent task to stream, stream_id=6, task_id=2, task_type=2, head=0, tail=1, delay recycle task num=1
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.206.618 [logger.cc:418] 11350 ManagedMemFree: managed memory free. mem = 0x10804006f200
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.206.623 [logger.cc:356] 11350 DevFree: Device memory free. mem=0x108080055000
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.206.628 [logger.cc:418] 11350 ManagedMemFree: managed memory free. mem = 0x10804006f400
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.206.633 [logger.cc:356] 11350 DevFree: Device memory free. mem=0x108080056000
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.206.638 [logger.cc:418] 11350 ManagedMemFree: managed memory free. mem = 0x10804006f600
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.206.643 [logger.cc:356] 11350 DevFree: Device memory free. mem=0x108080057000
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.268.532 [logger.cc:418] 11350 ManagedMemFree: managed memory free. mem = 0x10804006f800
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.268.554 [logger.cc:356] 11350 DevFree: Device memory free. mem=0x108080058000
[ERROR] GE(11350,msame):2021-11-17-13:01:59.280.191 [graph_loader.cc:230]11350 LoadModelFromData: ErrorNo: 507018() [LOAD][DEFAULT][Load][Model] failed, model_id:1.
[ERROR] ASCENDCL(11350,msame):2021-11-17-13:01:59.283.041 [model.cpp:226]11350 ModelLoadFromFileWithMem: [LOAD][DEFAULT][Model][FromData]load model from data failed, ge result[507018]
[ERROR] load model from file failed, model file is /home/HwHiAiUser/AscendProjects/modelnet10.om
[ERROR] load model from file failed
[WARN] no model had been loaded, unload failed
[ERROR] Sample process failed
[INFO] ASCENDCL(11350,msame):2021-11-17-13:01:59.287.416 [stream.cpp:46]11350 aclrtDestroyStream: start to execute aclrtDestroyStream
[INFO] GE(11350,msame):2021-11-17-13:01:59.287.503 [single_op_manager.cc:65]11350 ReleaseResource:ReleaseResource in. resource id = 0x55ab46718a70
[INFO] GE(11350,msame):2021-11-17-13:01:59.287.584 [graph_caching_allocator.cc:372]11350 TryFreeBlocks:Try free blocks.
[INFO] GE(11350,msame):2021-11-17-13:01:59.287.648 [graph_caching_allocator.cc:328]11350 FreeCachedBlocks:Free cached blocks
[EVENT] GE(11350,msame):2021-11-17-13:01:59.287.707 [graph_caching_allocator.cc:440]11350 PrintStatics:Called counts[malloc:          0 free:          0].
[EVENT] GE(11350,msame):2021-11-17-13:01:59.287.772 [graph_caching_allocator.cc:390]11350 PrintCount:Malloc total[size:          0 count:          0].
[EVENT] GE(11350,msame):2021-11-17-13:01:59.287.830 [graph_caching_allocator.cc:390]11350 PrintCount: Using total[size:          0 count:          0].
[EVENT] GE(11350,msame):2021-11-17-13:01:59.287.882 [graph_caching_allocator.cc:390]11350 PrintCount:  Free total[size:          0 count:          0].
[INFO] ASCENDCL(11350,msame):2021-11-17-13:01:59.287.936 [stream_executor.cpp:181]11350 RemoveExecutor: To remove executor by stream = 94194109614704
[INFO] ASCENDCL(11350,msame):2021-11-17-13:01:59.288.004 [stream.cpp:51]11350 aclrtDestroyStream: release singleop success
[INFO] RUNTIME(11350,msame):2021-11-17-13:01:59.288.235 [stream.cc:609] 11350 TearDown: stream_id=4, delay recycle task num=0.
[INFO] ASCENDCL(11350,msame):2021-11-17-13:01:59.288.316 [stream.cpp:57]11350 aclrtDestroyStream: aclrtDestroyStream success
[INFO] end to destroy stream

使用的atc转换命令为:

atc --model=/home/HwHiAiUser/AscendProjects/VSL/pb_model/modelnet10.pb --framework=3 --output=/home/HwHiAiUser/AscendProjects/modelnet10 --soc_version=Ascend310 --input_shape="Placeholder:100,30,30,30,1" --log=info --out_nodes="latent_feature:0"

使用的msame推理命令为:

./msame --model "/home/HwHiAiUser/AscendProjects/modelnet10.om" --input "/home/HwHiAiUser/AscendProjects/VSL/ModelNet10.bin" --output "/home/HwHiAiUser/AscendProjects/VSL/om/" --outfmt TXT --loop 1

该过程中使用到的pb模型和om模型链接为:
URL:
https://e-share.obs-website.cn-north-1.myhuaweicloud.com?token=kgEgUsFhjaxRtkatN8fRvEKqNmGdhWHz1UyQoJdfU7rrfMD9NrS3GM1+vwgyl6yK/CBdVKT6dXhDCAUmzlHlH1h7b5VSkHXff8erm4qUqiNuUGNqgI0//HxtKQBtZefAccAIf2o4r7hkILQ3I9f4RNrI3C9PgDPtl79PKACumCKDx6F8hRU/gZWhk3D4os72ALusNVan09k4AMJp9vDvs0qB32hFRBi8D3FumiNXWegksOWvFhT0ZdZQPD+QHPP/hJFozxn3DXOIbq5dUtQp0h0KgWyjYHVrocvu7jA3zd7k4BWG3WNGoGprlpP7k7jUF0M8cqwXK68n7BVBZx/MfzhAj8p1VYO0NOIl+xf6VFavMxnthip0lJ402bjvpR85t7PWstVUnTtZPRpQm+WwhdKD7OgHKtkFd5qjtaeeBB7+6XXLNsq01BWGIVsk8Z5bSQczXPWL1Udeh6Lola8Efit+YlUFCgVms9ZUA+2f8K2xCiuk3+14dKKyNPXgedr6N3bnFA/QvcrrqPbU7yCWwmO8gZip/eTU8gN0HE31SAkpQUBltaOfVbbS1J2ZGRJ3LbkjBWt2+Y0PkKqpu97s+LoEDrdfbZLWDUmiQTIbzBs=

提取码:
123456

*有效期至: 2022/11/12 13:28:32 GMT+08:00

评论 (2)

wentaozou 创建了推理问题
wangxiaodan1103 任务状态TODO 修改为Analysing
wangxiaodan1103 负责人设置为张晓龙
wentaozou 修改了描述
展开全部操作日志

msame版本配套问题,当前ModelArts的离线推理环境是5.0.3前的版本,CANN包和msame需要使用配套版本。

调整版本后,离线推理流程已经通过。

颜亚文 任务状态Analysing 修改为DONE
吴定远 关联仓库Ascend/modelzoo-his 修改为Ascend/modelzoo

登录 后才可以发表评论

状态
负责人
项目
里程碑
Pull Requests
关联的 Pull Requests 被合并后可能会关闭此 issue
分支
开始日期   -   截止日期
-
置顶选项
优先级
预计工期 (小时)
参与者(2)
1
https://gitee.com/ascend/modelzoo.git
git@gitee.com:ascend/modelzoo.git
ascend
modelzoo
modelzoo

搜索帮助

53164aa7 5694891 3bd8fe86 5694891