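1. Problem description
I ran the ResNet50 image classification sample under samples/inference/modelInference/sampleResnetAIPP/cpp inside containers and found that it fails when more than one container is running.
With only container A started, the ResNet50 classification program runs normally in container A and returns inference results. With two containers started at the same time (container A and container B), the program still runs normally in container A, but in container B it fails to open the device with Dev open failed. (dev=/dev/davinci0; errno=16; ret=87).
Normal output in container A: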
root@5229c5763d72:/home/samples/inference/modelInference/sampleResnetAIPP/cpp/scripts# ./sample_run.sh
[INFO] The sample starts to run
......
[INFO] top 1: index[162] value[0.902208] class[beagle]
[INFO] top 2: index[161] value[0.096589] class[bassetbasset hound]
[INFO] top 3: index[166] value[0.000621] class[Walker houndWalker foxhound]
[INFO] top 4: index[167] value[0.000444] class[English foxhound]
[INFO] top 5: index[163] value[0.000055] class[bloodhound sleuthhound]
......
[INFO] The program runs successfully
Error output in container B:
root@8325ee2d9488:/home/samples/inference/modelInference/sampleResnetAIPP/cpp/scripts# ./sample_run.sh
[INFO] The sample starts to run
[ERROR] DRV(114567,main):2024-06-12-01:57:05.086.409 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.086.742 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=3)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.031 [devmng_user_common.c:33][devmng] [dmanage_common_ioctl 33] Device is busy. (ret=-87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.081 [devdrv_info.c:269][devmng] [dmanage_get_container_flag 269] ioctl failed, ret(87).
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.118 [devdrv_manager.c:3293][devmng] [drvManagerInit 3293] get container flag failed.
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.242 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.499 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.580 [devdrv_pcie.c:788][devmng] [drvGetHostPhyMachFlag 788] open device manager failed, fd(-87). devid(0)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.089 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.303 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.546 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.588 [devdrv_manager.c:307][devmng] [drvCommonIoctl 307] open device manager failed, device is busy.
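On Linux, errno 16 is EBUSY ("Device or resource busy"), which agrees with the driver's own "Device is busy" message in the log above. The following is a minimal sketch for condensing such a log into one line per failing device. The function names summarize_dev_open_failures and errno_name are hypothetical helpers written for this thread, not part of CANN or the driver, and the sed pattern assumes the exact "Dev open failed. (dev=...; errno=...; ret=...)" format shown above.

```shell
#!/bin/sh
# Hypothetical helper: summarize "Dev open failed" lines from a driver log on stdin.
# Prints one line per distinct (device, errno, ret) combination with its count.
summarize_dev_open_failures() {
    grep 'Dev open failed' |
        sed -n 's/.*(dev=\([^;]*\); errno=\([0-9]*\); ret=\([0-9]*\)).*/\1 errno=\2 ret=\3/p' |
        sort | uniq -c | awk '{print $2, $3, $4, "count=" $1}'
}

# Hypothetical helper: map a few Linux errno numbers to their symbolic names.
# Only a handful of values are listed; anything else is reported as UNKNOWN.
errno_name() {
    case "$1" in
        2)  echo "ENOENT" ;;
        13) echo "EACCES" ;;
        16) echo "EBUSY" ;;
        *)  echo "UNKNOWN" ;;
    esac
}
```

For example, ./sample_run.sh 2>&1 | summarize_dev_open_failures would reduce the repeated error lines to a single summary line per device.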
2. Software environment
Kit: Atlas 200I A2 accelerator module
OS: Ubuntu 22.04.4 LTS
CANN version: 7.0.RC1
(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc3 Version: 23.0.rc3 |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page) |
| Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) |
+===============================+=================+======================================================+
| 0 310B4 | OK | 7.7 48 15 / 15 |
| 0 0 | NA | 0 2878 / 3513 |
+===============================+=================+======================================================+
3. Test steps
Container image and docker launch script:
(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# docker pull ubuntu:22.04
(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 22.04 2465309f578e 2 years ago 68.7MB
(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# vim ./docker_ascend_run.sh
docker run -it --pid=host \
--device=/dev/upgrade:/dev/upgrade \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci_manager_docker:/dev/davinci_manager \
--device=/dev/vdec:/dev/vdec \
--device=/dev/vpc:/dev/vpc \
--device=/dev/pngd:/dev/pngd \
--device=/dev/venc:/dev/venc \
--device=/dev/sys:/dev/sys \
--device=/dev/svm0 \
--device=/dev/acodec:/dev/acodec \
--device=/dev/ai:/dev/ai \
--device=/dev/ao:/dev/ao \
--device=/dev/hdmi:/dev/hdmi \
--device=/dev/ts_aisle:/dev/ts_aisle \
--device=/dev/dvpp_cmdlist:/dev/dvpp_cmdlist \
-v /usr/local/Ascend/:/usr/local/Ascend \
-v /etc/alternatives:/etc/alternatives/ \
-v /usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu \
-v /usr/lib64/:/usr/lib64/ \
-v /usr/lib:/usr/lib \
-v /etc/sys_version.conf:/etc/sys_version.conf:ro \
-v /etc/hdcBasic.cfg:/etc/hdcBasic.cfg:ro \
-v /usr/lib64/libaicpu_processer.so:/usr/lib64/libaicpu_processer.so:ro \
-v /usr/lib64/libaicpu_prof.so:/usr/lib64/libaicpu_prof.so:ro \
-v /usr/lib64/libaicpu_sharder.so:/usr/lib64/libaicpu_sharder.so:ro \
-v /usr/lib64/libadump.so:/usr/lib64/libadump.so:ro \
-v /usr/lib64/libtsd_eventclient.so:/usr/lib64/libtsd_eventclient.so:ro \
-v /usr/lib64/libaicpu_scheduler.so:/usr/lib64/libaicpu_scheduler.so:ro \
-v /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1:/usr/lib64/libcrypto.so.1.1:ro \
-v /usr/lib/aarch64-linux-gnu/libyaml-0.so.2.0.6:/usr/lib64/libyaml-0.so.2:ro \
-v /usr/lib64/libdcmi.so:/usr/lib64/libdcmi.so:ro \
-v /usr/lib64/libmpi_dvpp_adapter.so:/usr/lib64/libmpi_dvpp_adapter.so:ro \
-v /usr/lib64/aicpu_kernels/:/usr/lib64/aicpu_kernels/:ro \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
-v /usr/lib64/libstackcore.so:/usr/lib64/libstackcore.so:ro \
-v /usr/lib64/libunified_timer.so:/usr/lib64/libunified_timer.so \
-v /usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64:ro \
-v /var/slogd:/var/slogd:ro \
-v /var/dmp_daemon:/var/dmp_daemon:ro \
-v ${PWD}:${PWD} -w ${PWD} \
ubuntu:22.04 /bin/bash
Verification procedure
# 1. Build the program (on the host)
# Build according to README.md
# 2. Enter the container (on the host)
root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# ./docker_ascend_run.sh
# 3. Set environment variables (inside the container)
root@544119117df2:/home/samples/inference/modelInference/sampleResnetAIPP#
export LD_LIBRARY_PATH=/usr/lib64:/etc/alternatives:/usr/lib/aarch64-linux-gnu:/usr/local/Ascend/nnrt/latest:/usr/local/Ascend/nnrt/latest/lib64/:/usr/lib64:$LD_LIBRARY_PATH
export DDK_PATH=/usr/local/Ascend/ascend-toolkit/latest
export NPU_HOST_LIB=$DDK_PATH/runtime/lib64/stub
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
# 4. Run the classification program (inside the container)
root@544119117df2:/home/samples/inference/modelInference/sampleResnetAIPP/cpp/scripts# ./sample_run.sh
[INFO] The sample starts to run
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.239 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.587 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=3)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.643 [devmng_user_common.c:33][devmng] [dmanage_common_ioctl 33] Device is busy. (ret=-87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.679 [devdrv_info.c:269][devmng] [dmanage_get_container_flag 269] ioctl failed, ret(87).
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.752 [devdrv_manager.c:3293][devmng] [drvManagerInit 3293] get container flag failed.
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.876 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.057 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.100 [devdrv_pcie.c:788][devmng] [drvGetHostPhyMachFlag 788] open device manager failed, fd(-87). devid(0)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.573 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.769 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.921 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.956 [devdrv_manager.c:307][devmng] [drvCommonIoctl 307] open device manager failed, device is busy.
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.988 [devdrv_manager.c:545][devmng] [drvGetPlatformInfo 545] ioctl failed, ret = 87.
......
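Hello, we have received your question.
You say that with two containers started at the same time (container A and container B), the classification program runs normally in container A but fails in container B. Does it work if only container A is started? Does it work if only container B is started?
Did you modify the sample in any way?
Please also provide the debug log for the error.
To collect the log:
Run export ASCEND_SLOG_PRINT_TO_STDOUT=1; export ASCEND_GLOBAL_LOG_LEVEL=0 (this sets the log level to debug), then run your command with >>error.log appended so that everything printed to the screen is redirected to error.log, and upload error.log for analysis.
To restore the log settings afterwards, run: export ASCEND_SLOG_PRINT_TO_STDOUT=0; export ASCEND_GLOBAL_LOG_LEVEL=3 (the default ERROR level)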
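The log-collection steps above can be sketched as a small shell helper. This is only an illustration: run_with_debug_log and extract_errors are hypothetical names, and redirecting stderr with 2>&1 is an addition not mentioned in the reply (the original instruction redirects only stdout). The two environment variables and their values are the ones given above. Because they are set only for the one command via env, nothing has to be restored afterwards.

```shell
#!/bin/sh
# Hypothetical helper: run a command with Ascend debug logging enabled and
# append everything it prints to a log file.
# Usage: run_with_debug_log error.log ./sample_run.sh
run_with_debug_log() {
    log_file=$1
    shift
    # The variables apply to this one command only, so the shell's own
    # settings are left untouched.
    env ASCEND_SLOG_PRINT_TO_STDOUT=1 ASCEND_GLOBAL_LOG_LEVEL=0 "$@" >>"$log_file" 2>&1
}

# Hypothetical helper: print only the [ERROR] lines of a log file,
# dropping exact repeats while keeping first-seen order.
extract_errors() {
    grep -F '[ERROR]' "$1" | awk '!seen[$0]++'
}
```

When the full debug log is too long to post, extract_errors error.log gives a much shorter excerpt that can be pasted into a comment alongside the complete file.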
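Thanks for the reply. Container A runs fine when started on its own, and so does container B when started on its own. The device-open error occurs only when two or more containers are started at the same time.
The sample is completely unmodified. My goal is to verify NPU inference running in multiple containers.
One more useful data point: if the --privileged option is added to the command that starts the containers, the device-open failure no longer occurs when multiple containers are started. However, --privileged also brings security risks. How can multiple containers use the NPU for inference without the --privileged option?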
I could not find a way to attach a file, and error.log is more than ten thousand lines long. The relevant errors are the same as the log already shown under step 4 of the verification procedure above (PID 115866), from the first Dev open failed. (dev=/dev/davinci0; errno=16; ret=87) line through the final drvGetPlatformInfo ioctl failed, ret = 87 line.
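The documentation contains the following statement; please take a look:
A Device in the runtime environment can be used by only one container at a time. Another container can use the Device only after the container currently using it has exited.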
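Given that restriction, one way to avoid the failure is to check whether a running container already has the device mapped before starting another one. The sketch below is an assumption-laden illustration, not an official Ascend tool: containers_using_device and list_container_devices are hypothetical names, and the check relies on the HostConfig.Devices field reported by docker inspect. It only sees devices passed with --device, so a container started with --privileged will not show up, and a mapped device is not necessarily open at that moment.

```shell
#!/bin/sh
# Hypothetical helper: read lines of the form
#   <container-name> <host-device> [<host-device> ...]
# on stdin and print the names of containers that map the given host device.
containers_using_device() {
    awk -v dev="$1" '{ for (i = 2; i <= NF; i++) if ($i == dev) { print $1; break } }'
}

# Hypothetical helper: list running containers with their mapped host devices,
# one container per line. Requires the docker CLI.
list_container_devices() {
    ids=$(docker ps -q)
    [ -n "$ids" ] || return 0
    # Word splitting of $ids is intended: one argument per container ID.
    docker inspect --format '{{.Name}}{{range .HostConfig.Devices}} {{.PathOnHost}}{{end}}' $ids
}

# Example: warn before starting a second container on the same device.
# holders=$(list_container_devices | containers_using_device /dev/davinci0)
# [ -z "$holders" ] || echo "/dev/davinci0 is already mapped into: $holders"
```

Keeping the matching logic separate from the docker call means it can be tried on canned input without a running Docker daemon.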