Ascend/samples

Device cannot be opened when running the samples example in multiple containers

DONE
Inference issue
Created: 2024-06-12 11:20

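1. Problem description
I tested the ResNet50 image-classification program under samples/inference/modelInference/sampleResnetAIPP/cpp inside containers and found that it fails when the program is run in more than one container.

The symptom: with only one container (A) started, the ResNet50 classification program runs normally in container A and produces inference results. With two containers started at the same time (A and B), the program still runs normally in container A, but in container B it fails to open the device with: Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)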

Normal output from container A

root@5229c5763d72:/home/samples/inference/modelInference/sampleResnetAIPP/cpp/scripts# ./sample_run.sh 
[INFO] The sample starts to run
......
[INFO]  top 1: index[162] value[0.902208] class[beagle]
[INFO]  top 2: index[161] value[0.096589] class[bassetbasset hound]
[INFO]  top 3: index[166] value[0.000621] class[Walker houndWalker foxhound]
[INFO]  top 4: index[167] value[0.000444] class[English foxhound]
[INFO]  top 5: index[163] value[0.000055] class[bloodhound sleuthhound]
......
[INFO] The program runs successfully

Error output from container B

root@8325ee2d9488:/home/samples/inference/modelInference/sampleResnetAIPP/cpp/scripts# ./sample_run.sh 
[INFO] The sample starts to run
[ERROR] DRV(114567,main):2024-06-12-01:57:05.086.409 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.086.742 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=3)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.031 [devmng_user_common.c:33][devmng] [dmanage_common_ioctl 33] Device is busy. (ret=-87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.081 [devdrv_info.c:269][devmng] [dmanage_get_container_flag 269] ioctl failed, ret(87).
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.118 [devdrv_manager.c:3293][devmng] [drvManagerInit 3293] get container flag failed.
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.242 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.499 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.087.580 [devdrv_pcie.c:788][devmng] [drvGetHostPhyMachFlag 788] open device manager failed, fd(-87). devid(0)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.089 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.303 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.546 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(114567,main):2024-06-12-01:57:05.088.588 [devdrv_manager.c:307][devmng] [drvCommonIoctl 307] open device manager failed, device is busy.
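
On Linux, errno 16 is EBUSY ("Device or resource busy"), which matches the "Device is busy" message in the log above. As a minimal sketch for condensing this kind of output, the function below counts the "Dev open failed" lines per device and errno. The name summarize_dev_open_failures is made up for illustration, and the sed pattern assumes the exact "(dev=...; errno=...; ret=...)" layout seen in this log.

```shell
#!/usr/bin/env bash
# Summarize "Dev open failed" lines from a driver log read on stdin.
# Prints one line per (device, errno) pair: "<count> <device> errno=<n>".
# Hypothetical helper; the pattern assumes the log layout shown in this issue.
summarize_dev_open_failures() {
  grep -F 'Dev open failed' \
    | sed -E 's/.*\(dev=([^;]+); errno=([0-9]+); ret=[0-9]+\).*/\1 errno=\2/' \
    | sort | uniq -c \
    | awk '{ print $1, $2, $3 }'   # normalize the padding added by uniq -c
}
```

It could be fed from the sample directly, for example: ./sample_run.sh 2>&1 | summarize_dev_open_failures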

2. Software environment
Hardware kit: Atlas 200I A2 accelerator module
OS: Ubuntu 22.04.4 LTS
CANN version: 7.0.RC1

(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc3                                 Version: 23.0.rc3                                     |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
+===============================+=================+======================================================+
| 0       310B4                 | OK              | 7.7          48                15    / 15            |
| 0       0                     | NA              | 0            2878 / 3513                             |
+===============================+=================+======================================================+

3. Test steps
Container image and docker script used for the test

(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP#  docker pull ubuntu:22.04
(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP#  docker images
REPOSITORY                                          TAG       IMAGE ID       CREATED        SIZE
ubuntu                                              22.04     2465309f578e   2 years ago    68.7MB
(base) root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP#  vim ./docker_ascend_run.sh
docker run -it  --pid=host \
--device=/dev/upgrade:/dev/upgrade \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci_manager_docker:/dev/davinci_manager \
--device=/dev/vdec:/dev/vdec \
--device=/dev/vpc:/dev/vpc \
--device=/dev/pngd:/dev/pngd \
--device=/dev/venc:/dev/venc \
--device=/dev/sys:/dev/sys \
--device=/dev/svm0 \
--device=/dev/acodec:/dev/acodec \
--device=/dev/ai:/dev/ai \
--device=/dev/ao:/dev/ao \
--device=/dev/hdmi:/dev/hdmi \
--device=/dev/ts_aisle:/dev/ts_aisle \
--device=/dev/dvpp_cmdlist:/dev/dvpp_cmdlist \
-v /usr/local/Ascend/:/usr/local/Ascend \
-v /etc/alternatives:/etc/alternatives/ \
-v /usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu \
-v /usr/lib64/:/usr/lib64/ \
-v /usr/lib:/usr/lib \
-v /etc/sys_version.conf:/etc/sys_version.conf:ro \
-v /etc/hdcBasic.cfg:/etc/hdcBasic.cfg:ro \
-v /usr/lib64/libaicpu_processer.so:/usr/lib64/libaicpu_processer.so:ro \
-v /usr/lib64/libaicpu_prof.so:/usr/lib64/libaicpu_prof.so:ro \
-v /usr/lib64/libaicpu_sharder.so:/usr/lib64/libaicpu_sharder.so:ro \
-v /usr/lib64/libadump.so:/usr/lib64/libadump.so:ro \
-v /usr/lib64/libtsd_eventclient.so:/usr/lib64/libtsd_eventclient.so:ro \
-v /usr/lib64/libaicpu_scheduler.so:/usr/lib64/libaicpu_scheduler.so:ro \
-v /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1:/usr/lib64/libcrypto.so.1.1:ro \
-v /usr/lib/aarch64-linux-gnu/libyaml-0.so.2.0.6:/usr/lib64/libyaml-0.so.2:ro \
-v /usr/lib64/libdcmi.so:/usr/lib64/libdcmi.so:ro \
-v /usr/lib64/libmpi_dvpp_adapter.so:/usr/lib64/libmpi_dvpp_adapter.so:ro \
-v /usr/lib64/aicpu_kernels/:/usr/lib64/aicpu_kernels/:ro \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
-v /usr/lib64/libstackcore.so:/usr/lib64/libstackcore.so:ro \
-v /usr/lib64/libunified_timer.so:/usr/lib64/libunified_timer.so \
-v /usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64:ro \
-v /var/slogd:/var/slogd:ro \
-v /var/dmp_daemon:/var/dmp_daemon:ro \
-v ${PWD}:${PWD} -w ${PWD} \
ubuntu:22.04 /bin/bash
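
A long list of --device options like the one above is easy to get out of sync with the device nodes actually present on the host. One possible way to assemble the flags is sketched below, under the assumption that a missing node should be skipped with a warning rather than abort the launch. build_device_flags is a hypothetical helper, not part of docker or CANN.

```shell
#!/usr/bin/env bash
# Emit one --device flag per spec whose host node exists.
# Each spec is either "host_path" or "host_path:container_path",
# the same forms used in the docker run command above.
# Hypothetical helper for illustration only.
build_device_flags() {
  local spec host
  for spec in "$@"; do
    host=${spec%%:*}            # host path is everything before the first ':'
    if [ -e "$host" ]; then
      printf -- '--device=%s\n' "$spec"
    else
      printf 'skipping missing device node: %s\n' "$host" >&2
    fi
  done
}

# Possible use in bash (not executed here):
#   mapfile -t flags < <(build_device_flags /dev/davinci0 \
#       /dev/davinci_manager_docker:/dev/davinci_manager /dev/svm0)
#   docker run -it "${flags[@]}" ubuntu:22.04 /bin/bash
```

This only tidies up how the container is launched; it does not change the device-busy behaviour described in this issue.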

Verification procedure

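# 1. Build the program (on the host)
# Build as described in README.md

# 2. Enter the container (on the host)
root@davinci-mini:/home/samples/inference/modelInference/sampleResnetAIPP# ./docker_run.sh

# 3. Set environment variables (inside the container)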
root@544119117df2:/home/samples/inference/modelInference/sampleResnetAIPP#
export LD_LIBRARY_PATH=/usr/lib64:/etc/alternatives:/usr/lib/aarch64-linux-gnu:/usr/local/Ascend/nnrt/latest:/usr/local/Ascend/nnrt/latest/lib64/:$LD_LIBRARY_PATH
export DDK_PATH=/usr/local/Ascend/ascend-toolkit/latest
export NPU_HOST_LIB=$DDK_PATH/runtime/lib64/stub
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash

# 4. Run the classification program (inside the container)
root@544119117df2:/home/samples/inference/modelInference/sampleResnetAIPP/cpp/scripts# ./sample_run.sh 
[INFO] The sample starts to run
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.239 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.587 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=3)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.643 [devmng_user_common.c:33][devmng] [dmanage_common_ioctl 33] Device is busy. (ret=-87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.679 [devdrv_info.c:269][devmng] [dmanage_get_container_flag 269] ioctl failed, ret(87).
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.752 [devdrv_manager.c:3293][devmng] [drvManagerInit 3293] get container flag failed.
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.876 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.057 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.100 [devdrv_pcie.c:788][devmng] [drvGetHostPhyMachFlag 788] open device manager failed, fd(-87). devid(0)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.573 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.769 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.921 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.956 [devdrv_manager.c:307][devmng] [drvCommonIoctl 307] open device manager failed, device is busy.
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.988 [devdrv_manager.c:545][devmng] [drvGetPlatformInfo 545] ioctl failed, ret = 87.
......

Comments (4)

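DC created the defect 12 months ago
Liuboyu changed the task type from Defect to Inference issue 12 months ago
Liuboyu edited the description 12 months ago
Liuboyu changed the task status from TODO to Analysing 12 months ago
Liuboyu set the assignee to Liuboyu 12 months ago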

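Hello, we have received your question.
You say that with two containers started at the same time (A and B), the classification program runs normally in container A but fails in container B. Does it work if only container A is started? Does it work if only container B is started?
Did you modify the sample in any way?
Could you also provide the debug log for the error?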

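How to collect the log:
Run export ASCEND_SLOG_PRINT_TO_STDOUT=1;export ASCEND_GLOBAL_LOG_LEVEL=0 (this sets the log level to debug), then run the command with >>error.log appended so that everything printed to the screen is redirected into error.log, and upload error.log for analysis.
To restore the log settings afterwards: export ASCEND_SLOG_PRINT_TO_STDOUT=0;export ASCEND_GLOBAL_LOG_LEVEL=3 (the default ERROR level)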
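
The collection steps above could be wrapped in a small helper, sketched here. The two environment variables and their values are the ones given in this comment. The function name run_with_debug_log is hypothetical, and capturing stderr as well (2>&1) is an added assumption beyond the >>error.log redirection described.

```shell
#!/usr/bin/env bash
# Run a command with Ascend debug logging enabled and append its output to a log file.
# Usage: run_with_debug_log <logfile> <command> [args...]
# Hypothetical helper; stderr capture (2>&1) is an assumption, not from the thread.
run_with_debug_log() {
  local logfile
  logfile=$1
  shift
  (
    # The subshell keeps these settings from leaking into the calling shell,
    # so no explicit restore step is needed afterwards.
    export ASCEND_SLOG_PRINT_TO_STDOUT=1
    export ASCEND_GLOBAL_LOG_LEVEL=0
    "$@"
  ) >>"$logfile" 2>&1
}
```

For example: run_with_debug_log error.log ./sample_run.sh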

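Thanks for the reply. Container A runs fine when started on its own, and container B also runs fine when started on its own. The device-open failure occurs only when two or more containers are started at the same time.
The sample code was not modified in any way. The goal is to verify NPU inference running in multiple containers.
One more useful detail: if the --privileged option is added to the command that starts the containers, the device-open failure no longer occurs with multiple containers. However, --privileged also brings security risks. How can multiple containers use the NPU for inference without the --privileged option?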

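I could not find a way to add an attachment, and error.log has more than ten thousand lines. The error messages are essentially the log below.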

[INFO] The sample starts to run
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.239 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.587 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=3)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.643 [devmng_user_common.c:33][devmng] [dmanage_common_ioctl 33] Device is busy. (ret=-87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.679 [devdrv_info.c:269][devmng] [dmanage_get_container_flag 269] ioctl failed, ret(87).
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.752 [devdrv_manager.c:3293][devmng] [drvManagerInit 3293] get container flag failed.
[ERROR] DRV(115866,main):2024-06-12-02:54:59.141.876 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.057 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.100 [devdrv_pcie.c:788][devmng] [drvGetHostPhyMachFlag 788] open device manager failed, fd(-87). devid(0)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.573 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.769 [uda_user.c:110][tsdrv] [uda_dev_access 110] Dev open failed. (dev=/dev/davinci0; errno=16; ret=87)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.921 [devdrv_manager.c:244][devmng] [devdrv_open_device_manager 244] devdrv_do_container return error. (ret=87, fd=5)
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.956 [devdrv_manager.c:307][devmng] [drvCommonIoctl 307] open device manager failed, device is busy.
[ERROR] DRV(115866,main):2024-06-12-02:54:59.142.988 [devdrv_manager.c:545][devmng] [drvGetPlatformInfo 545] ioctl failed, ret = 87.
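
When the full debug log is too large to post, one option is to collapse it into a short digest of distinct error messages. The sketch below keeps only [ERROR] lines, strips the "DRV(pid,main):timestamp" prefix so repeated messages compare equal, and prints each distinct message once with its count, in order of first appearance. error_digest is a hypothetical name, and the prefix pattern assumes the line layout shown in the log above.

```shell
#!/usr/bin/env bash
# Collapse a driver log (stdin) into distinct [ERROR] messages with counts.
# Output lines look like "<count>x <message>", ordered by first occurrence.
# Hypothetical helper; the prefix pattern assumes the layout seen in this issue.
error_digest() {
  grep -F '[ERROR]' \
    | sed -E 's/^\[ERROR\] [A-Z]+\([0-9]+,[^)]*\):[0-9.:-]+ //' \
    | awk '
        !seen[$0]++ { order[++n] = $0 }   # remember first-seen order
        { count[$0]++ }                   # count every occurrence
        END { for (i = 1; i <= n; i++) print count[order[i]] "x " order[i] }
      '
}
```

For example: error_digest < error.log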

https://www.hiascend.com/document/detail/zh/Atlas 200I A2/23.0.0/EP/installationguide/Install_52.html

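This document contains the following statement; please take a look:

A Device in the runtime environment can be used by only one container at a time. Only after the container using that Device exits can the Device be used by another container.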
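
Given the one-container-per-Device restriction quoted above, a launcher script could check whether a running container already has the device mapped before starting another one. The following is only a sketch: it relies on docker ps -q and on the HostConfig.Devices field reported by docker inspect, the function names are hypothetical, and the check is not atomic, so two launches started at the same moment could both pass it. Only devices passed explicitly with --device are listed.

```shell
#!/usr/bin/env bash
# Print the host device paths mapped into all running containers, one per line.
# Hypothetical helper; requires the docker CLI on the host.
list_mapped_host_devices() {
  local ids
  ids=$(docker ps -q) || return 1
  [ -z "$ids" ] && return 0
  # Word-splitting of the ID list is intended here.
  # shellcheck disable=SC2086
  docker inspect \
    --format '{{range .HostConfig.Devices}}{{println .PathOnHost}}{{end}}' $ids
}

# Succeed (exit 0) if the device path given as $1 appears as a whole line on stdin.
device_is_claimed() {
  grep -Fxq -- "$1"
}

# Possible use before launching a second container (not executed here):
#   if list_mapped_host_devices | device_is_claimed /dev/davinci0; then
#     echo "/dev/davinci0 is already mapped into a running container" >&2
#     exit 1
#   fi
```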

Liuboyu changed the task status from Analysing to DONE 12 months ago
