GPU (Nvidia) container instances intermittently lose device node permissions

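Version: openeuler-22.03sp1 + systemd-249
Description: Preliminary analysis indicates that when systemd refreshes all units (daemon-reload, SetUnitProperties, and similar operations), the GPU device node permissions are lost: the refresh overwrites the correct permissions that had been written for the GPU.
Problem: The device nodes exist inside the GPU container instance but cannot be used; calls fail with "Failed to initialize NVML: Unknown Error".
For example: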

# nvidia-smi
Failed to initialize NVML: Unknown Error

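Inspecting devices.list under the container instance's cgroup directory shows that the Nvidia device permissions are missing from the instance.
Abnormal: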

# cat /sys/fs/cgroup/devices/kubepods/kubepods-burstable.slice/***/devices.list
b *:* m
c *:* m

Normal:

# cat devices.list
b *:* m
c *:* m
c 555:0 rw
c 555:1 rw
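
The comparison between the abnormal and normal listings can be automated. The following is a minimal sketch, assuming the `type major:minor access` line format shown in the listings above; the function names and the expected rule set are illustrative and not part of any existing tool.

```python
def parse_devices_list(text):
    """Parse devices.list content into a set of (type, 'major:minor', access) tuples."""
    rules = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rules.add(tuple(parts))
    return rules


def missing_rules(text, expected):
    """Return the expected rules that are absent from the devices.list content."""
    return sorted(set(expected) - parse_devices_list(text))


# Contents copied from the abnormal and normal listings in this report.
abnormal = "b *:* m\nc *:* m\n"
normal = "b *:* m\nc *:* m\nc 555:0 rw\nc 555:1 rw\n"
gpu_rules = [("c", "555:0", "rw"), ("c", "555:1", "rw")]

print(missing_rules(abnormal, gpu_rules))
print(missing_rules(normal, gpu_rules))
```

Run periodically against each container cgroup, a check like this would flag the moment the GPU entries disappear, which helps with an intermittent fault.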

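Because the failure is intermittent, bpftrace can be used to monitor devices.list system-wide and capture every program that modifies it over a period of time. Apart from bash, only three programs modified the cgroup devices.list in that window: systemd did so extremely often, followed by runc and nvidia-container. Of these, systemd performs the high-risk operation of denying permissions for all devices. runc creates the container instance and sets its resource usage and permissions by calling the interfaces that systemd provides.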

cat cg.log
2 bash
68 nvidia-containe
70 runc
3289 systemd


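According to the kernel documentation Documentation/cgroup-v1/devices.txt:

devices.list (read-only)
devices.allow (write-only)
devices.deny (write-only)
Relationship between the three files:
 write via |devices.allow| -> |devices.list (holds the device node entries)| <- remove via |devices.deny|

Judging from the implementation of devices.list, only a write to devices.deny removes entries from devices.list.
The investigation therefore focuses on systemd, which does write to devices.deny. systemd handles the cgroup devices.list as follows: it first denies permissions for all device nodes, then writes the allow-listed device nodes back: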
vim systemd-249/src/core/cgroup.c

static void cgroup_context_apply()
    -> cgroup_apply_devices(Unit *u)
        -> r = cg_set_attribute("devices", path, "devices.deny", "a");
        -> LIST_FOREACH(device_allow, a, c->device_allow) 
            -> r = bpf_devices_allow_list_device(prog, path, a->path, acc);
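
The effect of this deny-all-then-allow sequence can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel or systemd implementation: the class `DevicesCgroupModel` and the function `systemd_reapply` are hypothetical, the model ignores cgroup hierarchy and rule merging, and it assumes the GPU entries were written directly to the cgroup by another program (for example nvidia-container, which appears in the bpftrace log) rather than through the unit's DeviceAllow list.

```python
class DevicesCgroupModel:
    """Toy model of a cgroup v1 devices controller (not the kernel implementation)."""

    def __init__(self):
        self.rules = set()

    def allow(self, rule):
        # Models a write to devices.allow: the entry shows up in devices.list.
        self.rules.add(rule)

    def deny(self, rule):
        # Models a write to devices.deny: "a" clears everything, otherwise one entry.
        if rule == "a":
            self.rules.clear()
        else:
            self.rules.discard(rule)

    def entries(self):
        # Models reading devices.list.
        return sorted(self.rules)


def systemd_reapply(cg, device_allow):
    """Mimic the sequence in the call chain above: deny all, then re-allow the unit's list."""
    cg.deny("a")
    for rule in device_allow:
        cg.allow(rule)


cg = DevicesCgroupModel()
unit_device_allow = ["b *:* m", "c *:* m"]   # what the unit's configuration contains
systemd_reapply(cg, unit_device_allow)
cg.allow("c 555:0 rw")                        # assumed to be written outside systemd
cg.allow("c 555:1 rw")
before = cg.entries()
systemd_reapply(cg, unit_device_allow)        # e.g. triggered by daemon-reload
after = cg.entries()
print(before)
print(after)
```

In this model, any entry that systemd does not hold in its own allow list is gone after the second pass, which matches the abnormal devices.list shown earlier.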


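Analysing systemd's device handling: first examine the most likely path, namely that systemd drops the GPU device nodes when it rewrites the cgroup devices.list.
Start with what the device information maintained by systemd looks like. For a container created by runc, systemd creates a unit named cri-containerd-<container instance ID>.scope, which maintains the cpuset, memory, and other settings of the container instance's cgroup. Specifically, runc accesses D-Bus as follows: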

StartTransientUnit -> set resources
SetUnitProperties -> update resources
StopUnit -> delete the unit
Interfaces:
org.freedesktop.systemd1.Manager.SetUnitProperties()
org.freedesktop.systemd1.Manager.StartTransientUnit()
org.freedesktop.systemd1.Manager.StopUnit()


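Building a test environment:
For container instances that Kubernetes creates through crictl, systemd is called to create a corresponding scope unit for each instance. Manually created containers do not get such a scope unit, so the test environment has to mimic the Kubernetes behaviour by creating a scope unit for the container instance.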

sudo ctr -n k8s.io run --runc-binary /usr/bin/nvidia-container-runtime --rm --tty --env NVIDIA_VISIBLE_DEVICES=3 --env KGPU_MEM_DEV=23028 --env KGPU_SCHD_WEIGHT=0 --env KGPU_MEM_CONTAINER=10000 container-images test-cgroup-devices-9 /bin/bash

Running nvidia-smi inside the container can access the Nvidia device nodes.
Construct a D-Bus call that makes systemd create a scope unit for test-cgroup-devices-9 with DevicePolicy set to strict:

sudo gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --method org.freedesktop.systemd1.Manager.StartTransientUnit 'test-cgroup-devices-9.scope' 'replace' "[
  ('Description', <'Test Scope'>),
  ('Delegate', <true>),
  ('PIDs', <[uint32 4448834]>),
  ('MemoryAccounting', <true>),
  ('CPUAccounting', <true>),
  ('BlockIOAccounting', <true>),
  ('TasksAccounting', <true>),
  ('DefaultDependencies', <false>),
  ('DevicePolicy', <'strict'>) 
]" "[]"

Run systemctl daemon-reload on the host;
nvidia-smi inside the container immediately loses access to the Nvidia devices:

At the same time, the device nodes of the other GPU container instances on the same machine are all lost as well:

# cat /sys/fs/cgroup/devices/kubepods/kubepods-burstable.slice/***/devices.list
b *:* m
c *:* m

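The verification environment above shows that having systemd reload all units causes the GPU device node permissions to be lost.
However, the D-Bus messages on CentOS machines also contain many commands that touch devices.list, so why has the loss of GPU device nodes never occurred on CentOS?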

Analysing the different behaviour of CentOS and openEuler:
Units created by systemd 219 on CentOS are in the static state by default and are skipped in the reload/mask flows:

systemctl status cri-containerd-***.scope
    Load: loaded (/run/systemd/system/cri-containerd-***.scope; static; vendor preset: disabled)

Units created by systemd 249 on openEuler are transient (https://systemd.io/TRANSIENT-SETTINGS/) and cannot be set to the static/disabled state:

ls /run/systemd/transient/cri-containerd-***.scope.d
50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUShares.conf

systemctl status cri-containerd-***.scope
    Load: loaded (/run/systemd/system/cri-containerd-***.scope; transient)
Transient: yes
    Drop-In: /run/systemd/system/cri-containerd-***.scope.d
            └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUShares.conf
    Active: active (running) since Tue 2025-04-01 19:30:44 CST; 5 days ago

# systemctl cat cri-containerd-***.scope
# /run/systemd/transient/cri-containerd-***.scope
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Unit]
Description=libcontainer container ***

[Scope]
Slice=kubepods-burstable-pod***.slice
Delegate=yes
MemoryAccounting=yes
CPUAccounting=yes
BlockIOAccounting=yes
TasksAccounting=yes

[Unit]
DefaultDependencies=no

# /run/systemd/transient/cri-containerd-***.scope.d/50-CPUShares.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
CPUShares=1

# /run/systemd/transient/cri-containerd-***.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m


# /run/systemd/transient/cri-containerd-***.scope.d/50-DevicePolicy.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DevicePolicy=strict
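
To check many scopes for the combination of DevicePolicy=strict and a DeviceAllow list without GPU entries, the `systemctl cat` output can be parsed. The following is a minimal sketch: the helper names are hypothetical, the specifier `/dev/char/555:0` is only an illustration derived from the major number 555 in the devices.list shown earlier, and class specifiers such as `char-*` are compared literally rather than expanded.

```python
def device_settings(unit_text):
    """Extract DevicePolicy and the effective DeviceAllow list from `systemctl cat` output."""
    policy = None
    allow = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip comments, blank lines, and section headers
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "DevicePolicy":
            policy = value
        elif key == "DeviceAllow":
            if value == "":
                allow = []  # an empty assignment resets the list
            else:
                allow.append(value)
    return policy, allow


def uncovered(allow, wanted, needed="rw"):
    """Return wanted device specifiers with no DeviceAllow entry granting `needed`."""
    granted = set()
    for entry in allow:
        parts = entry.split()
        # Assumption: an entry without an access field is treated as "rwm".
        access = parts[1] if len(parts) > 1 else "rwm"
        if all(flag in access for flag in needed):
            granted.add(parts[0])
    return [dev for dev in wanted if dev not in granted]


# Settings copied from the drop-in files listed above.
sample = """\
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
[Scope]
DevicePolicy=strict
"""
policy, allow = device_settings(sample)
print(policy, allow)
print(uncovered(allow, ["/dev/char/555:0", "/dev/char/5:0"]))
```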

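Verifying the conclusion:
It is established that, while systemd manages the device nodes, running daemon-reload or daemon-reexec on systemd, or running commands such as mask or preset on the container instance's unit, causes the container to lose its GPU.
Because a systemd reload or interface call overwrites the existing GPU node permissions, we periodically delete the 50-DeviceAllow.conf file of GPU container instances so that systemd no longer manages the device nodes. After one week of running this way, the device node loss has not recurred.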
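
The periodic cleanup described above could look roughly like the following. This is a sketch only: the directory layout is taken from the drop-in paths shown in the `systemctl cat` output, the function names are hypothetical, and the script defaults to a dry run that only reports what it would delete.

```python
import glob
import os


def find_device_allow_dropins(root="/run/systemd/transient"):
    """List 50-DeviceAllow.conf drop-ins of cri-containerd scope units under `root`."""
    pattern = os.path.join(root, "cri-containerd-*.scope.d", "50-DeviceAllow.conf")
    return sorted(glob.glob(pattern))


def remove_dropins(paths, dry_run=True):
    """Delete the given drop-in files; with dry_run=True only report them."""
    handled = []
    for path in paths:
        if not dry_run:
            os.remove(path)
        handled.append(path)
    return handled


if __name__ == "__main__":
    # Dry run: prints the files that a real run would delete.
    for path in remove_dropins(find_device_allow_dropins(), dry_run=True):
        print("would remove", path)
```

A real deployment would additionally need to restrict the cleanup to GPU container instances, which this sketch does not attempt.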

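Follow-up actions:

1. Can this be aligned with the CentOS behaviour so that the fix is not overly intrusive to the business side (that is, units started through the API or otherwise started temporarily, such as containers, would use the static attribute rather than transient)?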


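In your test case above, I noticed that the transient cri-containerd-***.scope unit that test-cgroup-devices-9.service belongs to has DevicePolicy configured but no DeviceAllow. In that situation, operations such as daemon-reload refresh devices.list and remove the devices that do not match the configuration.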

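My approach is:

1. First check whether the slice or scope unit that your service belongs to has DevicePolicy=strict configured without DeviceAllow.
2. If that is the case, modify the slice or scope configuration to add the relevant devices to DeviceAllow, and check whether that resolves the problem.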
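
One way to try step 2 on a running scope is `systemctl set-property --runtime`. The sketch below only builds the command line and does not execute it. The unit name `cri-containerd-EXAMPLE.scope` and the device paths `/dev/nvidiactl` and `/dev/nvidia0` are placeholders chosen for illustration; the actual unit and GPU device nodes must be taken from the affected host.

```python
import shlex


def set_device_allow_cmd(unit, devices, access="rw", runtime=True):
    """Build a `systemctl set-property` command that adds DeviceAllow entries to a unit."""
    cmd = ["systemctl", "set-property"]
    if runtime:
        cmd.append("--runtime")  # keep the change until the next reboot only
    cmd.append(unit)
    cmd.extend(f"DeviceAllow={dev} {access}" for dev in devices)
    return cmd


# Placeholder unit name and device paths; replace with the real ones.
cmd = set_device_allow_cmd(
    "cri-containerd-EXAMPLE.scope",
    ["/dev/nvidiactl", "/dev/nvidia0"],
)
print(shlex.join(cmd))
```

After applying such a command, running daemon-reload and then nvidia-smi inside the container would show whether the entries survive the refresh.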

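As far as I can see, only DevicePolicy=strict is configured and DeviceAllow is not; this can be verified. In production, units such as /run/systemd/transient/cri-containerd-.scope are created automatically by systemd when runc calls the D-Bus interface, so I will first add DeviceAllow manually and then verify again.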
systemctl cat cri-containerd-.scope

# /run/systemd/transient/cri-containerd-***.scope
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Unit]
Description=libcontainer container ***

[Scope]
Slice=kubepods-burstable-pod***_***_***.slice
Delegate=yes
MemoryAccounting=yes
CPUAccounting=yes
BlockIOAccounting=yes
TasksAccounting=yes

[Unit]
DefaultDependencies=no

[Scope]
DevicePolicy=strict

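Regarding item 1 of the follow-up actions: can this be aligned with CentOS by using the static attribute instead of transient, so that the business side neither notices the change nor is affected by it?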

The official documentation states that a scope is not a unit started from a unit configuration file but a transient unit.
From https://www.freedesktop.org/software/systemd/man/latest/systemd.scope.html:

Scope units are not configured via unit configuration files, but are only created programmatically using the bus interfaces of systemd.

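The systemd on CentOS mentioned above is version 219, which is a very early release. Also, it looks as though the scope on CentOS was not started? Could you confirm whether DevicePolicy is configured on CentOS?
My understanding is that, whether a scope unit is static or transient, daemon-reload reloads its configuration (that is, it calls scope_load()), which refreshes devices.list under the scope, so switching to static would not solve the actual problem.
Below is the upstream 219 code. Even in the static state, the DevicePolicy and DeviceAllow configuration is loaded again: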

static int scope_load(Unit *u) {
        Scope *s = SCOPE(u);
        int r;

        assert(s);
        assert(u->load_state == UNIT_STUB);

        if (!u->transient && UNIT(s)->manager->n_reloading <= 0)
                return -ENOENT;

        u->load_state = UNIT_LOADED;

        r = unit_load_dropin(u);
        if (r < 0)
                return r;

        r = unit_patch_contexts(u);
        if (r < 0)
                return r;
...
}
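
The guard at the top of scope_load() can be tabulated to make the argument concrete. This is a minimal sketch that mirrors only the single `if` condition from the excerpt above; the function name `scope_load_guard` is hypothetical, and nothing beyond that condition is modelled.

```python
import errno


def scope_load_guard(transient, n_reloading):
    """Mirror `if (!u->transient && manager->n_reloading <= 0) return -ENOENT;`."""
    if not transient and n_reloading <= 0:
        return -errno.ENOENT
    return 0  # loading continues: drop-ins and contexts are applied again


# Enumerate static/transient units outside and during a reload.
for transient in (False, True):
    for n_reloading in (0, 1):
        print(transient, n_reloading, scope_load_guard(transient, n_reloading))
```

The only combination that returns early is a non-transient scope outside a reload; during a reload (n_reloading > 0) both kinds of scope proceed to load their drop-ins.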
