diff --git a/Documentation/arch/arm64/cpu-feature-registers.rst b/Documentation/arch/arm64/cpu-feature-registers.rst index de6d8a4790e2b6cd06f69fe5ebe72eb69c23cb3d..14ea68bcf196ed3b24f034615865d0fb31f383f1 100644 --- a/Documentation/arch/arm64/cpu-feature-registers.rst +++ b/Documentation/arch/arm64/cpu-feature-registers.rst @@ -152,6 +152,8 @@ infrastructure: +------------------------------+---------+---------+ | DIT | [51-48] | y | +------------------------------+---------+---------+ + | MPAM | [43-40] | n | + +------------------------------+---------+---------+ | SVE | [35-32] | y | +------------------------------+---------+---------+ | GIC | [27-24] | n | diff --git a/Documentation/arch/arm64/mpam.md b/Documentation/arch/arm64/mpam.md new file mode 100644 index 0000000000000000000000000000000000000000..60f62cd7f78136bc52f8155de9ddd6d5fb6ca68d --- /dev/null +++ b/Documentation/arch/arm64/mpam.md @@ -0,0 +1,587 @@ +MPAM 用户手册 +========= + +# 1 MPAM 简介 + +MPAM(Memory system component Partitioning and Monitoring)是Arm Architecture v8.4的拓展特性。用于解决服务器系统中,混合部署不同类型业务时,由于共享资源的竞争(L3/L2 Cache,MATA),而带来的关键应用性能下降或者系统整体性能下降问题。 + +MPAM的应用可针对不同业务,将同时作用于硬件访存路径上产生的竞争和冲突进行隔离控制,从而帮助提升服务器利用率,降低服务部署成本。 + +本手册只适用于OLK-6.6软件版本。 + +# 2 内核编译选项 + +配置 CONFIG_ARM64_MPAM=y 后,即使能MPAM完整功能。 + +# 3 内核启动参数 + +启动流程默认关闭MPAM,启动MPAM初始化需要**cmdline**添加**arm64.mpam**参数配置后重启机器。 + +# 3 接口总览 + +MPAM功能通过resctrl文件系统呈现,挂载点位于 */sys/fs/resctrl* 。系统启动后,需要手动挂载resctrl文件系统。 + +## 3.1 系统挂载参数 + +resctrl可以通过添加挂载参数支持多种挂载方式,具体指令如下: + +~~~ +# mount -t resctrl resctrl [-o cdp[,cdpl2][,debug][,l2]] /sys/fs/resctrl +~~~ + +挂载参数包括: + +* cdp: 针对L3缓存,根据访问指令和访问数据分别配置。 +* cdpl2: 针对L2缓存,根据访问指令和访问数据分别配置。 +* debug: 使能调试接口访问。 +* l2: 使能L2缓存配置和监控功能,默认关闭MPAM L2功能。 + +## 3.2 resctrl 系统目录介绍 + +### 3.2.1 Info 目录 + +Info 目录包含有关已启用资源的信息,每个资源都有其自己的子目录,子目录的名称反映了资源的名称。 + +每个子目录包含以下与分配相关的文件: + +缓存资源(L3/L2)子目录包含以下与分配相关的文件: + +**num_closids**: 适用于该资源的有效CLOSID(Class of Service ID)数量。内核会以所有已启用资源中最小的CLOSID数量作为限制。 + +**cbm_mask**: 适用于该资源的有效位掩码(bitmask)。 + 
+**min_cbm_bits**: 写入掩码时必须设置的连续位的最小数量。 + +**shareable_bits**: 与其他执行实体共享资源的位掩码。用户在设置独占缓存分区时可以使用此字段。 + +**bit_usage**: 标注了资源所有实例的使用情况的容量位掩码。说明如下: + + * *0*: 对应区域未使用。当系统的资源已被分配,且在“bit_usage”中发现“0”时,这表明资源被浪费了。 + + * *H*: 对应区域仅由硬件使用,但可供软件使用。如果资源的“shareable_bits”中有位被设置,但这些位并未全部出现在资源组的分配方案中,则“shareable_bits”中出现但未分配给资源组的位将被标记为“H”。 + + * *X*: 对应区域可供共享,并被硬件和软件使用。这些位同时出现在“shareable_bits”和资源组的分配中。 + + * *S*: 对应区域由软件使用,并可供共享。 + + * *E*: 对应区域被一个资源组独占使用,不允许共享。 + +**sparse_masks**: 指示是否支持CBM(Capacity Bit Mask)中的非连续1值。 + + * *0*: 仅支持CBM中的连续1值。 + + * *1*: 支持CBM中的非连续1值。 + +MB(Memory bandwidth,内存带宽)子目录包含以下与分配相关的文件: + +**min_bandwidth**: 用户可以请求的最小内存带宽百分比。 + +**bandwidth_gran**: 内存带宽百分比分配的粒度。分配的带宽百分比会四舍五入到硬件上可用的下一个控制步长。可用的带宽控制步长为: +~~~ +min_bandwidth + N * bandwidth_gran +~~~ + +**delay_linear**: 指示延迟刻度是线性还是非线性的。该字段仅用于信息参考。 + +如果支持监控功能,则会存在一个名为 L3_MON 和 MB_MON 的目录,其中包含以下文件: + +**num_rmids**: 可用的RMID(Resource Monitoring ID)数量。这是可以创建的“CTRL_MON”+“MON”组的最大数量。 + +**mon_features**: 如果为该资源启用了监控功能,则列出监控事件。例如: +~~~ +# grep . 
/sys/fs/resctrl/info/*_MON/mon_features +/sys/fs/resctrl/info/L3_MON/mon_features:llc_occupancy +/sys/fs/resctrl/info/MB_MON/mon_features:mbm_total_bytes +~~~ + +**max_threshold_occupancy**: 读/写文件,提供一个最大值(以字节为单位),低于此设定值下,之前使用过的LLC_occupancy计数器可以被考虑重新分配使用。 + +> 请注意,一旦释放了RMID(资源监控ID),它可能不会立即可用,因为RMID仍然与之前使用该RMID的缓存行相关联。因此,这些RMID会被放入一个“待定”列表中,并在缓存占用量降低后再次检查。如果系统中存在大量处于“待定”状态的RMID,但它们尚未准备好被使用,用户在执行mkdir操作时可能会看到-EBUSY错误。 + +> max_threshold_occupancy是一个用户可配置的值,用于确定在什么占用量下可以释放一个RMID。 + +最后,在 Info 目录的顶层有一个名为 **last_cmd_status** 的文件。每次通过文件系统发出“命令”(例如创建新目录或写入任何控制文件)时,该文件都会被重置。如果命令成功,文件内容将显示为“ok”。如果命令失败,它将提供比文件操作错误返回更详细的信息。例如: +~~~ +# echo "MB:1=110" > schemata +-bash: echo: write error: Invalid argument +# cat /sys/fs/resctrl/info/last_cmd_status +MB value 110 out of range [0,100] +~~~ + +### 3.2.2 资源分配和监控 + +资源组在resctrl文件系统中以目录的形式表示。默认组是根目录,刚挂载后,它拥有系统中的所有任务和CPU,并且可以充分利用所有资源。 + +在支持RDT(资源分配技术)控制功能的系统中,可以在根目录下创建额外的目录,这些目录指定了每种资源的不同数量(参见下面的“schemata”)。根目录和这些额外的顶级目录在下文中被称为“CTRL_MON”组。 + +在支持RDT监控的系统中,根目录和其他顶级目录中包含一个名为“mon_groups”的目录,在其中可以创建额外的目录来监控其父“CTRL_MON”组中任务的子集。这些在本文档的其余部分被称为“MON”组。 + +删除一个目录会将其所代表的组拥有的所有任务和CPU移动到其父目录。删除一个创建的“CTRL_MON”组将自动删除其下所有的“MON”组。 + +支持将“MON”组目录移动到一个新的父“CTRL_MON”组,以便在不影响其监控数据或分配的任务的情况下更改“MON”组的资源分配。此操作不适用于监控CPU的“MON”组。目前,除了简单地重命名“CTRL_MON”或“MON”组之外,不支持其他任何移动操作。 + +所有组包含以下文件: + +**tasks**: 读取此文件将显示属于该组的所有任务列表。将任务ID写入该文件会将任务添加到该组中。可以通过用逗号分隔任务ID来添加多个任务。任务将按顺序分配。在尝试分配任务时遇到的任何单个失败都会导致操作中止,而在失败之前已添加到组中的任务将保留在组中。失败信息将记录到/sys/fs/resctrl/info/last_cmd_status。 + +如果该组是一个“CTRL_MON”组,则任务将从之前拥有该任务的“CTRL_MON”组中移除,同时也会从任何拥有该任务的“MON”组中移除。如果该组是一个“MON”组,则任务必须已经属于该组的“CTRL_MON”父组。任务将从任何之前的“MON”组中移除。 + +**cpus**: 读取此文件将显示该组拥有的逻辑CPU的位掩码。将掩码写入该文件将向该组添加或移除CPU。与“tasks”文件类似,维护了一个层级结构,其中“MON”组只能包含其父“CTRL_MON”组拥有的CPU。 + +**cpus_list**: 与“cpus”类似,但使用CPU范围而不是位掩码。 + +当启用控制功能时,所有“CTRL_MON”组还将包含以下文件: + +**schemata**: 列出该组可用的所有资源。每种资源都有自己的行和格式——详细信息请参见下文。 + +**size**: 类似于“schemata”文件的显示,但显示的是每种资源分配的字节大小,而不是表示分配的位。 + +**mode**: 
资源组的“mode”决定了其分配的共享方式。“shareable”资源组允许共享其分配,而“exclusive”资源组则不允许。 + +**ctrl_hw_id**: 仅在启用调试选项时可用。硬件用于控制组的标识符。在arm64架构上,即是 PARTID。 + +当启用监控功能时,所有“MON”组还将包含以下文件: + +**mon_data**: 包含一组按L3域和MB域事件组织的文件。例如,在具有两个L3域的系统中,将存在子目录“mon_L3_00”和“mon_L3_01”。 + +每个子目录中都有一个文件对应每个事件(例如“llc_occupancy”、“mbm_total_bytes”)。在“MON”组中,这些文件提供了组中所有任务的当前事件值。在“CTRL_MON”组中,这些文件提供了“CTRL_MON”组中所有任务以及所有“MON”组中任务的总和。有关使用方法的更多详细信息,请参见示例部分。 + +**mon_hw_id**: 仅在启用调试选项时可用。硬件用于监控组的标识符。 + +以下为resctrl文件系统目录树: + +~~~ +/sys/fs/resctrl(根分组) + ├── cpus # bitmask方式显示根分组关联的vcpu + ├── cpus_list # cpu list方式显示根分组关联的vcpu + ├── ctrl_hw_id # 硬件用于控制组的标识符 + ├── info # 用于显示属性信息及错误提示信息 + │ ├── L3 + │   │   ├── bit_usage # 标注了资源所有实例的使用情况的容量位掩码 + │ │ ├── cbm_mask # 系统所支持的最大cache way bitmask,一个bit代表一个cache way + │ │ ├── min_cbm_bits # 使用schemata所能配置的最小cache way bitmask + │ │ ├── num_closids # L3能够提供创建控制组的最大数量 + │   │   ├── shareable_bits # 当前所有cbm_mask全部shareable,支持后续扩展 + │   │   └── sparse_masks # 指示是否支持CBM(Capacity Bit Mask)中的非连续1值 + │ ├── L3_MON + │   │   ├── max_threshold_occupancy # 低于此设定值下,之前使用过的LLC_occupancy计数器可以被考虑重新分配使用 + │   │   ├── mon_features # 列出监控事件 + │   │   └── num_rmids # 可创建控制组和监控组的总数 + │ ├── last_cmd_status # 操作错误提示 + │ ├── MB + │ │ ├── bandwidth_gran # 带宽百分比配置粒度 + │ │ ├── delay_linear # 指示延迟刻度是线性还是非线性的 + │ │ ├── min_bandwidth # 最小带宽配置百分比 + │ │ └── num_closids # 同L3 num_closid + │ └── MB_MON + │ ├── mon_features # 同L3 mon_features + │ └── num_rmids # 同L3 num_rmids + ├── mode # 资源组的 mode 决定了其分配的共享方式 + ├── mon_data + │ ├── mon_L3_01 # 标号代表L3 cache id + │   │   └── llc_occupancy # 表示当前分组所关联的pid/vcpu在该区域上实际占用L3 Cache大小,下同 + │ ├── mon_L3_122 + │   │   └── llc_occupancy + │ ├── mon_MB_00 # 标号代表numa id + │   │   └── mbm_total_bytes # 表示当前分组所关联的pid/vcpu在该区域上内存带宽流量大小,下同 + │ └── mon_MB_01 + │   └── mbm_total_bytes + ├── mon_groups # 创建监控组目录 + ├── mon_hw_id # 硬件用于监控组的标识符 + ├── schemata # 资源使用配置接口 + ├── size # 显示的是每种资源分配的字节大小 + └── tasks # 显示与根组关联的pid +~~~ + +### 3.2.3 控制组配置接口 Schemata 文件 + 
+**schemata**文件中的每一行描述一种资源。每行以资源的名称开头,后面跟着该资源在系统中每个实例上要应用的具体值。 + +#### Cache IDs + +在当前一代的系统中,每个插槽(socket)有一个L3缓存,而L2缓存通常仅由一个核心上的超线程共享,但这并不是架构上的强制要求。我们可能会在一个插槽上有多个独立的L3缓存,或者多个核心共享一个L2缓存。因此,我们不使用“插槽”或“核心”来定义共享资源的逻辑CPU集合,而是使用“Cache ID”(缓存ID)。 + +在给定的缓存级别上,这将在整个系统中是一个唯一的数字(但不能保证是一个连续的序列,可能会有间隔)。要查找每个逻辑CPU的ID,请查看 /sys/devices/system/cpu/cpu*/cache/index*/id。 + +#### Cache Bit Masks(CBM,缓存位掩码) + +对于缓存资源,我们使用位掩码来描述可用于分配的缓存部分。掩码的最大值由每种CPU型号定义(并且可能因不同的缓存级别而异)。该值在resctrl文件系统的“info”目录中提供,位于info/{resource}/cbm_mask。 + +#### L3 缓存配置 + +当未启用CDP(代码/数据缓存划分)时,L3 schemata的格式为: + +~~~ +L3:=;=;... +~~~ + +当启用CDP时,L3控制被拆分为两个独立的资源,因此您可以分别为代码和数据指定独立的掩码,如下所示: + +~~~ +L3DATA:=;=;... +L3CODE:=;=;... +~~~ + +读取schemata文件将显示所有域上所有资源的状态。在写入时,您只需要指定您希望更改的值。 + +例如使用默认方式挂载,设置L3缓存位掩码只有4位: + +~~~ +# mount -t resctrl resctrl /sys/fs/resctrl/ +# cat /sys/fs/resctrl/schemata +L3:1=fffffff;122=fffffff + +# echo "L3:122=3c0;" > /sys/fs/resctrl/schemata +# cat /sys/fs/resctrl/schemata +L3:1=fffffff;122=00003c0 +~~~ + +使用开启CDP方式挂载,设置L3 data缓存位掩码只有4位: + +~~~ +# mount -t resctrl resctrl /sys/fs/resctrl/ -o cdp +# cat /sys/fs/resctrl/schemata +L3DATA:1=fffffff;122=fffffff +L3CODE:1=fffffff;122=fffffff + +# echo "L3DATA:122=3c0;" > schemata +# cat /sys/fs/resctrl/schemata +L3DATA:1=fffffff;122=00003c0 +L3CODE:1=fffffff;122=fffffff +~~~ + +#### L2 缓存配置 + +L2 缓存配置功能默认关闭,需要通过显式添加 l2 挂载参数,才会使能L2缓存配置功能。使能L2功能以后,系统关闭 cpuidle powerdown 功能和cpu下线功能。 + +L2 schemata的格式是: + +~~~ +L2:=;=;... +~~~ + +使用“cdpl2”挂载选项可以在L2上支持CDP: + +~~~ +L2DATA:=;=;... +L2CODE:=;=;... 
+~~~ + +L2 缓存配置示例,设置L2缓存位掩码只有4位: + +~~~ +# mount -t resctrl resctrl /sys/fs/resctrl/ -o l2 +# cat schemata +L2:4=000ff;8=000ff + +# echo "L2:4=f;" > schemata +# cat schemata +L2:4=0000f;8=000ff +~~~ + +使用“cdpl2”挂载选项: + +~~~ +# mount -t resctrl resctrl /sys/fs/resctrl/ -o cdpl2,l2 +# cat schemata +L2DATA:4=000ff;8=000ff +L2CODE:4=000ff;8=000ff + +# echo "L2DATA:4=f;" > schemata +# cat schemata +L2DATA:4=0000f;8=000ff # 控制组 L2 DATA 只能使用L2 cache4的4个cache way +L2CODE:4=000ff;8=000ff +~~~ + +#### MB 内存带宽分配 + +对于内存带宽资源,默认情况下,用户通过指定总内存带宽的百分比来控制该资源。 + +每种CPU型号的最小带宽百分比值是预定义的,可以通过info/MB/min_bandwidth查询。分配的带宽粒度也取决于CPU型号,可以在info/MB/bandwidth_gran中查询。可用的带宽控制步长为:min_bw + N * bw_gran。中间值会被四舍五入到硬件上可用的下一个控制步长。 + +MB schemata的格式是: + +~~~ +MB:=bandwidth0;=bandwidth1;... +~~~ + +MB 带宽配置示例: + +~~~ +# cat /sys/fs/resctrl/schemata +MB:0=0000100;1=0000100 + +# echo "MB:0=50" > /sys/fs/resctrl/schemata +# cat /sys/fs/resctrl/schemata +MB:0=0000050;1=0000100 # 降低控制组 MB 内存带宽使用上限为50% +~~~ + +### 3.2.4 资源分配规则 + +当一个任务正在运行时,以下规则定义了它可用的资源: +1. 如果任务属于一个非默认组,则使用该组的分配方案(schemata)。 +2. 否则,如果任务属于默认组,但运行在一个被分配给某个特定组的CPU上,则使用该CPU所属组的分配方案。 +3. 否则,使用默认组的分配方案。 + +### 3.2.5 监控组配置方法 + +读取控制组和监控组的监控数据,可通过 mon_data 目录接口读取监控数据: + +~~~ +# grep . mon_data/*/* +mon_data/mon_L3_01/llc_occupancy:73276416 +mon_data/mon_L3_122/llc_occupancy:11875328 +mon_data/mon_MB_00/mbm_total_bytes:32806 +mon_data/mon_MB_01/mbm_total_bytes:31700 +~~~ + +其中,mon_data读取监控数据文件分别: + * llc_occupancy 代表 L3缓存当前占用量,单位 Byte + * mbm_total_bytes 代表内存带宽瞬时流量,单位 MB/s + +支持在控制组下创建子监控组,监控父控制组监控对象的子集: + +~~~ +# cd /sys/fs/resctrl/p1 +# cd mon_groups/ && mkdir m1 # 监控组只能监控,m1分组资源配置跟随p1分组 +# echo '0-1' > cpus_list +# grep . 
mon_data/mon_*/* +mon_data/mon_L3_01/llc_occupancy:18432 +mon_data/mon_L3_122/llc_occupancy:1024 +mon_data/mon_MB_00/mbm_total_bytes:0 +mon_data/mon_MB_01/mbm_total_bytes:0 +~~~ + +控制组监控的是控制组本身及所有子监控组的监控值之和。 + +## 3.3 QoS 增强特性 + +### 3.3.1 PRI 优先级设置 + +对共享资源优先级进行配置,包括 L3PRI 和 MBPRI: + +~~~ +# cat schemata + MBPRI:0=0000007;1=0000007 + L3PRI:1=0000003;122=0000003 + +# echo "MBPRI:0=0000003" > schemata +# cat schemata + MBPRI:0=0000003;1=0000007 # 降低控制组 MB numa0的优先级 + L3PRI:1=0000003;122=0000003 + +# echo "MBPRI:0=0000003" > schemata + +~~~ + +优先级设置数字越大,即优先级越高,反之,数字越小,优先级越低。 + +> MBPRI 默认值为 3,MBPRI合法值范围 [0,7]。L3PRI 默认值为 0,L3PRI合法值范围 [0,3]。 + +### 3.3.2 MIN 限低值设置 + +共享资源实际使用占比低于设置值,会自动提高对该资源使用优先级,包括 L3MIN 和 MBMIN: + +~~~ +# cat schemata + MBMIN:0=00100 + L3MIN:1=00100;5=00100 + +# echo "MBMIN:0=00050" > schemata +# cat schemata + MBMIN:0=00050 + L3MIN:1=00100;5=00100 + +# echo "L3MIN:1=00050" > schemata +# cat schemata + MBMIN:0=00050 + L3MIN:1=00050;5=00100 +~~~ + +L3MIN 和 MBMIN 接口接受的输入参数为百分比,即设置值与总资源(内存带宽/缓存占用量)的占比。 + +> MBMIN 和 L3MIN 默认值为 0,合法值范围都为 [0,100]。 + +### 3.3.3 HDL 强制隔离设置 + +当MBHDL=1,限制MB共享资源使用量不能超出MB设置值,若MBHDL=0,则允许空闲情况下,MB共享资源使用量超过MB设置值: + +~~~ +# cat schemata + MBHDL:0=0000001;1=0000001 + +# echo "MBHDL:0=0000000" > schemata +# cat schemata + MBHDL:0=0000000;1=0000001 # 关闭MB在numa0上的强制限制功能 +~~~ + +> MBHDL 默认值为 1,合法值范围 [0,1]。 + +### 3.3.4 MAX 资源上限设置 + +设置允许分配的缓存容量的最大百分比,包括 L3MAX 接口: + +~~~ +# cat schemata + L3MAX:1=00100;5=00100 + +# echo "L3MAX:1=00050" > schemata +# cat schemata + L3MAX:1=00050;5=00100 # 降低L3分配的缓存最大容量百分比 +~~~ + +> MB 和 L3MAX 默认值为 100,MB合法值范围 [1,100],L3MAX合法值范围 [0,100]。 + +## 3.4 外设 IO 流量管控 + +### 3.4.1 控制组绑定外设 + +MPAM 提供通过绑定 iommu_group ID,对设备IO流量进行带宽限制和监控。 + +譬如,控制网卡设备 eno2 ,首先查找该设备 PCI_SLOT 信息: + +~~~ +# cat /sys/class/net/eno2/device/uevent | grep PCI_SLOT +PCI_SLOT_NAME=0000:35:00.1 +~~~ + +或者通过 ethtool 工具查看 bus-info 信息: + +~~~ +# ethtool -i eno2 | grep bus-info +bus-info: 0000:35:00.1 +~~~ + +按照设备总线信息,查找到该设备所属的 iommu_group : + +~~~ 
+# find /sys/kernel/iommu_groups/ -name "0000:35:00.1" +/sys/kernel/iommu_groups/17/devices/0000:35:00.1 +~~~ + +或者通过 lspci 工具查看 iommu_group 信息: + +~~~ +# lspci -vvv -s 0000:35:00.1 | grep "IOMMU group" + IOMMU group: 17 +~~~ + +将查询到的group id 17,通过 tasks 接口绑定到指定的控制组下: + +~~~ +# cd /sys/fs/resctrl/p1/ +# echo "iommu_group:17" > tasks +# cat tasks +iommu_group:17 # 此时iommu_group 17 已被绑定到控制组p1 +~~~ + +### 3.4.2 查看外设带宽流量 + +控制组绑定外设所属的iommu组后,可以查看设备流量带宽: + +~~~ +# grep . mon_data/mon_MB_0*/* +mon_data/mon_MB_00/mbm_total_bytes:0 +mon_data/mon_MB_01/mbm_total_bytes:4230 +~~~ + +### 3.4.3 配置外设带宽流量 + +通过MB配置接口,可以实现限制设备的流量带宽: + +~~~ +# echo "MB:1=0000001" > schemata +# cat schemata + MB:0=0000100;1=0000001 + +# grep . mon_data/*/* +mon_data/mon_MB_00/mbm_total_bytes:0 +mon_data/mon_MB_01/mbm_total_bytes:1208 +~~~ + +# 4 控制组和监控组配置使用示例 + +/sys/fs/resctrl 默认为根分组,根分组可以创建若干个控制组,一个控制组既可以关联一组pid/tid,也可以关联一组cpu集合。 + +创建一个新的控制组,关联 pid/tid: + +~~~ +# cd /sys/fs/resctrl/ && mkdir p1 +# cd p1 && echo $$ > tasks # 关联当前shell进程pid到p1组 +# cat tasks # 可查看成功关联的pid +29190 +29607 +~~~ + +也可以选择关联 cpu: + +~~~ +# cd p1 && echo '0-1' > cpus_list +# cat cpus_list # 可查看关联的cpu +0-1 +~~~ + +查看可创建的控制组和监控组数量: + +~~~ +# cat info/L3/num_closids # 可在info对应的资源的目录下查看closid数量,即可以创建控制组的数量 +32 +# cat info/MB/num_closids +32 +# cat info/L3_MON/num_rmids # 可以创建控制组和监控组的总数量 +128 +# cat info/MB_MON/num_rmids +128 +~~~ + +通过配置控制组可达到隔离 L3 Cache/Memory Bandwidth 效果,通过读取对应分组mon_data接口可以获取该组资源占用情况,比如对一个控制组限制L3 Cache使用: + +~~~ +# cat info/L3/cbm_mask # 查看info目录下对应资源的属性 +fffffff +# cat info/L3/min_cbm_bits +1 + +# cd /sys/fs/resctrl/p1 +# cat schemata + MB:0=0000100;1=0000100 + L3:1=fffffff;122=fffffff + +# echo 'L3:1=1' > schemata # 配置1条cache way给p1分组 +# cat schemata + MB:0=0000100;1=0000100 # 若此时该组关联pid/cpu,那么该pid/cpu产生的访存请求只会分配到这条cache + L3:1=0000001;122=fffffff +~~~ + +对控制组限制MB使用: + +~~~ +# cat info/MB/min_bandwidth # 和配置L3类似,也可以查看MB的相关信息 +1 +# cat info/MB/bandwidth_gran +1 # 可知,配置带宽最小百分比是1%,颗粒度是1% +# cat schemata + 
MB:0=0000100;1=0000100 + L3:1=0000001;122=fffffff +# echo 'MB:0=1' > schemata +# cat schemata + MB:0=0000001;1=0000100 + L3:1=0000001;122=fffffff +~~~ + +支持在控制组下创建子监控组: + +~~~ +# cd /sys/fs/resctrl/p1 +# cd mon_groups/ && mkdir m1 # 监控组只能监控,m1分组资源配置跟随p1分组 +# ls m1/ +cpus cpus_list mon_data tasks +# echo '0-1' > cpus_list +# cat cpus_list +0-1 +# grep . mon_data/mon_*/* +mon_data/mon_L3_01/llc_occupancy:18432 +mon_data/mon_L3_122/llc_occupancy:1024 +mon_data/mon_MB_00/mbm_total_bytes:0 +mon_data/mon_MB_01/mbm_total_bytes:0 +~~~ diff --git a/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml b/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml new file mode 100644 index 0000000000000000000000000000000000000000..9d542ecb1a7d6c3e3ec820179de29f1dd4259bb4 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml @@ -0,0 +1,227 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/arm/arm,mpam-msc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Arm Memory System Resource Partitioning and Monitoring (MPAM) + +description: | + The Arm MPAM specification can be found here: + + https://developer.arm.com/documentation/ddi0598/latest + +maintainers: + - Rob Herring + +properties: + compatible: + items: + - const: arm,mpam-msc # Further details are discoverable + - const: arm,mpam-memory-controller-msc + + reg: + maxItems: 1 + description: A memory region containing registers as defined in the MPAM + specification. + + interrupts: + minItems: 1 + items: + - description: error (optional) + - description: overflow (optional, only for monitoring) + + interrupt-names: + oneOf: + - items: + - enum: [ error, overflow ] + - items: + - const: error + - const: overflow + + arm,not-ready-us: + description: The maximum time in microseconds for monitoring data to be + accurate after a settings change. 
For more information, see the + Not-Ready (NRDY) bit description in the MPAM specification. + + numa-node-id: true # see NUMA binding + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +patternProperties: + '^ris@[0-9a-f]$': + type: object + additionalProperties: false + description: | + RIS nodes for each RIS in an MSC. These nodes are required for each RIS + implementing known MPAM controls + + properties: + compatible: + enum: + # Bulk storage for cache + - arm,mpam-cache + # Memory bandwidth + - arm,mpam-memory + + reg: + minimum: 0 + maximum: 0xf + + cpus: + $ref: '/schemas/types.yaml#/definitions/phandle-array' + description: + Phandle(s) to the CPU node(s) this RIS belongs to. By default, the parent + device's affinity is used. + + arm,mpam-device: + $ref: '/schemas/types.yaml#/definitions/phandle' + description: + By default, the MPAM enabled device associated with a RIS is the MSC's + parent node. It is possible for each RIS to be associated with different + devices in which case 'arm,mpam-device' should be used. 
+ + required: + - compatible + - reg + +required: + - compatible + - reg + +dependencies: + interrupts: [ interrupt-names ] + +additionalProperties: false + +examples: + - | + /* + cpus { + cpu@0 { + next-level-cache = <&L2_0>; + }; + cpu@100 { + next-level-cache = <&L2_1>; + }; + }; + */ + L2_0: cache-controller-0 { + compatible = "cache"; + cache-level = <2>; + cache-unified; + next-level-cache = <&L3>; + + }; + + L2_1: cache-controller-1 { + compatible = "cache"; + cache-level = <2>; + cache-unified; + next-level-cache = <&L3>; + + }; + + L3: cache-controller@30000000 { + compatible = "arm,dsu-l3-cache", "cache"; + cache-level = <3>; + cache-unified; + + ranges = <0x0 0x30000000 0x800000>; + #address-cells = <1>; + #size-cells = <1>; + + msc@10000 { + compatible = "arm,mpam-msc"; + + /* CPU affinity implied by parent cache node's */ + reg = <0x10000 0x2000>; + interrupts = <1>, <2>; + interrupt-names = "error", "overflow"; + arm,not-ready-us = <1>; + }; + }; + + mem: memory-controller@20000 { + compatible = "foo,a-memory-controller"; + reg = <0x20000 0x1000>; + + #address-cells = <1>; + #size-cells = <1>; + ranges; + + msc@21000 { + compatible = "arm,mpam-memory-controller-msc", "arm,mpam-msc"; + reg = <0x21000 0x1000>; + interrupts = <3>; + interrupt-names = "error"; + arm,not-ready-us = <1>; + numa-node-id = <1>; + }; + }; + + iommu@40000 { + reg = <0x40000 0x1000>; + + ranges; + #address-cells = <1>; + #size-cells = <1>; + + msc@41000 { + compatible = "arm,mpam-msc"; + reg = <0 0x1000>; + interrupts = <5>, <6>; + interrupt-names = "error", "overflow"; + arm,not-ready-us = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + ris@2 { + compatible = "arm,mpam-cache"; + reg = <0>; + // TODO: How to map to device(s)? 
+ }; + }; + }; + + msc@80000 { + compatible = "foo,a-standalone-msc"; + reg = <0x80000 0x1000>; + + clocks = <&clks 123>; + + ranges; + #address-cells = <1>; + #size-cells = <1>; + + msc@10000 { + compatible = "arm,mpam-msc"; + + reg = <0x10000 0x2000>; + interrupts = <7>; + interrupt-names = "overflow"; + arm,not-ready-us = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + ris@0 { + compatible = "arm,mpam-cache"; + reg = <0>; + arm,mpam-device = <&L2_0>; + }; + + ris@1 { + compatible = "arm,mpam-memory"; + reg = <1>; + arm,mpam-device = <&mem>; + }; + }; + }; + +... diff --git a/arch/Kconfig b/arch/Kconfig index d6db5522656e17c59818d88b4ea8863d42f7a994..53424408375459fd1f6cb8f53b45448ff5daf13d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1345,6 +1345,14 @@ config STRICT_MODULE_RWX config ARCH_HAS_PHYS_TO_DMA bool +config ARCH_HAS_CPU_RESCTRL + bool + help + The 'resctrl' filesystem allows cpu controls of shared resources + such as caches and memory bandwidth to be configured. An architecture + selects this if it provides the arch-specific hooks for the filesystem + and needs the per-task closid/rmid properties. + config HAVE_ARCH_COMPILER_H bool help diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 5d360a4bb3b3c9c0bc89d06a56d2152db7334939..d2680c8a4e18c8fb14b4fc8940969164ba6067cb 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2099,7 +2099,27 @@ config ARM64_TLB_RANGE The feature introduces new assembly instructions, and they were support when binutils >= 2.30. -endmenu # "ARMv8.4 architectural features" +config ARM64_MPAM + bool "Enable support for MPAM" + select ACPI_MPAM if ACPI + select ARCH_HAS_CPU_RESCTRL + select RESCTRL_FS + help + Memory Partitioning and Monitoring is an optional extension + that allows the CPUs to mark load and store transactions with + labels for partition-id and performance-monitoring-group. + System components, such as the caches, can use the partition-id + to apply a performance policy. 
MPAM monitors can use the + partition-id and performance-monitoring-group to measure the + cache occupancy or data throughput. + + Use of this extension requires CPU support, support in the + memory system components (MSC), and a description from firmware + of where the MSC are in the address space. + + MPAM is exposed to user-space via the resctrl pseudo filesystem. + +endmenu menu "ARMv8.5 architectural features" diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h index e749838b9c5d4e03517f0e9261f3c6740555845c..1cb5bafd9238fed07a7213f7309aa3c1f962cd47 100644 --- a/arch/arm64/include/asm/cpu.h +++ b/arch/arm64/include/asm/cpu.h @@ -47,6 +47,7 @@ struct cpuinfo_arm64 { u64 reg_revidr; u64 reg_gmid; u64 reg_smidr; + u64 reg_mpamidr; u64 reg_id_aa64dfr0; u64 reg_id_aa64dfr1; diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 3eba6e1855b28a3a6dc167984eb8c562695cb076..87752218051c4a2bc6ca5758dd214aaa571af4f3 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -620,6 +620,20 @@ static inline bool id_aa64pfr1_sme(u64 pfr1) return val > 0; } +static inline bool id_aa64pfr0_mpam(u64 pfr0) +{ + u32 val = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT); + + return val > 0; +} + +static inline bool id_aa64pfr1_mpamfrac(u64 pfr1) +{ + u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MPAM_frac_SHIFT); + + return val > 0; +} + static inline bool id_aa64pfr1_mte(u64 pfr1) { u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MTE_SHIFT); @@ -838,6 +852,14 @@ static inline bool system_supports_tlb_range(void) cpus_have_const_cap(ARM64_HAS_TLB_RANGE); } +static inline bool cpus_support_mpam(void) +{ + return IS_ENABLED(CONFIG_ARM64_MPAM) && + cpus_have_final_cap(ARM64_MPAM); +} + +bool mpam_detect_is_enabled(void); + static inline bool system_supports_haft(void) { return IS_ENABLED(CONFIG_ARM64_HAFT) && diff --git 
a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index 1095c6647e9665267e6aa67bac2dd7bb11b091f1..3dee0945097971458d7e98cb56f6389649f0994a 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -104,6 +104,7 @@ #define HCRX_GUEST_FLAGS (HCRX_EL2_SMPME | HCRX_EL2_TCR2En) #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En) +#define MPAMHCR_HOST_FLAGS 0 /* TCR_EL2 Registers bits */ #define TCR_EL2_RES1 ((1U << 31) | (1 << 23)) diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h new file mode 100644 index 0000000000000000000000000000000000000000..783b3929e1440f563265ceb4c771c065364ffee1 --- /dev/null +++ b/arch/arm64/include/asm/mpam.h @@ -0,0 +1,167 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. */ + +#ifndef __ASM__MPAM_H +#define __ASM__MPAM_H + +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +/* CPU Registers */ +#define MPAM_SYSREG_EN BIT_ULL(63) +#define MPAM_SYSREG_TRAP_IDR BIT_ULL(58) +#define MPAM_SYSREG_TRAP_MPAM0_EL1 BIT_ULL(49) +#define MPAM_SYSREG_TRAP_MPAM1_EL1 BIT_ULL(48) +#define MPAM_SYSREG_PMG_D GENMASK(47, 40) +#define MPAM_SYSREG_PMG_I GENMASK(39, 32) +#define MPAM_SYSREG_PARTID_D GENMASK(31, 16) +#define MPAM_SYSREG_PARTID_I GENMASK(15, 0) + +#define MPAMIDR_PMG_MAX GENMASK(39, 32) +#define MPAMIDR_PMG_MAX_SHIFT 32 +#define MPAMIDR_PMG_MAX_LEN 8 +#define MPAMIDR_VPMR_MAX GENMASK(20, 18) +#define MPAMIDR_VPMR_MAX_SHIFT 18 +#define MPAMIDR_VPMR_MAX_LEN 3 +#define MPAMIDR_HAS_HCR BIT(17) +#define MPAMIDR_HAS_HCR_SHIFT 17 +#define MPAMIDR_PARTID_MAX GENMASK(15, 0) +#define MPAMIDR_PARTID_MAX_SHIFT 0 +#define MPAMIDR_PARTID_MAX_LEN 16 + +#define MPAMHCR_EL0_VPMEN BIT_ULL(0) +#define MPAMHCR_EL1_VPMEN BIT_ULL(1) +#define MPAMHCR_GSTAPP_PLK BIT_ULL(8) +#define MPAMHCR_TRAP_MPAMIDR BIT_ULL(31) + +/* Properties of the VPM registers */ +#define MPAM_VPM_NUM_REGS 8 +#define MPAM_VPM_PARTID_LEN 
16 +#define MPAM_VPM_PARTID_MASK 0xffff +#define MPAM_VPM_REG_LEN 64 +#define MPAM_VPM_PARTIDS_PER_REG (MPAM_VPM_REG_LEN / MPAM_VPM_PARTID_LEN) +#define MPAM_VPM_MAX_PARTID (MPAM_VPM_NUM_REGS * MPAM_VPM_PARTIDS_PER_REG) + + +DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); +DECLARE_STATIC_KEY_FALSE(mpam_enabled); +DECLARE_PER_CPU(u64, arm64_mpam_default); +DECLARE_PER_CPU(u64, arm64_mpam_current); + +/* check whether all CPUs have MPAM support */ +static __always_inline bool mpam_cpus_have_feature(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + return cpus_have_final_cap(ARM64_MPAM); + return false; +} + +/* check whether all CPUs have MPAM virtualisation support */ +static __always_inline bool mpam_cpus_have_mpam_hcr(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + return static_branch_unlikely(&arm64_mpam_has_hcr); + return false; +} + +/* enable MPAM virtualisation support */ +static inline void __init __enable_mpam_hcr(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + static_branch_enable(&arm64_mpam_has_hcr); +} + +/* + * The resctrl filesystem writes to the partid/pmg values for threads and CPUs, + * which may race with reads in __mpam_sched_in(). Ensure only one of the old + * or new values are used. Particular care should be taken with the pmg field + * as __mpam_sched_in() may read a partid and pmg that don't match, causing + * this value to be stored with cache allocations, despite being considered + * 'free' by resctrl. + * + * A value in struct thread_info is used instead of struct task_struct as the + * cpu's u64 register format is used, but struct task_struct has two u32'. 
+ */ +static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i, + u8 pmg_d, u8 pmg_i) +{ + u64 default_val; + + default_val = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d); + default_val |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + default_val |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d); + default_val |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i); + + WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val); +} + +static inline void mpam_set_task_partid_pmg(struct task_struct *tsk, + u16 partid_d, u16 partid_i, + u8 pmg_d, u8 pmg_i) +{ +#ifdef CONFIG_ARM64_MPAM + u64 regval; + + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d); + regval |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i); + + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); +#endif +} + +static inline u64 mpam_get_regval(struct task_struct *tsk) +{ +#ifdef CONFIG_ARM64_MPAM + return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg); +#else + return 0; +#endif +} + +static inline void resctrl_arch_set_rmid(struct task_struct *tsk, u32 rmid) +{ +#ifdef CONFIG_ARM64_MPAM + u64 regval = mpam_get_regval(tsk); + + regval &= ~MPAM_SYSREG_PMG_D; + regval &= ~MPAM_SYSREG_PMG_I; + regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, rmid); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, rmid); + + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); +#endif +} + +static inline void mpam_thread_switch(struct task_struct *tsk) +{ + u64 oldregval; + int cpu = smp_processor_id(); + u64 regval = mpam_get_regval(tsk); + + if (!IS_ENABLED(CONFIG_ARM64_MPAM) || + !static_branch_likely(&mpam_enabled)) + return; + + if (!regval) + regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu)); + + oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); + if (oldregval == regval) + return; + + /* Synchronising this write is left until the ERET to EL0 */ + write_sysreg_s(regval, SYS_MPAM0_EL1); + 
WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval); +} +#endif /* __ASM__MPAM_H */ diff --git a/arch/arm64/include/asm/resctrl.h b/arch/arm64/include/asm/resctrl.h new file mode 100644 index 0000000000000000000000000000000000000000..044ab094740178a66db877bb6351121224a9c7e9 --- /dev/null +++ b/arch/arm64/include/asm/resctrl.h @@ -0,0 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2022 Arm Ltd. */ + +#include diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index 976761b24a2f8a4995e0c7c81543d7cb1971829f..6727dbf17478fb41e38ff0fe51276548d81d5a83 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -511,6 +511,13 @@ #define SYS_MAIR_EL2 sys_reg(3, 4, 10, 2, 0) #define SYS_AMAIR_EL2 sys_reg(3, 4, 10, 3, 0) +#define SYS_MPAMHCR_EL2 sys_reg(3, 4, 10, 4, 0) +#define SYS_MPAMVPMV_EL2 sys_reg(3, 4, 10, 4, 1) +#define SYS_MPAM2_EL2 sys_reg(3, 4, 10, 5, 0) + +#define __VPMn_op2(n) ((n) & 0x7) +#define SYS_MPAM_VPMn_EL2(n) sys_reg(3, 4, 10, 6, __VPMn_op2(n)) + #define SYS_VBAR_EL2 sys_reg(3, 4, 12, 0, 0) #define SYS_RVBAR_EL2 sys_reg(3, 4, 12, 0, 1) #define SYS_RMR_EL2 sys_reg(3, 4, 12, 0, 2) @@ -597,6 +604,7 @@ #define SYS_PMSCR_EL12 sys_reg(3, 5, 9, 9, 0) #define SYS_MAIR_EL12 sys_reg(3, 5, 10, 2, 0) #define SYS_AMAIR_EL12 sys_reg(3, 5, 10, 3, 0) +#define SYS_MPAM1_EL12 sys_reg(3, 5, 10, 5, 0) #define SYS_VBAR_EL12 sys_reg(3, 5, 12, 0, 0) #define SYS_CONTEXTIDR_EL12 sys_reg(3, 5, 13, 0, 1) #define SYS_SCXTNUM_EL12 sys_reg(3, 5, 13, 0, 7) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index d8d8b383e9d14eba95728200d3052da535503c29..b7d2412a0f5f700e0dfd2d4fadce11002f0e5e55 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -41,6 +41,9 @@ struct thread_info { #ifdef CONFIG_SHADOW_CALL_STACK void *scs_base; void *scs_sp; +#endif +#ifdef CONFIG_ARM64_MPAM + u64 mpam_partid_pmg; #endif u32 cpu; }; diff --git 
a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 010e23ccadbe4406aefc2aaeefc33ca4e729b69b..70ddad8e3784f837dd2070ef8b25311472083dbd 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o obj-$(CONFIG_SDEI_WATCHDOG) += watchdog_sdei.o obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o +obj-$(CONFIG_ARM64_MPAM) += mpam.o obj-$(CONFIG_ARM64_MTE) += mte.o obj-y += vdso-wrap.o obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 15c26c17ca48b655ab1a76e42c9d451b8405d36a..c45e8bb1ea8dc484319d164cb5711a76ba08977d 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -84,6 +84,7 @@ #include #include #include +#include #include #include #include @@ -628,6 +629,18 @@ static const struct arm64_ftr_bits ftr_smcr[] = { ARM64_FTR_END, }; +static const struct arm64_ftr_bits ftr_mpamidr[] = { + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_PMG_MAX_SHIFT, MPAMIDR_PMG_MAX_LEN, 0), /* PMG_MAX */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_VPMR_MAX_SHIFT, MPAMIDR_VPMR_MAX_LEN, 0), /* VPMR_MAX */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, + MPAMIDR_HAS_HCR_SHIFT, 1, 0), /* HAS_HCR */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_PARTID_MAX_SHIFT, MPAMIDR_PARTID_MAX_LEN, 0), /* PARTID_MAX */ + ARM64_FTR_END, +}; + /* * Common ftr bits for a 32bit register with all hidden, strict * attributes, with 4bit feature fields and a default safe value of @@ -744,6 +757,9 @@ static const struct __ftr_reg_entry { ARM64_FTR_REG(SYS_ZCR_EL1, ftr_zcr), ARM64_FTR_REG(SYS_SMCR_EL1, ftr_smcr), + /* Op1 = 0, CRn = 10, CRm = 4 */ + ARM64_FTR_REG(SYS_MPAMIDR_EL1, ftr_mpamidr), + /* Op1 = 1, CRn = 0, CRm = 0 */ ARM64_FTR_REG(SYS_GMID_EL1, ftr_gmid), @@ -1063,6 +1079,11 @@ void __init 
init_cpu_features(struct cpuinfo_arm64 *info) vec_init_vq_map(ARM64_VEC_SME); } + if (mpam_detect_is_enabled() && + (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) || + id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1))) + init_cpu_ftr_reg(SYS_MPAMIDR_EL1, info->reg_mpamidr); + if (id_aa64pfr1_mte(info->reg_id_aa64pfr1)) init_cpu_ftr_reg(SYS_GMID_EL1, info->reg_gmid); @@ -1322,6 +1343,13 @@ void update_cpu_features(int cpu, vec_update_vq_map(ARM64_VEC_SME); } + if (mpam_detect_is_enabled() && + (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) || + id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1))) { + taint |= check_update_ftr_reg(SYS_MPAMIDR_EL1, cpu, + info->reg_mpamidr, boot->reg_mpamidr); + } + /* * The kernel uses the LDGM/STGM instructions and the number of tags * they read/write depends on the GMID_EL1.BS field. Check that the @@ -2352,6 +2380,78 @@ cpucap_panic_on_conflict(const struct arm64_cpu_capabilities *cap) return !!(cap->type & ARM64_CPUCAP_PANIC_ON_CONFLICT); } +static bool __read_mostly mpam_force_enabled; +bool mpam_detect_is_enabled(void) +{ + return mpam_force_enabled; +} + +static int __init mpam_setup(char *str) +{ + mpam_force_enabled = true; + return 0; +} +early_param("arm64.mpam", mpam_setup); + +static bool __maybe_unused +test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope) +{ + u64 pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); + u64 pfr1 = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); + + if (!id_aa64pfr0_mpam(pfr0) && + !id_aa64pfr1_mpamfrac(pfr1)) + return false; + + if (is_kdump_kernel()) + return false; + + if (!mpam_detect_is_enabled()) + return false; + + /* Check firmware actually enabled MPAM on this cpu. 
*/ + return (read_sysreg_s(SYS_MPAM1_EL1) & MPAM_SYSREG_EN); +} + +static void __maybe_unused +cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) +{ + u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + int cpu = smp_processor_id(); + u64 regval = 0; + + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); + + /* + * Initialise MPAM EL2 registers and disable EL2 traps. + */ + write_sysreg_s(0, SYS_MPAM2_EL2); + + if (idr & MPAMIDR_HAS_HCR) + write_sysreg_s(0, SYS_MPAMHCR_EL2); + + /* + * Access by the kernel (at EL1) should use the reserved PARTID + * which is configured unrestricted. This avoids priority-inversion + * where latency sensitive tasks have to wait for a task that has + * been throttled to release the lock. + */ + write_sysreg_s(0, SYS_MPAM1_EL1); + write_sysreg_s(regval, SYS_MPAM0_EL1); +} + +static void mpam_extra_caps(void) +{ + u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + + if (!IS_ENABLED(CONFIG_ARM64_MPAM)) + return; + + if (idr & MPAMIDR_HAS_HCR) + __enable_mpam_hcr(); +} + static const struct arm64_cpu_capabilities arm64_features[] = { { .capability = ARM64_ALWAYS_BOOT, @@ -2889,6 +2989,15 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .cpu_enable = cpu_enable_ls64_v, ARM64_CPUID_FIELDS(ID_AA64ISAR1_EL1, LS64, LS64_V) }, +#endif +#ifdef CONFIG_ARM64_MPAM + { + .desc = "Memory Partitioning And Monitoring", + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .capability = ARM64_MPAM, + .matches = test_has_mpam, + .cpu_enable = cpu_enable_mpam, + }, #endif {}, }; @@ -3558,6 +3667,7 @@ void __init setup_cpu_features(void) sve_setup(); sme_setup(); minsigstksz_setup(); + mpam_extra_caps(); /* * Check for sane CTR_EL0.CWG value. 
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c index 21a18237fc4b5300086b9f2d7c711b5a4b9ce163..722568bd7e3eb41ffd24c466d32356d4ffcf6f43 100644 --- a/arch/arm64/kernel/cpuinfo.c +++ b/arch/arm64/kernel/cpuinfo.c @@ -462,6 +462,12 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info) if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) __cpuinfo_store_cpu_32bit(&info->aarch32); + if (IS_ENABLED(CONFIG_ARM64_MPAM) && + mpam_detect_is_enabled() && + (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) || + id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1))) + info->reg_mpamidr = read_cpuid(MPAMIDR_EL1); + cpuinfo_detect_icache_policy(info); } diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h index 35f3c795951373549faad3ed2d1bd30ad427eb78..d10d3fed31d9334662e429a5af28a1da2b217dd5 100644 --- a/arch/arm64/kernel/image-vars.h +++ b/arch/arm64/kernel/image-vars.h @@ -64,6 +64,11 @@ KVM_NVHE_ALIAS(nvhe_hyp_panic_handler); /* Vectors installed by hyp-init on reset HVC. */ KVM_NVHE_ALIAS(__hyp_stub_vectors); +/* Additional static keys for cpufeatures */ +#ifdef CONFIG_ARM64_MPAM +KVM_NVHE_ALIAS(arm64_mpam_has_hcr); +#endif + /* Static keys which are set if a vGIC trap should be handled in hyp. */ KVM_NVHE_ALIAS(vgic_v2_cpuif_trap); KVM_NVHE_ALIAS(vgic_v3_cpuif_trap); diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c new file mode 100644 index 0000000000000000000000000000000000000000..4017f6bb13c000b4e244c2dde5ca3ebbc7fdce1f --- /dev/null +++ b/arch/arm64/kernel/mpam.c @@ -0,0 +1,75 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2021 Arm Ltd. 
*/ + +#include + +#include +#include +#include +#include +#include +#include + +DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); +DEFINE_STATIC_KEY_FALSE(mpam_enabled); +DEFINE_PER_CPU(u64, arm64_mpam_default); +DEFINE_PER_CPU(u64, arm64_mpam_current); + +static int mpam_pm_notifier(struct notifier_block *self, + unsigned long cmd, void *v) +{ + u64 regval; + struct rdt_resource *r; + int i, cpu = smp_processor_id(); + + switch (cmd) { + case CPU_PM_ENTER: + if (!resctrl_mounted) + return NOTIFY_OK; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->invisible && r->is_volatile) + return NOTIFY_BAD; + } + + return NOTIFY_OK; + case CPU_PM_EXIT: + /* + * Don't use mpam_thread_switch() as the system register + * value has changed under our feet. + */ + regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); + write_sysreg_s(0, SYS_MPAM1_EL1); + write_sysreg_s(regval, SYS_MPAM0_EL1); + + return NOTIFY_OK; + default: + return NOTIFY_DONE; + } +} + +static struct notifier_block mpam_pm_nb = { + .notifier_call = mpam_pm_notifier, +}; + +static int __init arm64_mpam_register_cpus(void) +{ + u16 partid_max; + u64 mpamidr; + u8 pmg_max; + + if (is_kdump_kernel()) + return 0; + + if (!mpam_cpus_have_feature()) + return 0; + + mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr); + pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr); + + cpu_pm_register_notifier(&mpam_pm_nb); + return mpam_register_requestor(partid_max, pmg_max); +} +arch_initcall(arm64_mpam_register_cpus); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 385fb78845d696fd1d0e1d6f08bcdeff4c5027c1..dcd519994395394c951da9acf9a801f20457f0a9 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include #include @@ -552,6 +553,12 @@ struct task_struct *__switch_to(struct task_struct *prev, if (prev->thread.sctlr_user != 
next->thread.sctlr_user) update_sctlr_el1(next->thread.sctlr_user); + /* + * MPAM thread switch happens after the DSB to ensure prev's accesses + * use prev's MPAM settings. + */ + mpam_thread_switch(next); + /* the actual thread switch */ last = cpu_switch_to(prev, next); diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index 6269cab564e9f99ac0b20033dac37c6a4a9396a8..b9a26ff76211285d61f0430023cf5ac112e850c4 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -172,6 +173,35 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu) write_sysreg_s(ctxt_sys_reg(hctxt, HDFGWTR_EL2), SYS_HDFGWTR_EL2); } +static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu) +{ + u64 r = MPAM_SYSREG_TRAP_MPAM0_EL1 | MPAM_SYSREG_TRAP_MPAM1_EL1; + + if (!mpam_cpus_have_feature()) + return; + + /* trap guest access to MPAMIDR_EL1 */ + if (mpam_cpus_have_mpam_hcr()) { + write_sysreg_s(MPAMHCR_TRAP_MPAMIDR, SYS_MPAMHCR_EL2); + } else { + /* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */ + r |= MPAM_SYSREG_TRAP_IDR; + } + + write_sysreg_s(r, SYS_MPAM2_EL2); +} + +static inline void __deactivate_traps_mpam(void) +{ + if (!mpam_cpus_have_feature()) + return; + + write_sysreg_s(0, SYS_MPAM2_EL2); + + if (mpam_cpus_have_mpam_hcr()) + write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2); +} + static inline void __activate_traps_common(struct kvm_vcpu *vcpu) { /* Trap on AArch32 cp15 c15 (impdef sysregs) accesses (EL1 or EL0) */ @@ -221,6 +251,7 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu) } __activate_traps_hfgxtr(vcpu); + __activate_traps_mpam(vcpu); } static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) @@ -243,6 +274,7 @@ static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) write_sysreg_s(HCRX_HOST_FLAGS, SYS_HCRX_EL2); 
__deactivate_traps_hfgxtr(vcpu); + __deactivate_traps_mpam(); } static inline void ___activate_traps(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h index bb6b571ec627dede466c5fa1d05785a9c9f78764..c8767abd693e2d78d2337137ee3389d5b4470cf6 100644 --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h @@ -15,6 +15,7 @@ #include #include #include +#include static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt) { @@ -243,4 +244,30 @@ static inline void __sysreg32_restore_state(struct kvm_vcpu *vcpu) write_sysreg(__vcpu_sys_reg(vcpu, DBGVCR32_EL2), dbgvcr32_el2); } +/* + * The _EL0 value was written by the host's context switch, copy this into the + * guest's EL1. + */ +static inline void __mpam_guest_load(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) + write_sysreg_el1(read_sysreg_s(SYS_MPAM0_EL1), SYS_MPAM1); +} + +/* + * Copy the _EL2 register back to _EL1, clearing any trap bits EL2 may have set. + * nVHE world-switch copies the _EL1 register to _EL2. A VHE host writes to the + * _EL2 register as it is aliased by the hardware when TGE is set. 
+ */ +static inline void __mpam_guest_put(void) +{ + u64 val, mask = MPAM_SYSREG_PMG_D | MPAM_SYSREG_PMG_I | + MPAM_SYSREG_PARTID_D | MPAM_SYSREG_PARTID_I; + + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) { + val = FIELD_GET(mask, read_sysreg_s(SYS_MPAM2_EL2)); + write_sysreg_el1(val, SYS_MPAM1); + } +} + #endif /* __ARM64_KVM_HYP_SYSREG_SR_H__ */ diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index ec2f32cbd80c45f772751ea9bb7d5019677a5a36..68abbba1eef556dd825a770329f7d73456b9f77e 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -282,6 +282,13 @@ static inline bool fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code) return __fixup_guest_exit(vcpu, exit_code, handlers); } +/* Use the host thread's partid and pmg for world switch */ +static void __mpam_copy_el1_to_el2(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) + write_sysreg_s(read_sysreg_s(SYS_MPAM1_EL1), SYS_MPAM2_EL2); +} + /* Switch to the guest for legacy non-VHE systems */ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) { @@ -291,6 +298,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) bool pmu_switch_needed; u64 exit_code; + __mpam_copy_el1_to_el2(); + /* * Having IRQs masked via PMR when entering the guest means the GIC * will not signal the CPU of interrupts of lower priority, and the @@ -350,6 +359,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) __timer_enable_traps(vcpu); __debug_switch_to_guest(vcpu); + __mpam_guest_load(); do { /* Jump in the fire! 
*/ @@ -360,6 +370,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) __sysreg_save_state_nvhe(guest_ctxt); __sysreg32_save_state(vcpu); + __mpam_guest_put(); __timer_disable_traps(vcpu); __hyp_vgic_save_state(vcpu); diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c index 5ca8f74853180670fb87323b4d3d968131cf5f31..5cb4b70e0aef870a6793f7706c8b7ce428e53ef5 100644 --- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c +++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c @@ -90,6 +90,7 @@ void kvm_vcpu_load_sysregs_vhe(struct kvm_vcpu *vcpu) __sysreg32_restore_state(vcpu); __sysreg_restore_user_state(guest_ctxt); __sysreg_restore_el1_state(guest_ctxt); + __mpam_guest_load(); vcpu_set_flag(vcpu, SYSREGS_ON_CPU); diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 4eab628a5853cf17edceed4dcf390be5972260f7..6238698e095ec07141700fd67479a5268bc43b92 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -417,6 +417,23 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu, return true; } +static bool workaround_bad_mpam_abi(struct kvm_vcpu *vcpu, + struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + /* + * The ID register can't be removed without breaking migration, + * but MPAMIDR_EL1 can advertise all-zeroes, indicating there are zero + * PARTID/PMG supported by the CPU, allowing the other two trapped + * registers (MPAM1_EL1 and MPAM0_EL1) to be treated as RAZ/WI. + * Emulating MPAM1_EL1 as RAZ/WI means the guest sees the MPAMEN bit + * as clear, and realises MPAM isn't usable on this CPU. 
+ */ + p->regval = 0; + + return true; +} + static bool trap_oslsr_el1(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) @@ -2300,8 +2317,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_LOREA_EL1), trap_loregion }, { SYS_DESC(SYS_LORN_EL1), trap_loregion }, { SYS_DESC(SYS_LORC_EL1), trap_loregion }, + { SYS_DESC(SYS_MPAMIDR_EL1), workaround_bad_mpam_abi }, { SYS_DESC(SYS_LORID_EL1), trap_loregion }, + { SYS_DESC(SYS_MPAM1_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAM0_EL1), workaround_bad_mpam_abi }, { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 }, diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps index bcef3bce7a583a215ef9c35e060a6db0fdcc9be5..a7151bb471238df7f3a8084a399b28e98b2a46f1 100644 --- a/arch/arm64/tools/cpucaps +++ b/arch/arm64/tools/cpucaps @@ -58,6 +58,7 @@ HW_DBM KVM_HVHE KVM_PROTECTED_MODE MISMATCHED_CACHE_TYPE +MPAM MTE MTE_ASYMM SME diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 961bef1c0959f2046da21f694e2461dc7cd24caf..baded26ab7d751168496dd9e9b5ce86007624e09 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2665,6 +2665,22 @@ Res0 1 Field 0 EN EndSysreg +Sysreg MPAMIDR_EL1 3 0 10 4 4 +Res0 63:62 +Field 61 HAS_SDEFLT +Field 60 HAS_FORCE_NS +Field 59 SP4 +Field 58 HAS_TIDR +Field 57 HAS_ALTSP +Res0 56:40 +Field 39:32 PMG_MAX +Res0 31:21 +Field 20:18 VPMR_MAX +Field 17 HAS_HCR +Res0 16 +Field 15:0 PARTID_MAX +EndSysreg + Sysreg LORID_EL1 3 0 10 4 7 Res0 63:24 Field 23:16 LD @@ -2672,6 +2688,22 @@ Res0 15:8 Field 7:0 LR EndSysreg +Sysreg MPAM1_EL1 3 0 10 5 0 +Res0 63:48 +Field 47:40 PMG_D +Field 39:32 PMG_I +Field 31:16 PARTID_D +Field 15:0 PARTID_I +EndSysreg + +Sysreg MPAM0_EL1 3 0 10 5 1 +Res0 63:48 +Field 47:40 PMG_D +Field 39:32 PMG_I +Field 31:16 PARTID_D +Field 15:0 PARTID_I +EndSysreg + Sysreg ISR_EL1 3 0 12 1 0 Res0 63:11 Field 10 IS @@ -2685,6 
+2717,7 @@ EndSysreg Sysreg ICC_NMIAR1_EL1 3 0 12 9 5 Res0 63:24 Field 23:0 INTID + EndSysreg Sysreg TRBLIMITR_EL1 3 0 9 11 0 diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index d7163b764c626824b148c75bcd6a79ce9817ac9b..be67cb47a8880a5499ad97a94efef6d273a935a9 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3036,7 +3036,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, char name[32]; snc_mode = r->mon_scope == RESCTRL_L3_NODE; - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id); + sprintf(name, "mon_%s_%02ld", r->name, snc_mode ? d->ci->id : d->hdr.id); if (snc_mode) sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id); @@ -3088,7 +3088,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, lockdep_assert_held(&rdtgroup_mutex); snc_mode = r->mon_scope == RESCTRL_L3_NODE; - sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id); + sprintf(name, "mon_%s_%02ld", r->name, snc_mode ? 
d->ci->id : d->hdr.id); kn = kernfs_find_and_get(parent_kn, name); if (kn) { /* diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig index b3ed6212244c1e5405008355b7d0878252564251..f2fd79f22e7d836b07401eb055725b2632ba624c 100644 --- a/drivers/acpi/arm64/Kconfig +++ b/drivers/acpi/arm64/Kconfig @@ -21,3 +21,6 @@ config ACPI_AGDI config ACPI_APMT bool + +config ACPI_MPAM + bool diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile index 143debc1ba4a9d9dae6147c3c1dfc4408d05c5e9..9497a777772966608ae0e75ddc70e0f8c3b9ef53 100644 --- a/drivers/acpi/arm64/Makefile +++ b/drivers/acpi/arm64/Makefile @@ -4,4 +4,6 @@ obj-$(CONFIG_ACPI_IORT) += iort.o obj-$(CONFIG_ACPI_GTDT) += gtdt.o obj-$(CONFIG_ACPI_APMT) += apmt.o obj-$(CONFIG_ARM_AMBA) += amba.o +obj-$(CONFIG_ACPI_MPAM) += mpam.o obj-y += dma.o init.o + diff --git a/drivers/acpi/arm64/mpam.c b/drivers/acpi/arm64/mpam.c new file mode 100644 index 0000000000000000000000000000000000000000..2239cae8568474eda22cb2758af043f3f91bc4c4 --- /dev/null +++ b/drivers/acpi/arm64/mpam.c @@ -0,0 +1,368 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2022 Arm Ltd. 
+ +/* Parse the MPAM ACPI table feeding the discovered nodes into the driver */ + +#define pr_fmt(fmt) "ACPI MPAM: " fmt + +#include +#include +#include +#include +#include + +#include + +#include + +/* Flags for acpi_table_mpam_msc.*_interrupt_flags */ +#define ACPI_MPAM_MSC_IRQ_MODE_EDGE 1 +#define ACPI_MPAM_MSC_IRQ_TYPE_MASK (3<<1) +#define ACPI_MPAM_MSC_IRQ_TYPE_WIRED 0 +#define ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER (1<<3) +#define ACPI_MPAM_MSC_IRQ_AFFINITY_VALID (1<<4) + +static bool frob_irq(struct platform_device *pdev, int intid, u32 flags, + int *irq, u32 processor_container_uid) +{ + int sense; + + if (!intid) + return false; + + /* 0 in this field indicates a wired interrupt */ + if (flags & ACPI_MPAM_MSC_IRQ_TYPE_MASK) + return false; + + if (flags & ACPI_MPAM_MSC_IRQ_MODE_EDGE) + sense = ACPI_EDGE_SENSITIVE; + else + sense = ACPI_LEVEL_SENSITIVE; + + /* + * If the GSI is in the GIC's PPI range, try and create a partitioned + * percpu interrupt. + */ + if (intid >= 16 && intid < 32 && processor_container_uid != ~0) { + pr_err_once("Partitioned interrupts not supported\n"); + return false; + } else { + *irq = acpi_register_gsi(&pdev->dev, intid, sense, + ACPI_ACTIVE_HIGH); + } + if (*irq <= 0) { + pr_err_once("Failed to register interrupt 0x%x with ACPI\n", + intid); + return false; + } + + return true; +} + +static void acpi_mpam_parse_irqs(struct platform_device *pdev, + struct acpi_mpam_msc_node *tbl_msc, + struct resource *res, int *res_idx) +{ + u32 flags, aff = ~0; + int irq; + + flags = tbl_msc->overflow_interrupt_flags; + if (flags & ACPI_MPAM_MSC_IRQ_AFFINITY_VALID && + flags & ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER) + aff = tbl_msc->overflow_interrupt_affinity; + if (frob_irq(pdev, tbl_msc->overflow_interrupt, flags, &irq, aff)) { + res[*res_idx].start = irq; + res[*res_idx].end = irq; + res[*res_idx].flags = IORESOURCE_IRQ; + res[*res_idx].name = "overflow"; + + (*res_idx)++; + } + + flags = tbl_msc->error_interrupt_flags; + 
if (flags & ACPI_MPAM_MSC_IRQ_AFFINITY_VALID && + flags & ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER) + aff = tbl_msc->error_interrupt_affinity; + else + aff = ~0; + if (frob_irq(pdev, tbl_msc->error_interrupt, flags, &irq, aff)) { + res[*res_idx].start = irq; + res[*res_idx].end = irq; + res[*res_idx].flags = IORESOURCE_IRQ; + res[*res_idx].name = "error"; + + (*res_idx)++; + } +} + +static int acpi_mpam_parse_resource(struct mpam_msc *msc, + struct acpi_mpam_resource_node *res) +{ + u32 cache_id; + int level; + + switch (res->locator_type) { + case ACPI_MPAM_LOCATION_TYPE_PROCESSOR_CACHE: + cache_id = res->locator.cache_locator.cache_reference; + level = find_acpi_cache_level_from_id(cache_id); + if (level < 0) { + pr_err_once("Bad level for cache with id %u\n", cache_id); + return level; + } + return mpam_ris_create(msc, res->ris_index, MPAM_CLASS_CACHE, + level, cache_id); + case ACPI_MPAM_LOCATION_TYPE_MEMORY: + return mpam_ris_create(msc, res->ris_index, MPAM_CLASS_MEMORY, + 255, res->locator.memory_locator.proximity_domain); + default: + /* These get discovered later and treated as unknown */ + return 0; + } +} + +int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc) +{ + int i, err; + struct acpi_mpam_resource_node *resources; + + resources = (struct acpi_mpam_resource_node *)(tbl_msc + 1); + for (i = 0; i < tbl_msc->num_resouce_nodes; i++) { + err = acpi_mpam_parse_resource(msc, &resources[i]); + if (err) + return err; + } + + return 0; +} + +static bool __init parse_msc_pm_link(struct acpi_mpam_msc_node *tbl_msc, + struct platform_device *pdev, + u32 *acpi_id) +{ + bool acpi_id_valid = false; + struct acpi_device *buddy; + char hid[16], uid[16]; + int err; + + memset(&hid, 0, sizeof(hid)); + memcpy(hid, &tbl_msc->hardware_id_linked_device, + sizeof(tbl_msc->hardware_id_linked_device)); + + if (!strcmp(hid, ACPI_PROCESSOR_CONTAINER_HID)) { + *acpi_id = tbl_msc->instance_id_linked_device; + acpi_id_valid = true; + } + 
+ err = snprintf(uid, sizeof(uid), "%u", + tbl_msc->instance_id_linked_device); + if (err < 0 || err >= sizeof(uid)) + return acpi_id_valid; + + buddy = acpi_dev_get_first_match_dev(hid, uid, -1); + if (buddy) + device_link_add(&pdev->dev, &buddy->dev, DL_FLAG_STATELESS); + + return acpi_id_valid; +} + +static int decode_interface_type(struct acpi_mpam_msc_node *tbl_msc, + enum mpam_msc_iface *iface) +{ + switch (tbl_msc->interface_type) { + case 0: + *iface = MPAM_IFACE_MMIO; + return 0; + case 1: + *iface = MPAM_IFACE_PCC; + return 0; + default: + return -EINVAL; + } +} + +static int __init _parse_table(struct acpi_table_header *table) +{ + char *table_end, *table_offset = (char *)(table + 1); + struct property_entry props[4]; /* needs a sentinel */ + struct acpi_mpam_msc_node *tbl_msc; + int next_res, next_prop, err = 0; + struct acpi_device *companion; + struct platform_device *pdev; + enum mpam_msc_iface iface; + struct resource res[3]; + char uid[16]; + u32 acpi_id; + + table_end = (char *)table + table->length; + + while (table_offset < table_end) { + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + table_offset += tbl_msc->length; + + /* + * If any of the reserved fields are set, make no attempt to + * parse the msc structure. This will prevent the driver from + * probing all the MSC, meaning it can't discover the system + * wide supported partid and pmg ranges. This avoids whatever + * this MSC is truncating the partids and creating a screaming + * error interrupt. 
+ */ + if (tbl_msc->reserved || tbl_msc->reserved1 || tbl_msc->reserved2) + continue; + + if (decode_interface_type(tbl_msc, &iface)) + continue; + + next_res = 0; + next_prop = 0; + memset(res, 0, sizeof(res)); + memset(props, 0, sizeof(props)); + + pdev = platform_device_alloc("mpam_msc", tbl_msc->identifier); + if (IS_ERR(pdev)) { + err = PTR_ERR(pdev); + break; + } + + if (tbl_msc->length < sizeof(*tbl_msc)) { + err = -EINVAL; + break; + } + + /* Some power management is described in the namespace: */ + err = snprintf(uid, sizeof(uid), "%u", tbl_msc->identifier); + if (err > 0 && err < sizeof(uid)) { + companion = acpi_dev_get_first_match_dev("ARMHAA5C", uid, -1); + if (companion) + ACPI_COMPANION_SET(&pdev->dev, companion); + } + + if (iface == MPAM_IFACE_MMIO) { + res[next_res].name = "MPAM:MSC"; + res[next_res].start = tbl_msc->base_address; + res[next_res].end = tbl_msc->base_address + tbl_msc->mmio_size - 1; + res[next_res].flags = IORESOURCE_MEM; + next_res++; + } else if (iface == MPAM_IFACE_PCC) { + props[next_prop] = PROPERTY_ENTRY_U32("pcc-channel", + tbl_msc->base_address); + next_prop++; + } + + acpi_mpam_parse_irqs(pdev, tbl_msc, res, &next_res); + err = platform_device_add_resources(pdev, res, next_res); + if (err) + break; + + props[next_prop++] = PROPERTY_ENTRY_U32("arm,not-ready-us", + tbl_msc->max_nrdy_usec); + + /* + * The MSC's CPU affinity is described via its linked power + * management device, but only if it points at a Processor or + * Processor Container. 
+ */ + if (parse_msc_pm_link(tbl_msc, pdev, &acpi_id)) { + props[next_prop++] = PROPERTY_ENTRY_U32("cpu_affinity", + acpi_id); + } + + err = device_create_managed_software_node(&pdev->dev, props, + NULL); + if (err) + break; + + /* Come back later if you want the RIS too */ + err = platform_device_add_data(pdev, tbl_msc, tbl_msc->length); + if (err) + break; + + platform_device_add(pdev); + } + + if (err) + platform_device_put(pdev); + + return err; +} + +static struct acpi_table_header *get_table(void) +{ + struct acpi_table_header *table; + acpi_status status; + + if (acpi_disabled || !mpam_cpus_have_feature()) + return NULL; + + status = acpi_get_table(ACPI_SIG_MPAM, 0, &table); + if (ACPI_FAILURE(status)) + return NULL; + + if (table->revision != 1) + return NULL; + + return table; +} + + + +static int __init acpi_mpam_parse(void) +{ + struct acpi_table_header *mpam; + int err; + + mpam = get_table(); + if (!mpam) + return 0; + + err = _parse_table(mpam); + acpi_put_table(mpam); + + return err; +} + +static int _count_msc(struct acpi_table_header *table) +{ + char *table_end, *table_offset = (char *)(table + 1); + struct acpi_mpam_msc_node *tbl_msc; + int ret = 0; + + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + table_end = (char *)table + table->length; + + while (table_offset < table_end) { + if (tbl_msc->length < sizeof(*tbl_msc)) + return -EINVAL; + + ret++; + + table_offset += tbl_msc->length; + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + } + + return ret; +} + + +int acpi_mpam_count_msc(void) +{ + struct acpi_table_header *mpam; + int ret; + + mpam = get_table(); + if (!mpam) + return 0; + + ret = _count_msc(mpam); + acpi_put_table(mpam); + + return ret; +} + +/* + * Call after ACPI devices have been created, which happens behind acpi_scan_init() + * called from subsys_initcall(). PCC requires the mailbox driver, which is + * initialised from postcore_initcall(). 
+ */ +subsys_initcall_sync(acpi_mpam_parse); diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index 54676e3d82dd598a76bfbec10440eae57483365a..c28868dd62469d57d74e122b72df8c34c38762a6 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -21,6 +21,8 @@ #include #include +typedef int (*acpi_pptt_cpu_callback_t)(struct acpi_pptt_processor *, void *); + static struct acpi_subtable_header *fetch_pptt_subtable(struct acpi_table_header *table_hdr, u32 pptt_ref) { @@ -181,9 +183,10 @@ acpi_find_cache_level(struct acpi_table_header *table_hdr, * levels and split cache levels (data/instruction). * @table_hdr: Pointer to the head of the PPTT table * @cpu_node: processor node we wish to count caches for - * @levels: Number of levels if success. + * @levels: Number of levels if success. (*levels) should be initialized by + * the caller with the value to be used as the starting level. * @split_levels: Number of split cache levels (data/instruction) if - * success. Can by NULL. + * success. Can be NULL. * * Given a processor node containing a processing unit, walk into it and count * how many levels exist solely for it, and then walk up each level until we hit @@ -298,6 +301,125 @@ static struct acpi_pptt_processor *acpi_find_processor_node(struct acpi_table_he return NULL; } +/* parent_node points into the table, but the table isn't provided. 
*/ +static void acpi_pptt_get_child_cpus(struct acpi_pptt_processor *parent_node, + cpumask_t *cpus) +{ + struct acpi_pptt_processor *cpu_node; + struct acpi_table_header *table_hdr; + acpi_status status; + u32 acpi_id; + int cpu; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table_hdr); + if (ACPI_FAILURE(status)) + return; + + for_each_possible_cpu(cpu) { + acpi_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table_hdr, acpi_id); + + while (cpu_node) { + if (cpu_node == parent_node) { + cpumask_set_cpu(cpu, cpus); + break; + } + cpu_node = fetch_pptt_node(table_hdr, cpu_node->parent); + } + } + + acpi_put_table(table_hdr); +} + +/** + * acpi_pptt_for_each_container() - Iterate over all processor containers + * + * Not all 'Processor' entries in the PPTT are either a CPU or a Processor + * Container, they may exist purely to describe a Private resource. CPUs + * have to be leaves, so a Processor Container is a non-leaf that has the + * 'ACPI Processor ID valid' flag set. + * + * Return: 0 for a complete walk, or the first non-zero value from the callback + * that stopped the walk. 
+ */ +int acpi_pptt_for_each_container(acpi_pptt_cpu_callback_t callback, void *arg) +{ + struct acpi_pptt_processor *cpu_node; + struct acpi_table_header *table_hdr; + struct acpi_subtable_header *entry; + bool leaf_flag, has_leaf_flag = false; + unsigned long table_end; + acpi_status status; + u32 proc_sz; + int ret = 0; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table_hdr); + if (ACPI_FAILURE(status)) + return 0; + + if (table_hdr->revision > 1) + has_leaf_flag = true; + + table_end = (unsigned long)table_hdr + table_hdr->length; + entry = ACPI_ADD_PTR(struct acpi_subtable_header, table_hdr, + sizeof(struct acpi_table_pptt)); + proc_sz = sizeof(struct acpi_pptt_processor); + while ((unsigned long)entry + proc_sz < table_end) { + cpu_node = (struct acpi_pptt_processor *)entry; + if (entry->type == ACPI_PPTT_TYPE_PROCESSOR && + cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID) { + leaf_flag = cpu_node->flags & ACPI_PPTT_ACPI_LEAF_NODE; + if ((has_leaf_flag && !leaf_flag) || + (!has_leaf_flag && !acpi_pptt_leaf_node(table_hdr, cpu_node))) { + ret = callback(cpu_node, arg); + if (ret) + break; + } + } + entry = ACPI_ADD_PTR(struct acpi_subtable_header, entry, + entry->length); + } + + acpi_put_table(table_hdr); + + return ret; +} + +struct __cpus_from_container_arg { + u32 acpi_cpu_id; + cpumask_t *cpus; +}; + +static int __cpus_from_container(struct acpi_pptt_processor *container, void *arg) +{ + struct __cpus_from_container_arg *params = arg; + + if (container->acpi_processor_id == params->acpi_cpu_id) + acpi_pptt_get_child_cpus(container, params->cpus); + + return 0; +} + +/** + * acpi_pptt_get_cpus_from_container() - Populate a cpumask with all CPUs in a + * processor container + * + * Find the specified Processor Container, and fill cpus with all the cpus + * below it. + * + * Return: 0 for a complete walk, or an error if the mask is incomplete. 
+ */ +int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus) +{ + struct __cpus_from_container_arg params; + + params.acpi_cpu_id = acpi_cpu_id; + params.cpus = cpus; + + cpumask_clear(cpus); + return acpi_pptt_for_each_container(&__cpus_from_container, &params); +} + static u8 acpi_cache_type(enum cache_type type) { switch (type) { @@ -817,3 +939,151 @@ int find_acpi_cpu_topology_hetero_id(unsigned int cpu) return find_acpi_cpu_topology_tag(cpu, PPTT_ABORT_PACKAGE, ACPI_PPTT_ACPI_IDENTICAL); } + + +/** + * find_acpi_cache_level_from_id() - Get the level of the specified cache + * @cache_id: The id field of the unified cache + * + * Determine the level relative to any CPU for the unified cache identified by + * cache_id. This allows the property to be found even if the CPUs are offline. + * + * The returned level can be used to group unified caches that are peers. + * + * The PPTT table must be rev 3 or later. + * + * If one CPU's L2 is shared with another as L3, this function will return + * an unpredictable value. + * + * Return: -ENOENT if the PPTT doesn't exist, or the cache cannot be found. + * Otherwise returns a value which represents the level of the specified cache. + */ +int find_acpi_cache_level_from_id(u32 cache_id) +{ + u32 acpi_cpu_id; + acpi_status status; + int level, cpu, num_levels; + struct acpi_pptt_cache *cache; + struct acpi_table_header *table; + struct acpi_pptt_cache_v1 *cache_v1; + struct acpi_pptt_processor *cpu_node; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + + if (table->revision < 3) { + acpi_put_table(table); + return -ENOENT; + } + + /* + * If we found the cache first, we'd still need to walk from each CPU + * to find the level... 
+ */ + for_each_possible_cpu(cpu) { + + num_levels = 0; + acpi_cpu_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table, acpi_cpu_id); + if (!cpu_node) + break; + acpi_count_levels(table, cpu_node, &num_levels, NULL); + + /* Start at 1 for L1 */ + for (level = 1; level <= num_levels; level++) { + cache = acpi_find_cache_node(table, acpi_cpu_id, + ACPI_PPTT_CACHE_TYPE_UNIFIED, + level, &cpu_node); + if (!cache) + continue; + + cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, + cache, + sizeof(struct acpi_pptt_cache)); + + if (cache->flags & ACPI_PPTT_CACHE_ID_VALID && + cache_v1->cache_id == cache_id) { + acpi_put_table(table); + return level; + } + } + } + + acpi_put_table(table); + return -ENOENT; +} + +/** + * acpi_pptt_get_cpumask_from_cache_id() - Get the cpus associated with the + * specified cache + * @cache_id: The id field of the unified cache + * @cpus: Where to build the cpumask + * + * Determine which CPUs are below this cache in the PPTT. This allows the property + * to be found even if the CPUs are offline. + * + * The PPTT table must be rev 3 or later. + * + * Return: -ENOENT if the PPTT doesn't exist, or the cache cannot be found. + * Otherwise returns 0 and sets the cpus in the provided cpumask. + */ +int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus) +{ + u32 acpi_cpu_id; + acpi_status status; + int level, cpu, num_levels; + struct acpi_pptt_cache *cache; + struct acpi_table_header *table; + struct acpi_pptt_cache_v1 *cache_v1; + struct acpi_pptt_processor *cpu_node; + + cpumask_clear(cpus); + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + + if (table->revision < 3) { + acpi_put_table(table); + return -ENOENT; + } + + /* + * If we found the cache first, we'd still need to walk from each cpu. 
+ */ + for_each_possible_cpu(cpu) { + + num_levels = 0; + acpi_cpu_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table, acpi_cpu_id); + if (!cpu_node) + break; + acpi_count_levels(table, cpu_node, &num_levels, NULL); + + /* Start at 1 for L1 */ + for (level = 1; level <= num_levels; level++) { + cache = acpi_find_cache_node(table, acpi_cpu_id, + ACPI_PPTT_CACHE_TYPE_UNIFIED, + level, &cpu_node); + if (!cache) + continue; + + cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, + cache, + sizeof(struct acpi_pptt_cache)); + + if (cache->flags & ACPI_PPTT_CACHE_ID_VALID && + cache_v1->cache_id == cache_id) { + cpumask_set_cpu(cpu, cpus); + } + } + } + + acpi_put_table(table); + return 0; +} diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index b27bc9ee511500688ece898e4dadf767e4415594..4fca04c8dd2d40f6e0fe611131b54527b8185374 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -604,7 +604,7 @@ static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst = { ACPI_SIG_PSDT, ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, ACPI_SIG_IORT, ACPI_SIG_NFIT, ACPI_SIG_HMAT, ACPI_SIG_PPTT, ACPI_SIG_NHLT, ACPI_SIG_AEST, ACPI_SIG_CEDT, ACPI_SIG_AGDI, - ACPI_SIG_NBFT }; + ACPI_SIG_NBFT, ACPI_SIG_MPAM }; #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 0fbb1575e297ab90bcbb4ced9858bedd257abf07..24200cb949ccd57a86bd4a3714c716c92d58de30 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -183,6 +183,38 @@ static bool cache_node_is_unified(struct cacheinfo *this_leaf, return of_property_read_bool(np, "cache-unified"); } +unsigned long cache_of_get_id(struct device_node *np) +{ + struct device_node *cpu; + unsigned long min_id = ~0UL; + + for_each_of_cpu_node(cpu) { + struct device_node *cache_node = cpu; + u64 id = of_get_cpu_hwid(cache_node, 0); + + while ((cache_node = of_find_next_cache_node(cache_node))) { + if ((cache_node == np) && (id < 
min_id)) { + min_id = id; + of_node_put(cache_node); + break; + } + of_node_put(cache_node); + } + } + + return min_id; +} + +static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +{ + unsigned long id = cache_of_get_id(np); + + if (id != ~0UL) { + this_leaf->id = id; + this_leaf->attributes |= CACHE_ID; + } +} + static void cache_of_set_props(struct cacheinfo *this_leaf, struct device_node *np) { @@ -198,6 +230,7 @@ static void cache_of_set_props(struct cacheinfo *this_leaf, cache_get_line_size(this_leaf, np); cache_nr_sets(this_leaf, np); cache_associativity(this_leaf); + cache_of_set_id(this_leaf, np); } static int cache_setup_of_node(unsigned int cpu) @@ -622,13 +655,19 @@ static ssize_t file_name##_show(struct device *dev, \ return sysfs_emit(buf, "%u\n", this_leaf->object); \ } -show_one(id, id); show_one(level, level); show_one(coherency_line_size, coherency_line_size); show_one(number_of_sets, number_of_sets); show_one(physical_line_partition, physical_line_partition); show_one(ways_of_associativity, ways_of_associativity); +static ssize_t id_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct cacheinfo *this_leaf = dev_get_drvdata(dev); + + return sysfs_emit(buf, "%lu\n", this_leaf->id); +} + static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { diff --git a/drivers/base/core.c b/drivers/base/core.c index ecb5d8f42c711c32706c7d864c858ae4ebed7617..3290c91a6ffa514225ba836a7344c4ee9c3baf02 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -564,20 +565,11 @@ static struct class devlink_class = { static int devlink_add_symlinks(struct device *dev) { + char *buf_con __free(kfree) = NULL, *buf_sup __free(kfree) = NULL; int ret; - size_t len; struct device_link *link = to_devlink(dev); struct device *sup = link->supplier; struct device *con = link->consumer; - char *buf; - - len = 
max(strlen(dev_bus_name(sup)) + strlen(dev_name(sup)), - strlen(dev_bus_name(con)) + strlen(dev_name(con))); - len += strlen(":"); - len += strlen("supplier:") + 1; - buf = kzalloc(len, GFP_KERNEL); - if (!buf) - return -ENOMEM; ret = sysfs_create_link(&link->link_dev.kobj, &sup->kobj, "supplier"); if (ret) @@ -587,58 +579,64 @@ static int devlink_add_symlinks(struct device *dev) if (ret) goto err_con; - snprintf(buf, len, "consumer:%s:%s", dev_bus_name(con), dev_name(con)); - ret = sysfs_create_link(&sup->kobj, &link->link_dev.kobj, buf); + buf_con = kasprintf(GFP_KERNEL, "consumer:%s:%s", dev_bus_name(con), dev_name(con)); + if (!buf_con) { + ret = -ENOMEM; + goto err_con_dev; + } + + ret = sysfs_create_link(&sup->kobj, &link->link_dev.kobj, buf_con); if (ret) goto err_con_dev; - snprintf(buf, len, "supplier:%s:%s", dev_bus_name(sup), dev_name(sup)); - ret = sysfs_create_link(&con->kobj, &link->link_dev.kobj, buf); + buf_sup = kasprintf(GFP_KERNEL, "supplier:%s:%s", dev_bus_name(sup), dev_name(sup)); + if (!buf_sup) { + ret = -ENOMEM; + goto err_sup_dev; + } + + ret = sysfs_create_link(&con->kobj, &link->link_dev.kobj, buf_sup); if (ret) goto err_sup_dev; goto out; err_sup_dev: - snprintf(buf, len, "consumer:%s:%s", dev_bus_name(con), dev_name(con)); - sysfs_remove_link(&sup->kobj, buf); + sysfs_remove_link(&sup->kobj, buf_con); err_con_dev: sysfs_remove_link(&link->link_dev.kobj, "consumer"); err_con: sysfs_remove_link(&link->link_dev.kobj, "supplier"); out: - kfree(buf); return ret; } static void devlink_remove_symlinks(struct device *dev) { + char *buf_con __free(kfree) = NULL, *buf_sup __free(kfree) = NULL; struct device_link *link = to_devlink(dev); - size_t len; struct device *sup = link->supplier; struct device *con = link->consumer; - char *buf; sysfs_remove_link(&link->link_dev.kobj, "consumer"); sysfs_remove_link(&link->link_dev.kobj, "supplier"); - len = max(strlen(dev_bus_name(sup)) + strlen(dev_name(sup)), - strlen(dev_bus_name(con)) + 
strlen(dev_name(con))); - len += strlen(":"); - len += strlen("supplier:") + 1; - buf = kzalloc(len, GFP_KERNEL); - if (!buf) { - WARN(1, "Unable to properly free device link symlinks!\n"); - return; - } - if (device_is_registered(con)) { - snprintf(buf, len, "supplier:%s:%s", dev_bus_name(sup), dev_name(sup)); - sysfs_remove_link(&con->kobj, buf); + buf_sup = kasprintf(GFP_KERNEL, "supplier:%s:%s", dev_bus_name(sup), dev_name(sup)); + if (!buf_sup) + goto out; + sysfs_remove_link(&con->kobj, buf_sup); } - snprintf(buf, len, "consumer:%s:%s", dev_bus_name(con), dev_name(con)); - sysfs_remove_link(&sup->kobj, buf); - kfree(buf); + + buf_con = kasprintf(GFP_KERNEL, "consumer:%s:%s", dev_bus_name(con), dev_name(con)); + if (!buf_con) + goto out; + sysfs_remove_link(&sup->kobj, buf_con); + + return; + +out: + WARN(1, "Unable to properly free device link symlinks!\n"); } static struct class_interface devlink_class_intf = { diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c index 77aa37de59880f33eeb35f43747333dd5a34e2d5..2f022d1fb052b039fd31ae97fedb7ff8795901b5 100644 --- a/drivers/perf/arm-cmn.c +++ b/drivers/perf/arm-cmn.c @@ -2435,6 +2435,7 @@ static int arm_cmn_probe(struct platform_device *pdev) struct arm_cmn *cmn; const char *name; static atomic_t id; + struct resource *cfg; int err, rootnode, this_id; cmn = devm_kzalloc(&pdev->dev, sizeof(*cmn), GFP_KERNEL); @@ -2450,7 +2451,16 @@ static int arm_cmn_probe(struct platform_device *pdev) rootnode = arm_cmn600_acpi_probe(pdev, cmn); } else { rootnode = 0; - cmn->base = devm_platform_ioremap_resource(pdev, 0); + + /* + * Avoid registering resources as the PMUs registers are + * scattered through CMN, and may appear either side of + * registers for other 'devices'. (e.g. the MPAM MSC controls). 
+ */ + cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0); + if (!cfg) + return -EINVAL; + cmn->base = devm_ioremap(&pdev->dev, cfg->start, resource_size(cfg)); if (IS_ERR(cmn->base)) return PTR_ERR(cmn->base); if (cmn->part == PART_CMN600) diff --git a/drivers/platform/Kconfig b/drivers/platform/Kconfig index 868b20361769c37048ef58392d9aae43abcaf021..f26534a4a83b5aedb85c616f90db739a0ceaba4a 100644 --- a/drivers/platform/Kconfig +++ b/drivers/platform/Kconfig @@ -9,6 +9,8 @@ source "drivers/platform/chrome/Kconfig" source "drivers/platform/mellanox/Kconfig" +source "drivers/platform/mpam/Kconfig" + source "drivers/platform/olpc/Kconfig" source "drivers/platform/surface/Kconfig" diff --git a/drivers/platform/Makefile b/drivers/platform/Makefile index 8296d4c41eb7706133d909b215e6f059996256f1..54ee16e4e4d8a96e7795dfe152827693b862ad56 100644 --- a/drivers/platform/Makefile +++ b/drivers/platform/Makefile @@ -12,3 +12,4 @@ obj-$(CONFIG_OLPC_EC) += olpc/ obj-$(CONFIG_GOLDFISH) += goldfish/ obj-$(CONFIG_CHROME_PLATFORMS) += chrome/ obj-$(CONFIG_SURFACE_PLATFORMS) += surface/ +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam/ diff --git a/drivers/platform/mpam/Kconfig b/drivers/platform/mpam/Kconfig new file mode 100644 index 0000000000000000000000000000000000000000..75f5b2454fbe45b9531e6a680ecdc55df166200b --- /dev/null +++ b/drivers/platform/mpam/Kconfig @@ -0,0 +1,8 @@ +# Confusingly, this is everything but the CPU bits of MPAM. CPU here means +# CPU resources, not containers or cgroups etc. 
+config ARM_CPU_RESCTRL + bool + default y + depends on ARM64 && ARCH_HAS_CPU_RESCTRL + depends on MISC_FILESYSTEMS + select RESCTRL_RMID_DEPENDS_ON_CLOSID diff --git a/drivers/platform/mpam/Makefile b/drivers/platform/mpam/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..37693be531c34151078e6e6055d0bb4eea9099bf --- /dev/null +++ b/drivers/platform/mpam/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o mpam_resctrl.o diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c new file mode 100644 index 0000000000000000000000000000000000000000..8ea6d783b6ee02af40ce821f4771b12fdbd344ec --- /dev/null +++ b/drivers/platform/mpam/mpam_devices.c @@ -0,0 +1,2655 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2022 Arm Ltd. + +#define pr_fmt(fmt) "mpam: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include "mpam_internal.h" + +/* + * mpam_list_lock protects the SRCU lists when writing. Once the + * mpam_enabled key is enabled these lists are read-only, + * unless the error interrupt disables the driver. + */ +static DEFINE_MUTEX(mpam_list_lock); +static LIST_HEAD(mpam_all_msc); + +struct srcu_struct mpam_srcu; + +/* MPAM isn't available until all the MSC have been probed. */ +static u32 mpam_num_msc; + +static int mpam_cpuhp_state; +static DEFINE_MUTEX(mpam_cpuhp_state_lock); + +/* + * The smallest common values for any CPU or MSC in the system. + * Generating traffic outside this range will result in screaming interrupts. 
+ */ +u16 mpam_partid_max; +u8 mpam_pmg_max; +static bool partid_max_init, partid_max_published; +static DEFINE_SPINLOCK(partid_max_lock); + +/* + * mpam is enabled once all devices have been probed from CPU online callbacks, + * scheduled via this work_struct. If access to an MSC depends on a CPU that + * was not brought online at boot, this can happen surprisingly late. + */ +static DECLARE_WORK(mpam_enable_work, &mpam_enable); + +/* + * All mpam error interrupts indicate a software bug. On receipt, disable the + * driver. + */ +static DECLARE_WORK(mpam_broken_work, &mpam_disable); + +/* + * An MSC is a container for resources, each identified by their RIS index. + * Components are a group of RIS that control the same thing. + * Classes are the set of components of the same type. + * + * e.g. The set of RIS that make up the L2 are a component. These are sometimes + * termed slices. They should be configured as if they were one MSC. + * + * e.g. The SoC probably has more than one L2, each attached to a distinct set + * of CPUs. All the L2 components are grouped as a class. + * + * When creating an MSC, struct mpam_msc is added to the mpam_all_msc list, + * then linked via struct mpam_msc_ris to a component and a class. + * The same MSC may exist under different class->component paths, but the RIS + * index will be unique. 
+ */ +LIST_HEAD(mpam_classes); + +static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) +{ + WARN_ON_ONCE(reg + sizeof(u32) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + return readl_relaxed(msc->mapped_hwpage + reg); +} + +static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) +{ + WARN_ON_ONCE(reg + sizeof(u32) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + writel_relaxed(val, msc->mapped_hwpage + reg); +} + +#define mpam_read_partsel_reg(msc, reg) \ +({ \ + u32 ____ret; \ + \ + lockdep_assert_held_once(&msc->part_sel_lock); \ + ____ret = __mpam_read_reg(msc, MPAMF_##reg); \ + \ + pr_debug("par r: msc %d reg 0x%x val 0x%x\n", \ + msc->id, MPAMF_##reg, ____ret); \ + ____ret; \ +}) + +#define mpam_write_partsel_reg(msc, reg, val) \ +({ \ + lockdep_assert_held_once(&msc->part_sel_lock); \ + pr_debug("par w: msc %d reg 0x%x val 0x%lx\n", \ + msc->id, MPAMCFG_##reg, (unsigned long)val); \ + __mpam_write_reg(msc, MPAMCFG_##reg, val); \ +}) + +#define mpam_read_monsel_reg(msc, reg) \ +({ \ + u32 ____ret; \ + \ + lockdep_assert_held_once(&msc->mon_sel_lock); \ + ____ret = __mpam_read_reg(msc, MSMON_##reg); \ + \ + pr_debug("mon r: msc %d reg 0x%x val 0x%x\n", \ + msc->id, MSMON_##reg, ____ret); \ + ____ret; \ +}) + +#define mpam_write_monsel_reg(msc, reg, val) \ +({ \ + lockdep_assert_held_once(&msc->mon_sel_lock); \ + pr_debug("mon w: msc %d reg 0x%x val 0x%lx\n", \ + msc->id, MSMON_##reg, (unsigned long)val); \ + __mpam_write_reg(msc, MSMON_##reg, val); \ +}) + +static u64 mpam_msc_read_idr(struct mpam_msc *msc) +{ + u64 idr_high = 0, idr_low; + + lockdep_assert_held(&msc->part_sel_lock); + + idr_low = mpam_read_partsel_reg(msc, IDR); + if (FIELD_GET(MPAMF_IDR_HAS_EXT, idr_low)) + idr_high = mpam_read_partsel_reg(msc, IDR + 4); + + return (idr_high << 32) | idr_low; +} + +static void mpam_msc_zero_esr(struct mpam_msc *msc) +{ + 
writel_relaxed(0, msc->mapped_hwpage + MPAMF_ESR); + if (msc->has_extd_esr) + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ESR + 4); +} + +static u64 mpam_msc_read_esr(struct mpam_msc *msc) +{ + u64 esr_high = 0, esr_low; + + esr_low = readl_relaxed(msc->mapped_hwpage + MPAMF_ESR); + if (msc->has_extd_esr) + esr_high = readl_relaxed(msc->mapped_hwpage + MPAMF_ESR + 4); + + return (esr_high << 32) | esr_low; +} + +static void __mpam_part_sel(u8 ris_idx, u16 partid, struct mpam_msc *msc) +{ + u32 partsel; + + lockdep_assert_held(&msc->part_sel_lock); + + partsel = FIELD_PREP(MPAMCFG_PART_SEL_RIS, ris_idx) | + FIELD_PREP(MPAMCFG_PART_SEL_PARTID_SEL, partid); + mpam_write_partsel_reg(msc, PART_SEL, partsel); +} + +int mpam_register_requestor(u16 partid_max, u8 pmg_max) +{ + int err = 0; + + spin_lock(&partid_max_lock); + if (!partid_max_init) { + mpam_partid_max = partid_max; + mpam_pmg_max = pmg_max; + partid_max_init = true; + } else if (!partid_max_published) { + mpam_partid_max = min(mpam_partid_max, partid_max); + mpam_pmg_max = min(mpam_pmg_max, pmg_max); + } else { + /* New requestors can't lower the values */ + if ((partid_max < mpam_partid_max) || (pmg_max < mpam_pmg_max)) + err = -EBUSY; + } + spin_unlock(&partid_max_lock); + + return err; +} +EXPORT_SYMBOL(mpam_register_requestor); + +static struct mpam_component * +mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) +{ + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + comp = kzalloc(sizeof(*comp), gfp); + if (!comp) + return ERR_PTR(-ENOMEM); + + comp->comp_id = id; + INIT_LIST_HEAD_RCU(&comp->ris); + /* affinity is updated when ris are added */ + INIT_LIST_HEAD_RCU(&comp->class_list); + comp->class = class; + + list_add_rcu(&comp->class_list, &class->components); + + return comp; +} + +static struct mpam_component * +mpam_component_get(struct mpam_class *class, int id, bool alloc, gfp_t gfp) +{ + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); 
+ + list_for_each_entry(comp, &class->components, class_list) { + if (comp->comp_id == id) + return comp; + } + + if (!alloc) + return ERR_PTR(-ENOENT); + + return mpam_component_alloc(class, id, gfp); +} + +static struct mpam_class * +mpam_class_alloc(u8 level_idx, enum mpam_class_types type, gfp_t gfp) +{ + struct mpam_class *class; + + lockdep_assert_held(&mpam_list_lock); + + class = kzalloc(sizeof(*class), gfp); + if (!class) + return ERR_PTR(-ENOMEM); + + INIT_LIST_HEAD_RCU(&class->components); + /* affinity is updated when ris are added */ + class->level = level_idx; + class->type = type; + INIT_LIST_HEAD_RCU(&class->classes_list); + ida_init(&class->ida_csu_mon); + ida_init(&class->ida_mbwu_mon); + + list_add_rcu(&class->classes_list, &mpam_classes); + + return class; +} + +static struct mpam_class * +mpam_class_get(u8 level_idx, enum mpam_class_types type, bool alloc, gfp_t gfp) +{ + bool found = false; + struct mpam_class *class; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + if (class->type == type && class->level == level_idx) { + found = true; + break; + } + } + + if (found) + return class; + + if (!alloc) + return ERR_PTR(-ENOENT); + + return mpam_class_alloc(level_idx, type, gfp); +} + +static void mpam_class_destroy(struct mpam_class *class) +{ + lockdep_assert_held(&mpam_list_lock); + + list_del_rcu(&class->classes_list); + synchronize_srcu(&mpam_srcu); + kfree(class); +} + +static void mpam_comp_destroy(struct mpam_component *comp) +{ + struct mpam_class *class = comp->class; + + lockdep_assert_held(&mpam_list_lock); + + list_del_rcu(&comp->class_list); + synchronize_srcu(&mpam_srcu); + kfree(comp); + + if (list_empty(&class->components)) + mpam_class_destroy(class); +} + +/* synchronise_srcu() before freeing ris */ +static void mpam_ris_destroy(struct mpam_msc_ris *ris) +{ + struct mpam_component *comp = ris->comp; + struct mpam_class *class = comp->class; + struct mpam_msc *msc = 
ris->msc; + + lockdep_assert_held(&mpam_list_lock); + lockdep_assert_preemption_enabled(); + + clear_bit(ris->ris_idx, msc->ris_idxs); + list_del_rcu(&ris->comp_list); + list_del_rcu(&ris->msc_list); + + cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity); + cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity); + + if (list_empty(&comp->ris)) + mpam_comp_destroy(comp); +} + +/* + * There are two ways of reaching a struct mpam_msc_ris. Via the + * class->component->ris, or via the msc. + * When destroying the msc, the other side needs unlinking and cleaning up too. + * synchronise_srcu() before freeing msc. + */ +static void mpam_msc_destroy(struct mpam_msc *msc) +{ + struct mpam_msc_ris *ris, *tmp; + + lockdep_assert_held(&mpam_list_lock); + lockdep_assert_preemption_enabled(); + + list_for_each_entry_safe(ris, tmp, &msc->ris, msc_list) + mpam_ris_destroy(ris); +} + +/* + * The cacheinfo structures are only populated when CPUs are online. + * This helper walks the device tree to include offline CPUs too. + */ +static int get_cpumask_from_cache_id(u32 cache_id, u32 cache_level, + cpumask_t *affinity) +{ + int cpu, err; + u32 iter_level; + int iter_cache_id; + struct device_node *iter; + + if (!acpi_disabled) + return acpi_pptt_get_cpumask_from_cache_id(cache_id, affinity); + + for_each_possible_cpu(cpu) { + iter = of_get_cpu_node(cpu, NULL); + if (!iter) { + pr_err("Failed to find cpu%d device node\n", cpu); + return -ENOENT; + } + + while ((iter = of_find_next_cache_node(iter))) { + err = of_property_read_u32(iter, "cache-level", + &iter_level); + if (err || (iter_level != cache_level)) { + of_node_put(iter); + continue; + } + + /* + * get_cpu_cacheinfo_id() isn't ready until sometime + * during device_initcall(). Use cache_of_get_id(). 
+ */ + iter_cache_id = cache_of_get_id(iter); + if (iter_cache_id == (~0)) { + of_node_put(iter); + continue; + } + + if (iter_cache_id == cache_id) + cpumask_set_cpu(cpu, affinity); + + of_node_put(iter); + } + } + + return 0; +} + + +/* + * cpumask_of_node() only knows about online CPUs. This can't tell us whether + * a class is represented on all possible CPUs. + */ +static void get_cpumask_from_node_id(u32 node_id, cpumask_t *affinity) +{ + int cpu; + + for_each_possible_cpu(cpu) { + if (node_id == cpu_to_node(cpu)) + cpumask_set_cpu(cpu, affinity); + } +} + +static int get_cpumask_from_cache(struct device_node *cache, + cpumask_t *affinity) +{ + int err; + u32 cache_level; + int cache_id; + + err = of_property_read_u32(cache, "cache-level", &cache_level); + if (err) { + pr_err("Failed to read cache-level from cache node\n"); + return -ENOENT; + } + + cache_id = cache_of_get_id(cache); + if (cache_id == ~0UL) { + pr_err("Failed to calculate cache-id from cache node\n"); + return -ENOENT; + } + + return get_cpumask_from_cache_id(cache_id, cache_level, affinity); +} + +static int mpam_ris_get_affinity(struct mpam_msc *msc, cpumask_t *affinity, + enum mpam_class_types type, + struct mpam_class *class, + struct mpam_component *comp) +{ + int err; + + switch (type) { + case MPAM_CLASS_CACHE: + err = get_cpumask_from_cache_id(comp->comp_id, class->level, + affinity); + if (err) + return err; + + if (cpumask_empty(affinity)) + pr_warn_once("%s no CPUs associated with cache node", + dev_name(&msc->pdev->dev)); + + break; + case MPAM_CLASS_MEMORY: + get_cpumask_from_node_id(comp->comp_id, affinity); + if (cpumask_empty(affinity)) + pr_warn_once("%s no CPUs associated with memory node", + dev_name(&msc->pdev->dev)); + break; + case MPAM_CLASS_UNKNOWN: + return 0; + } + + cpumask_and(affinity, affinity, &msc->accessibility); + + pr_debug("comp_id %d msc->id %d affinity %*pb\n", + comp->comp_id, msc->id, cpumask_pr_args(affinity)); + + return 0; +} + +static int 
mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, + int component_id, gfp_t gfp) +{ + int err; + struct mpam_msc_ris *ris; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + if (test_and_set_bit(ris_idx, msc->ris_idxs)) + return -EBUSY; + + ris = devm_kzalloc(&msc->pdev->dev, sizeof(*ris), gfp); + if (!ris) + return -ENOMEM; + + class = mpam_class_get(class_id, type, true, gfp); + if (IS_ERR(class)) + return PTR_ERR(class); + + comp = mpam_component_get(class, component_id, true, gfp); + if (IS_ERR(comp)) { + if (list_empty(&class->components)) + mpam_class_destroy(class); + return PTR_ERR(comp); + } + + err = mpam_ris_get_affinity(msc, &ris->affinity, type, class, comp); + if (err) { + if (list_empty(&class->components)) + mpam_class_destroy(class); + return err; + } + + ris->ris_idx = ris_idx; + INIT_LIST_HEAD_RCU(&ris->comp_list); + INIT_LIST_HEAD_RCU(&ris->msc_list); + ris->msc = msc; + ris->comp = comp; + + cpumask_or(&comp->affinity, &comp->affinity, &ris->affinity); + cpumask_or(&class->affinity, &class->affinity, &ris->affinity); + list_add_rcu(&ris->comp_list, &comp->ris); + list_add_rcu(&ris->msc_list, &msc->ris); + + return 0; +} + +int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id) +{ + int err; + + mutex_lock(&mpam_list_lock); + err = mpam_ris_create_locked(msc, ris_idx, type, class_id, + component_id, GFP_KERNEL); + mutex_unlock(&mpam_list_lock); + + return err; +} + +static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc, + u8 ris_idx) +{ + int err; + struct mpam_msc_ris *ris, *found = ERR_PTR(-ENOENT); + + lockdep_assert_held(&mpam_list_lock); + + if (!test_bit(ris_idx, msc->ris_idxs)) { + err = mpam_ris_create_locked(msc, ris_idx, MPAM_CLASS_UNKNOWN, + 0, 0, GFP_ATOMIC); + if (err) + return ERR_PTR(err); + } + + list_for_each_entry(ris, &msc->ris, msc_list) { + if 
(ris->ris_idx == ris_idx) { + found = ris; + break; + } + } + + return found; +} + +u16 mpam_cpbm_wd_hisi_workaround(u16 cpbm_wd, enum mpam_device_features feat, + u8 cache_level) +{ + static const struct midr_range cpus[] = { + MIDR_ALL_VERSIONS(MIDR_HISI_HIP12), + { /* sentinel */ } + }; + + if (cache_level != 3) + return cpbm_wd; + + if (is_midr_in_range_list(read_cpuid_id(), cpus)) { + if (feat == mpam_feat_cpor_part) + return 19; + else if (feat == mpam_feat_ccap_part || + feat == mpam_feat_cmin) + return 21; + } + + return cpbm_wd; +} + +static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) +{ + int err; + struct mpam_msc *msc = ris->msc; + struct mpam_props *props = &ris->props; + struct mpam_class *class = ris->comp->class; + + lockdep_assert_held(&msc->lock); + lockdep_assert_held(&msc->part_sel_lock); + + /* Cache Capacity Partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_CCAP_PART, ris->idr)) { + u32 ccap_features = mpam_read_partsel_reg(msc, CCAP_IDR); + + props->cmax_wd = FIELD_GET(MPAMF_CCAP_IDR_CMAX_WD, ccap_features); + + if (props->cmax_wd) { + if (!FIELD_GET(MPAMF_CCAP_IDR_NO_CMAX, ccap_features)) + mpam_set_feature(mpam_feat_ccap_part, props); + + if (FIELD_GET(MPAMF_CCAP_IDR_HAS_CMIN, ccap_features)) + mpam_set_feature(mpam_feat_cmin, props); + } + } + + /* Cache Portion partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_CPOR_PART, ris->idr)) { + u32 cpor_features = mpam_read_partsel_reg(msc, CPOR_IDR); + + props->cpbm_wd = mpam_cpbm_wd_hisi_workaround( + FIELD_GET(MPAMF_CPOR_IDR_CPBM_WD, cpor_features), + mpam_feat_cpor_part, class->level); + if (props->cpbm_wd) + mpam_set_feature(mpam_feat_cpor_part, props); + } + + /* Memory bandwidth partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_MBW_PART, ris->idr)) { + u32 mbw_features = mpam_read_partsel_reg(msc, MBW_IDR); + + /* portion bitmap resolution */ + props->mbw_pbm_bits = FIELD_GET(MPAMF_MBW_IDR_BWPBM_WD, mbw_features); + if (props->mbw_pbm_bits && + FIELD_GET(MPAMF_MBW_IDR_HAS_PBM, mbw_features)) + 
mpam_set_feature(mpam_feat_mbw_part, props); + + props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) { + mpam_set_feature(mpam_feat_mbw_max, props); + mpam_set_feature(mpam_feat_max_limit, props); + } + + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MIN, mbw_features)) + mpam_set_feature(mpam_feat_mbw_min, props); + + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_PROP, mbw_features)) + mpam_set_feature(mpam_feat_mbw_prop, props); + } + + /* Priority partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_PRI_PART, ris->idr)) { + u32 pri_features = mpam_read_partsel_reg(msc, PRI_IDR); + + props->intpri_wd = FIELD_GET(MPAMF_PRI_IDR_INTPRI_WD, pri_features); + if (props->intpri_wd && FIELD_GET(MPAMF_PRI_IDR_HAS_INTPRI, pri_features)) { + mpam_set_feature(mpam_feat_intpri_part, props); + if (FIELD_GET(MPAMF_PRI_IDR_INTPRI_0_IS_LOW, pri_features)) + mpam_set_feature(mpam_feat_intpri_part_0_low, props); + } + + props->dspri_wd = FIELD_GET(MPAMF_PRI_IDR_DSPRI_WD, pri_features); + if (props->dspri_wd && FIELD_GET(MPAMF_PRI_IDR_HAS_DSPRI, pri_features)) { + mpam_set_feature(mpam_feat_dspri_part, props); + if (FIELD_GET(MPAMF_PRI_IDR_DSPRI_0_IS_LOW, pri_features)) + mpam_set_feature(mpam_feat_dspri_part_0_low, props); + } + } + + /* Performance Monitoring */ + if (FIELD_GET(MPAMF_IDR_HAS_MSMON, ris->idr)) { + u32 msmon_features = mpam_read_partsel_reg(msc, MSMON_IDR); + + if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_CSU, msmon_features)) { + u32 csumonidr, discard; + + /* + * If the firmware max-nrdy-us property is missing, the + * CSU counters can't be used. Should we wait forever? 
+ */ + err = device_property_read_u32(&msc->pdev->dev, + "arm,not-ready-us", + &discard); + + csumonidr = mpam_read_partsel_reg(msc, CSUMON_IDR); + props->num_csu_mon = FIELD_GET(MPAMF_CSUMON_IDR_NUM_MON, csumonidr); + if (props->num_csu_mon && !err) + mpam_set_feature(mpam_feat_msmon_csu, props); + else if (props->num_csu_mon) + pr_err_once("Counters are not usable because not-ready timeout was not provided by firmware."); + } + if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_MBWU, msmon_features)) { + bool has_long; + u32 mbwumonidr = mpam_read_partsel_reg(msc, MBWUMON_IDR); + + props->num_mbwu_mon = FIELD_GET(MPAMF_MBWUMON_IDR_NUM_MON, mbwumonidr); + if (props->num_mbwu_mon) + mpam_set_feature(mpam_feat_msmon_mbwu, props); + + if (FIELD_GET(MPAMF_MBWUMON_IDR_HAS_RWBW, mbwumonidr)) + mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); + + /* + * Treat long counter and its extension, lwd as mutually + * exclusive feature bits. Though these are dependent + * fields at the implementation level, there would never + * be a need for mpam_feat_msmon_mbwu_44counter (long + * counter) and mpam_feat_msmon_mbwu_63counter (lwd) + * bits to be set together. + * + * mpam_feat_msmon_mbwu isn't treated as an exclusive + * bit as this feature bit would be used as the "front + * facing feature bit" for any checks related to mbwu + * monitors. + */ + has_long = FIELD_GET(MPAMF_MBWUMON_IDR_HAS_LONG, mbwumonidr); + if (props->num_mbwu_mon && has_long) { + if (FIELD_GET(MPAMF_MBWUMON_IDR_LWD, mbwumonidr)) + mpam_set_feature(mpam_feat_msmon_mbwu_63counter, props); + else + mpam_set_feature(mpam_feat_msmon_mbwu_44counter, props); + } + } + } + + /* + * RIS with PARTID narrowing don't have enough storage for one + * configuration per PARTID. If these are in a class we could use, + * reduce the supported partid_max to match the number of intpartid. + * If the class is unknown, just ignore it. 
+ */ + if (FIELD_GET(MPAMF_IDR_HAS_PARTID_NRW, ris->idr) && + class->type != MPAM_CLASS_UNKNOWN) { + u32 nrwidr = mpam_read_partsel_reg(msc, PARTID_NRW_IDR); + u16 partid_max = FIELD_GET(MPAMF_PARTID_NRW_IDR_INTPARTID_MAX, nrwidr); + + mpam_set_feature(mpam_feat_partid_nrw, props); + msc->partid_max = min(msc->partid_max, partid_max); + } +} + +static int mpam_msc_hw_probe(struct mpam_msc *msc) +{ + u64 idr; + u16 partid_max; + u8 ris_idx, pmg_max; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + spin_lock(&msc->part_sel_lock); + idr = mpam_read_partsel_reg(msc, AIDR); + if ((idr & MPAMF_AIDR_ARCH_MAJOR_REV) != MPAM_ARCHITECTURE_V1) { + pr_err_once("%s does not match MPAM architecture v1.0\n", + dev_name(&msc->pdev->dev)); + spin_unlock(&msc->part_sel_lock); + return -EIO; + } + + idr = mpam_msc_read_idr(msc); + spin_unlock(&msc->part_sel_lock); + + msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr); + + /* Use these values so partid/pmg always starts with a valid value */ + msc->partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr); + msc->pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); + + for (ris_idx = 0; ris_idx <= msc->ris_max; ris_idx++) { + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris_idx, 0, msc); + idr = mpam_msc_read_idr(msc); + spin_unlock(&msc->part_sel_lock); + + partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr); + pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); + msc->partid_max = min(msc->partid_max, partid_max); + msc->pmg_max = min(msc->pmg_max, pmg_max); + msc->has_extd_esr = FIELD_GET(MPAMF_IDR_HAS_EXT_ESR, idr); + + ris = mpam_get_or_create_ris(msc, ris_idx); + if (IS_ERR(ris)) + return PTR_ERR(ris); + ris->idr = idr; + + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris_idx, 0, msc); + mpam_ris_hw_probe(ris); + spin_unlock(&msc->part_sel_lock); + } + + spin_lock(&partid_max_lock); + mpam_partid_max = min(mpam_partid_max, msc->partid_max); + mpam_pmg_max = min(mpam_pmg_max, msc->pmg_max); + spin_unlock(&partid_max_lock); + 
+ msc->probed = true; + + return 0; +} + +struct mon_read { + struct mpam_msc_ris *ris; + struct mon_cfg *ctx; + enum mpam_device_features type; + u64 *val; + int err; +}; + +static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris) +{ + return (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props) || + mpam_has_feature(mpam_feat_msmon_mbwu_44counter, &ris->props)); +} + +static u64 mpam_msc_read_mbwu_l(struct mpam_msc *msc) +{ + int retry = 3; + u32 mbwu_l_low; + u64 mbwu_l_high1, mbwu_l_high2; + + lockdep_assert_held_once(&msc->mon_sel_lock); + + WARN_ON_ONCE((MSMON_MBWU_L + sizeof(u64)) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + mbwu_l_high2 = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L + 4); + do { + mbwu_l_high1 = mbwu_l_high2; + mbwu_l_low = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L); + mbwu_l_high2 = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L + 4); + + retry--; + } while (mbwu_l_high1 != mbwu_l_high2 && retry > 0); + + if (mbwu_l_high2 == mbwu_l_high1) + return (mbwu_l_high1 << 32) | mbwu_l_low; + return MSMON___NRDY_L; +} + +static void mpam_msc_zero_mbwu_l(struct mpam_msc *msc) +{ + lockdep_assert_held_once(&msc->mon_sel_lock); + + WARN_ON_ONCE((MSMON_MBWU_L + sizeof(u64)) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + writel_relaxed(0, msc->mapped_hwpage + MSMON_MBWU_L); + writel_relaxed(0, msc->mapped_hwpage + MSMON_MBWU_L + 4); +} + +static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, + u32 *flt_val) +{ + struct mon_cfg *ctx = m->ctx; + + switch (m->type) { + case mpam_feat_msmon_csu: + *ctl_val = MSMON_CFG_MBWU_CTL_TYPE_CSU; + break; + case mpam_feat_msmon_mbwu: + *ctl_val = MSMON_CFG_MBWU_CTL_TYPE_MBWU; + break; + default: + return; + } + + /* + * For CSU counters it's implementation-defined what happens when not + * filtering by partid. 
+ */ + *ctl_val |= MSMON_CFG_x_CTL_MATCH_PARTID; + + *flt_val = FIELD_PREP(MSMON_CFG_MBWU_FLT_PARTID, ctx->partid); + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_RWBW, ctx->opts); + if (m->ctx->match_pmg) { + *ctl_val |= MSMON_CFG_x_CTL_MATCH_PMG; + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_PMG, ctx->pmg); + } + + if (mpam_has_feature(mpam_feat_msmon_mbwu_rwbw, &m->ris->props)) + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_RWBW, ctx->opts); +} + +static void read_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, + u32 *flt_val) +{ + struct mpam_msc *msc = m->ris->msc; + + switch (m->type) { + case mpam_feat_msmon_csu: + *ctl_val = mpam_read_monsel_reg(msc, CFG_CSU_CTL); + *flt_val = mpam_read_monsel_reg(msc, CFG_CSU_FLT); + break; + case mpam_feat_msmon_mbwu: + *ctl_val = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + *flt_val = mpam_read_monsel_reg(msc, CFG_MBWU_FLT); + break; + default: + return; + } +} + +static bool mpam_csu_hisi_need_retrigger(struct mpam_msc_ris *ris, + bool read_again) +{ + static const struct midr_range cpus[] = { + MIDR_ALL_VERSIONS(MIDR_HISI_HIP12), + { /* sentinel */ } + }; + + if (ris->comp->class->type != MPAM_CLASS_CACHE || + ris->comp->class->level != 3) + return false; + + if (!is_midr_in_range_list(read_cpuid_id(), cpus)) + return false; + + if (read_again) + return false; + + return true; +} + +static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, + u32 flt_val) +{ + struct mpam_msc *msc = m->ris->msc; + struct msmon_mbwu_state *mbwu_state; + + /* + * Write the ctl_val with the enable bit cleared, reset the counter, + * then enable counter. 
+ */ + switch (m->type) { + case mpam_feat_msmon_csu: + mpam_write_monsel_reg(msc, CFG_CSU_FLT, flt_val); + mpam_write_monsel_reg(msc, CFG_CSU_CTL, ctl_val); + mpam_write_monsel_reg(msc, CSU, 0); + mpam_write_monsel_reg(msc, CFG_CSU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + break; + case mpam_feat_msmon_mbwu: + mpam_write_monsel_reg(msc, CFG_MBWU_FLT, flt_val); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); + + if (mpam_ris_has_mbwu_long_counter(m->ris)) + mpam_msc_zero_mbwu_l(m->ris->msc); + else + mpam_write_monsel_reg(msc, MBWU, 0); + + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + + mbwu_state = &m->ris->mbwu_state[m->ctx->mon]; + if (mbwu_state) + mbwu_state->prev_val = 0; + + break; + default: + return; + } +} + +static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris) +{ + /* TODO: implement scaling counters */ + if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props)) + return GENMASK_ULL(62, 0); + else if (mpam_has_feature(mpam_feat_msmon_mbwu_44counter, &ris->props)) + return GENMASK_ULL(43, 0); + else + return GENMASK_ULL(30, 0); +} + +bool resctrl_arch_would_mbm_overflow(void) +{ + return read_cpuid_implementor() != ARM_CPU_IMP_HISI; +} + +static bool mpam_ris_has_nrdy_bit(struct mpam_msc_ris *ris) +{ + if (ris->comp->class->type == MPAM_CLASS_MEMORY) + return read_cpuid_implementor() != ARM_CPU_IMP_HISI; + + return true; +} + +static u64 mpam_csu_hisi_need_halved(struct mpam_msc_ris *ris, u64 now) +{ + static const struct midr_range cpus[] = { + MIDR_ALL_VERSIONS(MIDR_HISI_HIP12), + { /* sentinel */ } + }; + + if (!is_midr_in_range_list(read_cpuid_id(), cpus)) + return now; + + if (ris->comp->class->type != MPAM_CLASS_CACHE || + ris->comp->class->level != 3) + return now; + + return now >> 1; +} + +static bool read_msmon_mbwu_is_overflow(struct mpam_msc *msc) +{ + u32 ctl; + bool overflow; + + ctl = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + overflow = ctl & MSMON_CFG_x_CTL_OFLOW_STATUS ? 
true : false; + + if (overflow) + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl & + ~MSMON_CFG_x_CTL_OFLOW_STATUS); + + return overflow; +} + +static void __ris_msmon_read(void *arg) +{ + bool nrdy = false; + unsigned long flags; + bool config_mismatch; + struct mon_read *m = arg; + u64 now, overflow_val = 0; + bool mbwu_overflow = false; + struct mon_cfg *ctx = m->ctx; + bool reset_on_next_read = false; + struct mpam_msc_ris *ris = m->ris; + struct mpam_msc *msc = m->ris->msc; + struct msmon_mbwu_state *mbwu_state = NULL; + u32 mon_sel, ctl_val, flt_val, cur_ctl, cur_flt; + + lockdep_assert_held(&msc->lock); + + spin_lock_irqsave(&msc->mon_sel_lock, flags); + mon_sel = FIELD_PREP(MSMON_CFG_MON_SEL_MON_SEL, ctx->mon) | + FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); + mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + + /* Selects a monitor instance to configure PARTID. */ + wmb(); + + if (m->type == mpam_feat_msmon_mbwu) { + mbwu_state = &ris->mbwu_state[ctx->mon]; + if (mbwu_state) { + reset_on_next_read = mbwu_state->reset_on_next_read; + mbwu_state->reset_on_next_read = false; + } + + mbwu_overflow = read_msmon_mbwu_is_overflow(msc); + } + + /* + * Read the existing configuration to avoid re-writing the same values. + * This saves waiting for 'nrdy' on subsequent reads. + */ + read_msmon_ctl_flt_vals(m, &cur_ctl, &cur_flt); + gen_msmon_ctl_flt_vals(m, &ctl_val, &flt_val); + config_mismatch = cur_flt != flt_val || + cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN); + + if (config_mismatch || reset_on_next_read || + mpam_csu_hisi_need_retrigger(ris, m->err == -EBUSY)) { + write_msmon_ctl_flt_vals(m, ctl_val, flt_val); + if (mbwu_state) { + mbwu_state->prev_val = 0; + mbwu_state->correction = 0; + mbwu_overflow = false; + } + } + + /* + * Selects the monitor instance associated to the specified PARTID + * to read counter value. 
+ */ + wmb(); + + switch (m->type) { + case mpam_feat_msmon_csu: + now = mpam_read_monsel_reg(msc, CSU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); + now = mpam_csu_hisi_need_halved(ris, now); + break; + case mpam_feat_msmon_mbwu: + /* + * If long or lwd counters are supported, use them, else revert + * to the 32 bit counter. + */ + if (mpam_ris_has_mbwu_long_counter(ris)) { + now = mpam_msc_read_mbwu_l(msc); + nrdy = now & MSMON___NRDY_L; + if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props)) + now = FIELD_GET(MSMON___LWD_VALUE, now); + else + now = FIELD_GET(MSMON___L_VALUE, now); + } else { + now = mpam_read_monsel_reg(msc, MBWU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); + } + + if (config_mismatch && !mpam_ris_has_nrdy_bit(ris)) + nrdy = true; + + if (nrdy) + break; + + if (!mbwu_state) + break; + + /* + * Following the definition of the DDI0598 version, + * the value field of MPAM Memory Bandwidth Usage Monitor Register + * indicates the memory bandwidth usage in bytes per second, + * instead of the scaled count of bytes transferred since the monitor + * was last reset in the latest version (DDI0598D_b). 
+ */ + if (ris->comp->class->type == MPAM_CLASS_MEMORY) { + if (!resctrl_arch_would_mbm_overflow()) + break; + } + + /* Add any pre-overflow value to the mbwu_state->val */ + if (mbwu_overflow) + overflow_val = mpam_msmon_overflow_val(ris); + + mbwu_state->prev_val = now; + mbwu_state->correction += overflow_val; + + /* Include bandwidth consumed before the last hardware reset */ + now += mbwu_state->correction; + break; + default: + return; + } + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + + if (nrdy) { + m->err = -EBUSY; + return; + } + + *(m->val) += now; + m->err = 0; +} + +static int _msmon_read(struct mpam_component *comp, struct mon_read *arg) +{ + int err = 0, idx; + bool read_again; + u64 wait_jiffies; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + arg->ris = ris; + + msc = ris->msc; + read_again = false; +again: + mutex_lock(&msc->lock); + err = smp_call_function_any(&msc->accessibility, + __ris_msmon_read, arg, true); + mutex_unlock(&msc->lock); + + if (arg->err == -EBUSY && !read_again) { + read_again = true; + + wait_jiffies = usecs_to_jiffies(comp->class->nrdy_usec); + while (wait_jiffies) + wait_jiffies = schedule_timeout_uninterruptible(wait_jiffies); + + goto again; + } + + if (!err && arg->err) + err = arg->err; + if (err) + break; + } + srcu_read_unlock(&mpam_srcu, idx); + + return err; +} + +int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, + enum mpam_device_features type, u64 *val) +{ + struct mon_read arg; + struct mpam_props *cprops = &comp->class->props; + + might_sleep(); + + if (!mpam_is_enabled()) + return -EIO; + + if (!mpam_has_feature(type, cprops)) + return -EOPNOTSUPP; + + memset(&arg, 0, sizeof(arg)); + arg.ctx = ctx; + arg.type = type; + arg.val = val; + arg.err = 0; + *val = 0; + + return _msmon_read(comp, &arg); +} + +void mpam_msmon_reset_all_mbwu(struct mpam_component *comp) +{ + int idx, i; + 
unsigned long flags; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + if (!mpam_is_enabled()) + return; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + if (!mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + continue; + + msc = ris->msc; + spin_lock_irqsave(&msc->mon_sel_lock, flags); + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + ris->mbwu_state[i].correction = 0; + ris->mbwu_state[i].reset_on_next_read = true; + } + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + srcu_read_unlock(&mpam_srcu, idx); +} + +void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx) +{ + int idx; + unsigned long flags; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + if (!mpam_is_enabled()) + return; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + if (!mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + continue; + + msc = ris->msc; + spin_lock_irqsave(&msc->mon_sel_lock, flags); + ris->mbwu_state[ctx->mon].correction = 0; + ris->mbwu_state[ctx->mon].reset_on_next_read = true; + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + srcu_read_unlock(&mpam_srcu, idx); +} + +static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) +{ + u32 num_words, msb; + u32 bm = ~0; + int i; + + lockdep_assert_held(&msc->part_sel_lock); + + /* + * Write all ~0 to all but the last 32bit-word, which may + * have fewer bits... + */ + num_words = DIV_ROUND_UP(wd, 32); + for (i = 0; i < num_words - 1; i++, reg += sizeof(bm)) + __mpam_write_reg(msc, reg, bm); + + /* + * ....and then the last (maybe) partial 32bit word. When wd is a + * multiple of 32, msb should be 31 to write a full 32bit word. 
+ */ + msb = (wd - 1) % 32; + bm = GENMASK(msb, 0); + if (bm) + __mpam_write_reg(msc, reg, bm); +} + +static u32 mpam_cpbm_hisi_workaround(u32 cpbm, u8 cache_level) +{ + static const struct midr_range cpus[] = { + MIDR_ALL_VERSIONS(MIDR_HISI_HIP12), + { /* sentinel */ } + }; + + if (cache_level != 3 || + !is_midr_in_range_list(read_cpuid_id(), cpus)) + return cpbm; + + if (cpbm & BIT(18)) + cpbm |= (BIT(19) | BIT(20)); + + if (cpbm & BIT(17)) + cpbm |= BIT(18); + else + cpbm &= ~BIT(18); + + return cpbm; +} + +static u16 mpam_intpri_default_val(struct mpam_msc_ris *ris) +{ + struct mpam_class *class = ris->comp->class; + + switch (class->type) { + case MPAM_CLASS_MEMORY: + return resctrl_arch_get_resource(RDT_RESOURCE_MB_PRI)->default_ctrl; + + case MPAM_CLASS_CACHE: + if (class->level == 3) + return resctrl_arch_get_resource(RDT_RESOURCE_L3_PRI)->default_ctrl; + + if (class->level == 2) + return resctrl_arch_get_resource(RDT_RESOURCE_L2_PRI)->default_ctrl; + + default: + break; + } + + if (mpam_has_feature(mpam_feat_intpri_part_0_low, &ris->props)) + return GENMASK(ris->props.intpri_wd - 1, 0); + + return 0; +} + +static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, + struct mpam_config *cfg) +{ + u16 dspri; + bool limit; + u32 pri_val = 0; + u16 cmax = MPAMCFG_CMAX_CMAX; + struct mpam_msc *msc = ris->msc; + u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; + struct mpam_props *rprops = &ris->props; + + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris->ris_idx, partid, msc); + + if (mpam_has_feature(mpam_feat_partid_nrw, rprops)) + mpam_write_partsel_reg(msc, INTPARTID, + (MPAMCFG_PART_SEL_INTERNAL | partid)); + + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { + if (mpam_has_feature(mpam_feat_cpor_part, cfg)) + mpam_write_partsel_reg(msc, CPBM, + mpam_cpbm_hisi_workaround(cfg->cpbm, + ris->comp->class->level)); + else + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, + mpam_cpbm_wd_hisi_workaround( + rprops->cpbm_wd, + mpam_feat_cpor_part, + 
ris->comp->class->level)); + } + + if (mpam_has_feature(mpam_feat_ccap_part, rprops)) { + if (mpam_has_feature(mpam_feat_ccap_part, cfg)) + mpam_write_partsel_reg(msc, CMAX, cfg->cmax); + else + mpam_write_partsel_reg(msc, CMAX, cmax); + } + + if (mpam_has_feature(mpam_feat_cmin, rprops)) { + if (mpam_has_feature(mpam_feat_cmin, cfg)) + mpam_write_partsel_reg(msc, CMIN, cfg->cmin); + else + mpam_write_partsel_reg(msc, CMIN, 0); + } + + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_part, cfg)) + mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); + else + mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, + rprops->mbw_pbm_bits); + } + + if (mpam_has_feature(mpam_feat_mbw_min, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_min, cfg)) + mpam_write_partsel_reg(msc, MBW_MIN, cfg->mbw_min); + else + mpam_write_partsel_reg(msc, MBW_MIN, 0); + } + + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) + bwa_fract = cfg->mbw_max; + + if (mpam_has_feature(mpam_feat_max_limit, cfg)) + limit = cfg->max_limit; + else + limit = true; + + if (limit) + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract | + MPAMCFG_MBW_MAX_HARDLIM); + else + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + } + + if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) + mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); + + if (mpam_has_feature(mpam_feat_intpri_part, rprops)) { + if (mpam_has_feature(mpam_feat_intpri_part, cfg)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_INTPRI, cfg->intpri); + else + pri_val |= FIELD_PREP(MPAMCFG_PRI_INTPRI, + mpam_intpri_default_val(ris)); + } + + if (mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + dspri = GENMASK(rprops->dspri_wd - 1, 0); + else + dspri = 0; + + if (mpam_has_feature(mpam_feat_dspri_part, rprops)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_DSPRI, dspri); + + if (mpam_has_feature(mpam_feat_intpri_part, rprops) || + mpam_has_feature(mpam_feat_dspri_part, rprops)) + 
mpam_write_partsel_reg(msc, PRI, pri_val); + + spin_unlock(&msc->part_sel_lock); +} + +struct reprogram_ris { + struct mpam_msc_ris *ris; + struct mpam_config *cfg; +}; + +/* Call with MSC lock held */ +static int mpam_reprogram_ris(void *_arg) +{ + u16 partid, num_partid; + struct reprogram_ris *arg = _arg; + struct mpam_msc_ris *ris = arg->ris; + struct mpam_config *cfg = arg->cfg; + + if (ris->in_reset_state) + return 0; + + spin_lock(&partid_max_lock); + num_partid = resctrl_arch_get_num_closid(NULL); + spin_unlock(&partid_max_lock); + for (partid = 0; partid < num_partid; partid++) + mpam_reprogram_ris_partid(ris, partid, cfg); + + return 0; +} + +static int mpam_restore_mbwu_state(void *_ris) +{ + int i; + struct mon_read mwbu_arg = {}; + struct mpam_msc_ris *ris = _ris; + + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + if (ris->mbwu_state[i].enabled) { + mwbu_arg.ris = ris; + mwbu_arg.ctx = &ris->mbwu_state[i].cfg; + mwbu_arg.type = mpam_feat_msmon_mbwu; + + __ris_msmon_read(&mwbu_arg); + } + } + + return 0; +} + +static int mpam_save_mbwu_state(void *arg) +{ + int i; + u64 val; + struct mon_cfg *cfg; + unsigned long flags; + u32 cur_flt, cur_ctl, mon_sel; + struct mpam_msc_ris *ris = arg; + struct mpam_msc *msc = ris->msc; + struct msmon_mbwu_state *mbwu_state; + + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + mbwu_state = &ris->mbwu_state[i]; + cfg = &mbwu_state->cfg; + + spin_lock_irqsave(&msc->mon_sel_lock, flags); + mon_sel = FIELD_PREP(MSMON_CFG_MON_SEL_MON_SEL, i) | + FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); + mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + + cur_flt = mpam_read_monsel_reg(msc, CFG_MBWU_FLT); + cur_ctl = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, 0); + + if (mpam_ris_has_mbwu_long_counter(ris)) { + val = mpam_msc_read_mbwu_l(msc); + mpam_msc_zero_mbwu_l(msc); + } else { + val = mpam_read_monsel_reg(msc, MBWU); + mpam_write_monsel_reg(msc, MBWU, 0); + } + + cfg->mon = i; + 
cfg->pmg = FIELD_GET(MSMON_CFG_MBWU_FLT_PMG, cur_flt); + cfg->match_pmg = FIELD_GET(MSMON_CFG_x_CTL_MATCH_PMG, cur_ctl); + cfg->partid = FIELD_GET(MSMON_CFG_MBWU_FLT_PARTID, cur_flt); + mbwu_state->correction += val; + mbwu_state->enabled = FIELD_GET(MSMON_CFG_x_CTL_EN, cur_ctl); + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + + return 0; +} + +/* + * Called via smp_call_on_cpu() to prevent migration, while still being + * pre-emptible. + */ +static int mpam_reset_ris(void *arg) +{ + struct mpam_msc_ris *ris = arg; + struct reprogram_ris reprogram_arg; + struct mpam_config empty_cfg = { 0 }; + + if (ris->in_reset_state) + return 0; + + reprogram_arg.ris = ris; + reprogram_arg.cfg = &empty_cfg; + + mpam_reprogram_ris(&reprogram_arg); + + return 0; +} + +/* + * Get the preferred CPU for this MSC. If it is accessible from this CPU, + * this CPU is preferred. This can be preempted/migrated, it will only result + * in more work. + */ +static int mpam_get_msc_preferred_cpu(struct mpam_msc *msc) +{ + int cpu = raw_smp_processor_id(); + + if (cpumask_test_cpu(cpu, &msc->accessibility)) + return cpu; + + return cpumask_first_and(&msc->accessibility, cpu_online_mask); +} + +static int mpam_touch_msc(struct mpam_msc *msc, int (*fn)(void *a), void *arg) +{ + lockdep_assert_irqs_enabled(); + lockdep_assert_cpus_held(); + lockdep_assert_held(&msc->lock); + + return smp_call_on_cpu(mpam_get_msc_preferred_cpu(msc), fn, arg, true); +} + +static void mpam_reset_msc(struct mpam_msc *msc, bool online) +{ + int idx; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &msc->ris, msc_list) { + mpam_touch_msc(msc, &mpam_reset_ris, ris); + + /* + * Set in_reset_state when coming online. The reset state + * for non-zero partid may be lost while the CPUs are offline. 
+ */ + ris->in_reset_state = online; + + if (mpam_is_enabled() && !online) + mpam_touch_msc(msc, &mpam_save_mbwu_state, ris); + } + srcu_read_unlock(&mpam_srcu, idx); +} + +static void mpam_reprogram_msc(struct mpam_msc *msc) +{ + int idx; + u16 partid; + bool reset; + struct mpam_config *cfg; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &msc->ris, msc_list) { + if (!mpam_is_enabled() && !ris->in_reset_state) { + mpam_touch_msc(msc, &mpam_reset_ris, ris); + ris->in_reset_state = true; + continue; + } + + reset = true; + for (partid = 0; partid < resctrl_arch_get_num_closid(NULL); partid++) { + cfg = &ris->comp->cfg[partid]; + if (cfg->features) + reset = false; + + mpam_reprogram_ris_partid(ris, partid, cfg); + } + ris->in_reset_state = reset; + + if (mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + mpam_touch_msc(msc, &mpam_restore_mbwu_state, ris); + } + srcu_read_unlock(&mpam_srcu, idx); +} + +static void _enable_percpu_irq(void *_irq) +{ + int *irq = _irq; + + enable_percpu_irq(*irq, IRQ_TYPE_NONE); +} + +static int mpam_cpu_online(unsigned int cpu) +{ + int idx; + struct mpam_msc *msc; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + mutex_lock(&msc->lock); + if (msc->reenable_error_ppi) + _enable_percpu_irq(&msc->reenable_error_ppi); + + if (atomic_fetch_inc(&msc->online_refs) == 0) + mpam_reprogram_msc(msc); + mutex_unlock(&msc->lock); + } + srcu_read_unlock(&mpam_srcu, idx); + + if (mpam_is_enabled()) + mpam_resctrl_online_cpu(cpu); + + return 0; +} + +/* Before mpam is enabled, try to probe new MSC */ +static int mpam_discovery_cpu_online(unsigned int cpu) +{ + int err = 0; + struct mpam_msc *msc; + bool new_device_probed = false; + + if (mpam_is_enabled()) + return 0; + + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, 
glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + mutex_lock(&msc->lock); + if (!msc->probed) + err = mpam_msc_hw_probe(msc); + mutex_unlock(&msc->lock); + + if (!err) + new_device_probed = true; + else + break; /* mpam_broken */ + } + mutex_unlock(&mpam_list_lock); + + if (new_device_probed && !err) + schedule_work(&mpam_enable_work); + + if (err < 0) + return err; + + return mpam_cpu_online(cpu); +} + +static int mpam_cpu_offline(unsigned int cpu) +{ + int ret, idx; + struct mpam_msc *msc; + + ret = mpam_resctrl_prepare_offline(); + if (ret) + return ret; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + mutex_lock(&msc->lock); + if (msc->reenable_error_ppi) + disable_percpu_irq(msc->reenable_error_ppi); + + if (atomic_dec_and_test(&msc->online_refs)) + mpam_reset_msc(msc, false); + mutex_unlock(&msc->lock); + } + srcu_read_unlock(&mpam_srcu, idx); + + if (mpam_is_enabled()) + mpam_resctrl_offline_cpu(cpu); + + return 0; +} + +static void mpam_register_cpuhp_callbacks(int (*online)(unsigned int online)) +{ + mutex_lock(&mpam_cpuhp_state_lock); + mpam_cpuhp_state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mpam:online", + online, mpam_cpu_offline); + if (mpam_cpuhp_state <= 0) { + pr_err("Failed to register cpuhp callbacks\n"); + mpam_cpuhp_state = 0; + } + mutex_unlock(&mpam_cpuhp_state_lock); +} + +static int __setup_ppi(struct mpam_msc *msc) +{ + int cpu; + + msc->error_dev_id = alloc_percpu_gfp(struct mpam_msc *, GFP_KERNEL); + if (!msc->error_dev_id) + return -ENOMEM; + + for_each_cpu(cpu, &msc->accessibility) { + struct mpam_msc *empty = *per_cpu_ptr(msc->error_dev_id, cpu); + + if (empty != NULL) { + pr_err_once("%s shares PPI with %s!\n", dev_name(&msc->pdev->dev), + dev_name(&empty->pdev->dev)); + return -EBUSY; + } + *per_cpu_ptr(msc->error_dev_id, cpu) = msc; + } + + return 0; +} + +static int 
mpam_msc_setup_error_irq(struct mpam_msc *msc) +{ + int irq; + + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + return 0; + + /* Allocate and initialise the percpu device pointer for PPI */ + if (irq_is_percpu(irq)) + return __setup_ppi(msc); + + /* sanity check: shared interrupts can be routed anywhere? */ + if (!cpumask_equal(&msc->accessibility, cpu_possible_mask)) { + pr_err_once("msc:%u is a private resource with a shared error interrupt\n", + msc->id); + return -EINVAL; + } + + return 0; +} + +static int mpam_dt_count_msc(void) +{ + int count = 0; + struct device_node *np; + + for_each_compatible_node(np, NULL, "arm,mpam-msc") + count++; + + return count; +} + +static int mpam_dt_parse_resource(struct mpam_msc *msc, struct device_node *np, + u32 ris_idx) +{ + int err = 0; + u32 level = 0; + unsigned long cache_id; + struct device_node *cache; + + do { + if (of_device_is_compatible(np, "arm,mpam-cache")) { + cache = of_parse_phandle(np, "arm,mpam-device", 0); + if (!cache) { + pr_err("Failed to read phandle\n"); + break; + } + } else if (of_device_is_compatible(np->parent, "cache")) { + cache = np->parent; + } else { + /* For now, only caches are supported */ + cache = NULL; + break; + } + + err = of_property_read_u32(cache, "cache-level", &level); + if (err) { + pr_err("Failed to read cache-level\n"); + break; + } + + cache_id = cache_of_get_id(cache); + if (cache_id == ~0UL) { + err = -ENOENT; + break; + } + + err = mpam_ris_create(msc, ris_idx, MPAM_CLASS_CACHE, level, + cache_id); + } while (0); + of_node_put(cache); + + return err; +} + + +static int mpam_dt_parse_resources(struct mpam_msc *msc, void *ignored) +{ + int err = 0, num_ris = 0; + const u32 *ris_idx_p; + struct device_node *iter, *np; + + np = msc->pdev->dev.of_node; + for_each_child_of_node(np, iter) { + ris_idx_p = of_get_property(iter, "reg", NULL); + if (ris_idx_p) { + num_ris++; + err = mpam_dt_parse_resource(msc, iter, *ris_idx_p); + if (err) { + 
of_node_put(iter); + return err; + } + } + } + + if (!num_ris) + mpam_dt_parse_resource(msc, np, 0); + + return err; +} + +static int get_msc_affinity(struct mpam_msc *msc) +{ + struct device_node *parent; + u32 affinity_id; + int err; + + if (!acpi_disabled) { + err = device_property_read_u32(&msc->pdev->dev, "cpu_affinity", + &affinity_id); + if (err) { + cpumask_copy(&msc->accessibility, cpu_possible_mask); + err = 0; + } else { + err = acpi_pptt_get_cpus_from_container(affinity_id, + &msc->accessibility); + } + + return err; + } + + /* This depends on the path to of_node */ + parent = of_get_parent(msc->pdev->dev.of_node); + if (parent == of_root) { + cpumask_copy(&msc->accessibility, cpu_possible_mask); + err = 0; + } else { + if (of_device_is_compatible(parent, "cache")) { + err = get_cpumask_from_cache(parent, + &msc->accessibility); + } else { + err = -EINVAL; + pr_err("Cannot determine accessibility of MSC: %s\n", + dev_name(&msc->pdev->dev)); + } + } + of_node_put(parent); + + return err; +} + +static int fw_num_msc; + +static void mpam_pcc_rx_callback(struct mbox_client *cl, void *msg) +{ + /* TODO: wake up tasks blocked on this MSC's PCC channel */ +} + +static int mpam_msc_drv_probe(struct platform_device *pdev) +{ + int err; + pgprot_t prot; + void __iomem *io; + struct mpam_msc *msc; + struct resource *msc_res; + void *plat_data = pdev->dev.platform_data; + + mutex_lock(&mpam_list_lock); + do { + msc = devm_kzalloc(&pdev->dev, sizeof(*msc), GFP_KERNEL); + if (!msc) { + err = -ENOMEM; + break; + } + + INIT_LIST_HEAD_RCU(&msc->glbl_list); + msc->pdev = pdev; + + err = device_property_read_u32(&pdev->dev, "arm,not-ready-us", + &msc->nrdy_usec); + if (err) { + /* This will prevent CSU monitors being usable */ + msc->nrdy_usec = 0; + } + + err = get_msc_affinity(msc); + if (err) + break; + if (cpumask_empty(&msc->accessibility)) { + pr_err_once("msc:%u is not accessible from any CPU!\n", + msc->id); + err = -EINVAL; + break; + } + + mutex_init(&msc->lock); 
+ msc->id = mpam_num_msc++; + INIT_LIST_HEAD_RCU(&msc->ris); + spin_lock_init(&msc->part_sel_lock); + spin_lock_init(&msc->mon_sel_lock); + + err = mpam_msc_setup_error_irq(msc); + if (err) { + devm_kfree(&pdev->dev, msc); + msc = ERR_PTR(err); + break; + } + + if (device_property_read_u32(&pdev->dev, "pcc-channel", + &msc->pcc_subspace_id)) + msc->iface = MPAM_IFACE_MMIO; + else + msc->iface = MPAM_IFACE_PCC; + + if (msc->iface == MPAM_IFACE_MMIO) { + io = devm_platform_get_and_ioremap_resource(pdev, 0, + &msc_res); + if (IS_ERR(io)) { + pr_err("Failed to map MSC base address\n"); + devm_kfree(&pdev->dev, msc); + err = PTR_ERR(io); + break; + } + msc->mapped_hwpage_sz = msc_res->end - msc_res->start; + msc->mapped_hwpage = io; + } else if (msc->iface == MPAM_IFACE_PCC) { + msc->pcc_cl.dev = &pdev->dev; + msc->pcc_cl.rx_callback = mpam_pcc_rx_callback; + msc->pcc_cl.tx_block = false; + msc->pcc_cl.tx_tout = 1000; /* 1s */ + msc->pcc_cl.knows_txdone = false; + + msc->pcc_chan = pcc_mbox_request_channel(&msc->pcc_cl, + msc->pcc_subspace_id); + if (IS_ERR(msc->pcc_chan)) { + pr_err("Failed to request MSC PCC channel\n"); + devm_kfree(&pdev->dev, msc); + err = PTR_ERR(msc->pcc_chan); + break; + } + + prot = __acpi_get_mem_attribute(msc->pcc_chan->shmem_base_addr); + io = ioremap_prot(msc->pcc_chan->shmem_base_addr, + msc->pcc_chan->shmem_size, pgprot_val(prot)); + if (IS_ERR(io)) { + pr_err("Failed to map MSC base address\n"); + pcc_mbox_free_channel(msc->pcc_chan); + devm_kfree(&pdev->dev, msc); + err = PTR_ERR(io); + break; + } + + /* TODO: issue a read to update the registers */ + + msc->mapped_hwpage_sz = msc->pcc_chan->shmem_size; + msc->mapped_hwpage = io + sizeof(struct acpi_pcct_shared_memory); + } + + list_add_rcu(&msc->glbl_list, &mpam_all_msc); + platform_set_drvdata(pdev, msc); + } while (0); + mutex_unlock(&mpam_list_lock); + + if (!err) { + /* Create RIS entries described by firmware */ + if (!acpi_disabled) + err = acpi_mpam_parse_resources(msc, 
plat_data); + else + err = mpam_dt_parse_resources(msc, plat_data); + } + + if (!err && fw_num_msc == mpam_num_msc) + mpam_register_cpuhp_callbacks(&mpam_discovery_cpu_online); + + return err; +} + +/* + * If a resource doesn't match class feature/configuration, do the right thing. + * For 'num' properties we can just take the minimum. + * For properties where the mismatched unused bits would make a difference, we + * nobble the class feature, as we can't configure all the resources. + * e.g. The L3 cache is composed of two resources with 13 and 17 portion + * bitmaps respectively. + */ +static void +__resource_props_mismatch(struct mpam_msc_ris *ris, struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + struct mpam_props *rprops = &ris->props; + + lockdep_assert_held(&mpam_list_lock); /* we modify class */ + + /* Clear missing features */ + cprops->features &= rprops->features; + + /* Set cpbm_wd with the min cpbm_wd among all cache msc */ + if (cprops->cpbm_wd != rprops->cpbm_wd) + cprops->cpbm_wd = min(cprops->cpbm_wd, rprops->cpbm_wd); + + if (cprops->mbw_pbm_bits != rprops->mbw_pbm_bits) + mpam_clear_feature(mpam_feat_mbw_part, &cprops->features); + + /* bwa_wd is a count of bits, fewer bits means less precision */ + if (cprops->bwa_wd != rprops->bwa_wd) + cprops->bwa_wd = min(cprops->bwa_wd, rprops->bwa_wd); + + /* For num properties, take the minimum */ + if (cprops->num_csu_mon != rprops->num_csu_mon) + cprops->num_csu_mon = min(cprops->num_csu_mon, rprops->num_csu_mon); + if (cprops->num_mbwu_mon != rprops->num_mbwu_mon) + cprops->num_mbwu_mon = min(cprops->num_mbwu_mon, rprops->num_mbwu_mon); + + if (cprops->intpri_wd != rprops->intpri_wd) + cprops->intpri_wd = min(cprops->intpri_wd, rprops->intpri_wd); + if (cprops->dspri_wd != rprops->dspri_wd) + cprops->dspri_wd = min(cprops->dspri_wd, rprops->dspri_wd); + + /* {int,ds}pri may not have differing 0-low behaviour */ + if (mpam_has_feature(mpam_feat_intpri_part_0_low, cprops) != + 
mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) + mpam_clear_feature(mpam_feat_intpri_part, &cprops->features); + if (mpam_has_feature(mpam_feat_dspri_part_0_low, cprops) != + mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + mpam_clear_feature(mpam_feat_dspri_part, &cprops->features); +} + +/* + * Copy the first component's first resource's properties and features to the + * class. __resource_props_mismatch() will remove conflicts. + * It is not possible to have a class with no components, or a component with + * no resources. + */ +static void mpam_enable_init_class_features(struct mpam_class *class) +{ + struct mpam_msc_ris *ris; + struct mpam_component *comp; + + list_for_each_entry(comp, &class->components, class_list) { + list_for_each_entry(ris, &comp->ris, comp_list) { + if (ris->msc->probed) + break; + } + } + + if (WARN_ON(!comp) || WARN_ON(!ris)) + return; + + class->props = ris->props; +} + +/* Merge all the common resource features into class. */ +static void mpam_enable_merge_features(void) +{ + struct mpam_msc_ris *ris; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + mpam_enable_init_class_features(class); + + list_for_each_entry(comp, &class->components, class_list) { + list_for_each_entry(ris, &comp->ris, comp_list) { + if (!ris->msc->probed) + continue; + + __resource_props_mismatch(ris, class); + + class->nrdy_usec = max(class->nrdy_usec, + ris->msc->nrdy_usec); + } + } + } +} + +static char *mpam_errcode_names[16] = { + [0] = "No error", + [1] = "PARTID_SEL_Range", + [2] = "Req_PARTID_Range", + [3] = "MSMONCFG_ID_RANGE", + [4] = "Req_PMG_Range", + [5] = "Monitor_Range", + [6] = "intPARTID_Range", + [7] = "Unexpected_INTERNAL", + [8] = "Undefined_RIS_PART_SEL", + [9] = "RIS_No_Control", + [10] = "Undefined_RIS_MON_SEL", + [11] = "RIS_No_Monitor", + [12 ... 
15] = "Reserved" +}; + +static int mpam_enable_msc_ecr(void *_msc) +{ + struct mpam_msc *msc = _msc; + + writel_relaxed(1, msc->mapped_hwpage + MPAMF_ECR); + + return 0; +} + +static int mpam_disable_msc_ecr(void *_msc) +{ + struct mpam_msc *msc = _msc; + + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ECR); + + return 0; +} + +static irqreturn_t __mpam_irq_handler(int irq, struct mpam_msc *msc) +{ + u64 reg; + u16 partid; + u8 errcode, pmg, ris; + + if (WARN_ON_ONCE(!msc) || + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), + &msc->accessibility))) + return IRQ_NONE; + + reg = mpam_msc_read_esr(msc); + + errcode = FIELD_GET(MPAMF_ESR_ERRCODE, reg); + if (!errcode) + return IRQ_NONE; + + /* Clear level triggered irq */ + mpam_msc_zero_esr(msc); + + partid = FIELD_GET(MPAMF_ESR_PARTID_OR_MON, reg); + pmg = FIELD_GET(MPAMF_ESR_PMG, reg); + ris = FIELD_GET(MPAMF_ESR_RIS, reg); + + pr_err("error irq from msc:%u '%s', partid:%u, pmg: %u, ris: %u\n", + msc->id, mpam_errcode_names[errcode], partid, pmg, ris); + + if (irq_is_percpu(irq)) { + mpam_disable_msc_ecr(msc); + schedule_work(&mpam_broken_work); + return IRQ_HANDLED; + } + + return IRQ_WAKE_THREAD; +} + +static irqreturn_t mpam_ppi_handler(int irq, void *dev_id) +{ + struct mpam_msc *msc = *(struct mpam_msc **)dev_id; + + return __mpam_irq_handler(irq, msc); +} + +static irqreturn_t mpam_spi_handler(int irq, void *dev_id) +{ + struct mpam_msc *msc = dev_id; + + return __mpam_irq_handler(irq, msc); +} + +static irqreturn_t mpam_disable_thread(int irq, void *dev_id); + +static int mpam_register_irqs(void) +{ + int err, irq; + struct mpam_msc *msc; + + lockdep_assert_cpus_held(); + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + continue; + + /* The MPAM spec says the interrupt can be SPI, PPI or LPI */ + /* We anticipate sharing the interrupt with other MSCs */ + if (irq_is_percpu(irq)) { + 
err = request_percpu_irq(irq, &mpam_ppi_handler, + "mpam:msc:error", + msc->error_dev_id); + if (err) + return err; + + mutex_lock(&msc->lock); + msc->reenable_error_ppi = irq; + smp_call_function_many(&msc->accessibility, + &_enable_percpu_irq, &irq, + true); + mutex_unlock(&msc->lock); + } else { + err = devm_request_threaded_irq(&msc->pdev->dev, irq, + &mpam_spi_handler, + &mpam_disable_thread, + IRQF_SHARED, + "mpam:msc:error", msc); + if (err) + return err; + } + + mutex_lock(&msc->lock); + msc->error_irq_requested = true; + mpam_touch_msc(msc, mpam_enable_msc_ecr, msc); + msc->error_irq_hw_enabled = true; + mutex_unlock(&msc->lock); + } + + return 0; +} + +static void mpam_unregister_irqs(void) +{ + int irq; + struct mpam_msc *msc; + + cpus_read_lock(); + /* take the lock as free_irq() can sleep */ + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + continue; + + mutex_lock(&msc->lock); + if (msc->error_irq_hw_enabled) { + mpam_touch_msc(msc, mpam_disable_msc_ecr, msc); + msc->error_irq_hw_enabled = false; + } + + if (msc->error_irq_requested) { + if (irq_is_percpu(irq)) { + msc->reenable_error_ppi = 0; + free_percpu_irq(irq, msc->error_dev_id); + } else { + devm_free_irq(&msc->pdev->dev, irq, msc); + } + msc->error_irq_requested = false; + } + mutex_unlock(&msc->lock); + } + mutex_unlock(&mpam_list_lock); + cpus_read_unlock(); +} + +static void __destroy_component_cfg(struct mpam_component *comp) +{ + unsigned long flags; + struct mpam_msc_ris *ris; + struct msmon_mbwu_state *mbwu_state; + + kfree(comp->cfg); + list_for_each_entry(ris, &comp->ris, comp_list) { + mutex_lock(&ris->msc->lock); + spin_lock_irqsave(&ris->msc->mon_sel_lock, flags); + mbwu_state = ris->mbwu_state; + ris->mbwu_state = NULL; + spin_unlock_irqrestore(&ris->msc->mon_sel_lock, flags); + mutex_unlock(&ris->msc->lock); + + kfree(mbwu_state); + } +} + +static int 
__allocate_component_cfg(struct mpam_component *comp) +{ + unsigned long flags; + struct mpam_msc_ris *ris; + struct msmon_mbwu_state *mbwu_state; + + if (comp->cfg) + return 0; + + comp->cfg = kcalloc(resctrl_arch_get_num_closid(NULL), + sizeof(*comp->cfg), GFP_KERNEL); + if (!comp->cfg) + return -ENOMEM; + + list_for_each_entry(ris, &comp->ris, comp_list) { + if (!ris->props.num_mbwu_mon) + continue; + + mbwu_state = kcalloc(ris->props.num_mbwu_mon, + sizeof(*ris->mbwu_state), GFP_KERNEL); + if (!mbwu_state) { + __destroy_component_cfg(comp); + return -ENOMEM; + } + + mutex_lock(&ris->msc->lock); + spin_lock_irqsave(&ris->msc->mon_sel_lock, flags); + ris->mbwu_state = mbwu_state; + spin_unlock_irqrestore(&ris->msc->mon_sel_lock, flags); + mutex_unlock(&ris->msc->lock); + } + + return 0; +} + +static int mpam_allocate_config(void) +{ + int err = 0; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + list_for_each_entry(comp, &class->components, class_list) { + err = __allocate_component_cfg(comp); + if (err) + return err; + } + } + + return 0; +} + +static void mpam_enable_once(void) +{ + int err; + + /* + * If all the MSC have been probed, enabling the IRQs happens next. 
+ * That involves cross-calling to a CPU that can reach the MSC, and + * the locks must be taken in this order: + */ + cpus_read_lock(); + mutex_lock(&mpam_list_lock); + do { + mpam_enable_merge_features(); + + err = mpam_allocate_config(); + if (err) { + pr_err("Failed to allocate configuration arrays.\n"); + break; + } + + err = mpam_register_irqs(); + if (err) { + pr_warn("Failed to register irqs: %d\n", err); + break; + } + } while (0); + mutex_unlock(&mpam_list_lock); + cpus_read_unlock(); + + if (!err) { + err = mpam_resctrl_setup(); + if (err) + pr_err("Failed to initialise resctrl: %d\n", err); + } + + if (err) { + schedule_work(&mpam_broken_work); + return; + } + + mutex_lock(&mpam_cpuhp_state_lock); + cpuhp_remove_state(mpam_cpuhp_state); + mpam_cpuhp_state = 0; + mutex_unlock(&mpam_cpuhp_state_lock); + + /* + * Once the cpuhp callbacks have been changed, mpam_partid_max can no + * longer change. + */ + spin_lock(&partid_max_lock); + partid_max_published = true; + spin_unlock(&partid_max_lock); + + static_branch_enable(&mpam_enabled); + mpam_register_cpuhp_callbacks(mpam_cpu_online); + + pr_info("MPAM enabled with %u partid and %u pmg\n", + READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); +} + +void mpam_reset_class(struct mpam_class *class) +{ + int idx; + struct mpam_msc_ris *ris; + struct mpam_component *comp; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(comp, &class->components, class_list) { + memset(comp->cfg, 0, resctrl_arch_get_num_closid(NULL) * sizeof(*comp->cfg)); + + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + mutex_lock(&ris->msc->lock); + mpam_touch_msc(ris->msc, mpam_reset_ris, ris); + mutex_unlock(&ris->msc->lock); + ris->in_reset_state = true; + } + } + srcu_read_unlock(&mpam_srcu, idx); +} + +/* + * Called in response to an error IRQ. + * All of MPAM's errors indicate a software bug; restore any modified + * controls to their reset values. 
+ */ +static irqreturn_t mpam_disable_thread(int irq, void *dev_id) +{ + int idx; + struct mpam_class *class; + + mutex_lock(&mpam_cpuhp_state_lock); + if (mpam_cpuhp_state) { + cpuhp_remove_state(mpam_cpuhp_state); + mpam_cpuhp_state = 0; + } + mutex_unlock(&mpam_cpuhp_state_lock); + + mpam_resctrl_exit(); + + static_branch_disable(&mpam_enabled); + + mpam_unregister_irqs(); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) + mpam_reset_class(class); + srcu_read_unlock(&mpam_srcu, idx); + + return IRQ_HANDLED; +} + +void mpam_disable(struct work_struct *ignored) +{ + mpam_disable_thread(0, NULL); +} + +/* + * Enable mpam once all devices have been probed. + * Scheduled by mpam_discovery_cpu_online() once all devices have been created. + * Also scheduled when new devices are probed when new CPUs come online. + */ +void mpam_enable(struct work_struct *work) +{ + cpumask_t mask; + static atomic_t once; + struct mpam_msc *msc; + bool all_devices_probed = true; + + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + mutex_lock(&msc->lock); + if (!msc->probed) { + cpumask_and(&mask, &msc->accessibility, cpu_online_mask); + if (!cpumask_empty(&mask)) + all_devices_probed = false; + } + mutex_unlock(&msc->lock); + + if (!all_devices_probed) + break; + } + mutex_unlock(&mpam_list_lock); + + if (all_devices_probed && !atomic_fetch_inc(&once)) + mpam_enable_once(); +} + +static int mpam_msc_drv_remove(struct platform_device *pdev) +{ + struct mpam_msc *msc = platform_get_drvdata(pdev); + + if (!msc) + return 0; + + mutex_lock(&mpam_list_lock); + mpam_num_msc--; + platform_set_drvdata(pdev, NULL); + list_del_rcu(&msc->glbl_list); + mpam_msc_destroy(msc); + synchronize_srcu(&mpam_srcu); + mutex_unlock(&mpam_list_lock); + + return 0; +} + +struct mpam_write_config_arg { + struct mpam_msc_ris *ris; + struct mpam_component *comp; + u16 partid; +}; + +static int __write_config(void *arg) +{ + 
struct mpam_write_config_arg *c = arg; + + mpam_reprogram_ris_partid(c->ris, c->partid, &c->comp->cfg[c->partid]); + + return 0; +} + +/* TODO: split into write_config/sync_config */ +/* TODO: add config_dirty bitmap to drive sync_config */ +int mpam_apply_config(struct mpam_component *comp, u16 partid, + struct mpam_config *cfg) +{ + struct mpam_write_config_arg arg; + struct mpam_msc_ris *ris; + int idx; + + lockdep_assert_cpus_held(); + + if (!memcmp(&comp->cfg[partid], cfg, sizeof(*cfg))) + return 0; + + comp->cfg[partid] = *cfg; + arg.comp = comp; + arg.partid = partid; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + arg.ris = ris; + mutex_lock(&ris->msc->lock); + mpam_touch_msc(ris->msc, __write_config, &arg); + mutex_unlock(&ris->msc->lock); + } + srcu_read_unlock(&mpam_srcu, idx); + + return 0; +} + +static const struct of_device_id mpam_of_match[] = { + { .compatible = "arm,mpam-msc", }, + {}, +}; +MODULE_DEVICE_TABLE(of, mpam_of_match); + +static struct platform_driver mpam_msc_driver = { + .driver = { + .name = "mpam_msc", + .of_match_table = of_match_ptr(mpam_of_match), + }, + .probe = mpam_msc_drv_probe, + .remove = mpam_msc_drv_remove, +}; + +/* + * MSC that are hidden under caches are not created as platform devices + * as there is no cache driver. Caches are also special-cased in + * get_msc_affinity(). + */ +static void mpam_dt_create_foundling_msc(void) +{ + int err; + struct device_node *cache; + + for_each_compatible_node(cache, NULL, "cache") { + err = of_platform_populate(cache, mpam_of_match, NULL, NULL); + if (err) + pr_err("Failed to create MSC devices under caches\n"); + } +} + +static int __init mpam_msc_driver_init(void) +{ + bool mpam_not_available = false; + + if (!mpam_cpus_have_feature()) + return -EOPNOTSUPP; + + init_srcu_struct(&mpam_srcu); + + /* + * If the MPAM CPU interface is not implemented, or reserved by + * firmware, there is no point touching the rest of the hardware. 
+ */ + spin_lock(&partid_max_lock); + if (!partid_max_init || (!mpam_partid_max && !mpam_pmg_max)) + mpam_not_available = true; + spin_unlock(&partid_max_lock); + + if (mpam_not_available) + return 0; + + if (!acpi_disabled) + fw_num_msc = acpi_mpam_count_msc(); + else + fw_num_msc = mpam_dt_count_msc(); + + if (fw_num_msc <= 0) { + pr_err("No MSC devices found in firmware\n"); + return -EINVAL; + } + + if (acpi_disabled) + mpam_dt_create_foundling_msc(); + + return platform_driver_register(&mpam_msc_driver); +} +/* Must occur after arm64_mpam_register_cpus() from arch_initcall() */ +subsys_initcall(mpam_msc_driver_init); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h new file mode 100644 index 0000000000000000000000000000000000000000..42b6e4e1a6eb6073638baa66298f1b76422b8426 --- /dev/null +++ b/drivers/platform/mpam/mpam_internal.h @@ -0,0 +1,593 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. */ + +#ifndef MPAM_INTERNAL_H +#define MPAM_INTERNAL_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +DECLARE_STATIC_KEY_FALSE(mpam_enabled); + +/* Value to indicate the allocated monitor is derived from the RMID index. */ +#define USE_RMID_IDX (U16_MAX + 1) + +/* + * Only these event configuration bits are supported. MPAM can't know if + * data is being written back, these will show up as a write. 
+ */ +#define MPAM_RESTRL_EVT_CONFIG_VALID (READS_TO_LOCAL_MEM | NON_TEMP_WRITE_TO_LOCAL_MEM) + +static inline bool mpam_is_enabled(void) +{ + return static_branch_likely(&mpam_enabled); +} + +struct mpam_msc { + /* member of mpam_all_msc */ + struct list_head glbl_list; + + int id; + struct platform_device *pdev; + + /* Not modified after mpam_is_enabled() becomes true */ + enum mpam_msc_iface iface; + u32 pcc_subspace_id; + struct mbox_client pcc_cl; + struct pcc_mbox_chan *pcc_chan; + u32 nrdy_usec; + cpumask_t accessibility; + bool has_extd_esr; + + int reenable_error_ppi; + struct mpam_msc * __percpu *error_dev_id; + + atomic_t online_refs; + + struct mutex lock; + bool probed; + bool error_irq_requested; + bool error_irq_hw_enabled; + u16 partid_max; + u8 pmg_max; + unsigned long ris_idxs[128 / BITS_PER_LONG]; + u32 ris_max; + + /* mpam_msc_ris of this component */ + struct list_head ris; + + /* + * part_sel_lock protects access to the MSC hardware registers that are + * affected by MPAMCFG_PART_SEL. (including the ID registers) + * If needed, take msc->lock first. + */ + spinlock_t part_sel_lock; + spinlock_t mon_sel_lock; + void __iomem *mapped_hwpage; + size_t mapped_hwpage_sz; +}; + +/* + * When we compact the supported features, we don't care what they are. + * Storing them as a bitmap makes life easy. + */ +typedef u32 mpam_features_t; + +/* Bits for mpam_features_t */ +enum mpam_device_features { + mpam_feat_ccap_part = 0, + mpam_feat_cpor_part, + mpam_feat_cmin, + mpam_feat_mbw_part, + mpam_feat_mbw_min, + mpam_feat_mbw_max, + mpam_feat_max_limit, + mpam_feat_mbw_prop, + mpam_feat_intpri_part, + mpam_feat_intpri_part_0_low, + mpam_feat_dspri_part, + mpam_feat_dspri_part_0_low, + mpam_feat_msmon, + mpam_feat_msmon_csu, + mpam_feat_msmon_csu_capture, + /* + * Having mpam_feat_msmon_mbwu set doesn't mean the regular 31 bit MBWU + * counter would be used. 
The exact counter used is decided based on the + * status of mpam_feat_msmon_mbwu_l/mpam_feat_msmon_mbwu_lwd as well. + */ + mpam_feat_msmon_mbwu, + mpam_feat_msmon_mbwu_44counter, + mpam_feat_msmon_mbwu_63counter, + mpam_feat_msmon_mbwu_capture, + mpam_feat_msmon_mbwu_rwbw, + mpam_feat_msmon_capt, + mpam_feat_partid_nrw, + MPAM_FEATURE_LAST, +}; +#define MPAM_ALL_FEATURES ((1<features) +#define mpam_set_feature(_feat, x) ((x)->features |= (1<<_feat)) + +static inline void mpam_clear_feature(enum mpam_device_features feat, + mpam_features_t *supported) +{ + *supported &= ~(1<lock. + * Changes to reset_on_next_read, prev_val and correction are protected by the + * msc's mon_sel_lock. + */ +struct msmon_mbwu_state { + bool enabled; + bool reset_on_next_read; + struct mon_cfg cfg; + + /* The value last read from the hardware. Used to detect overflow. */ + u64 prev_val; + + /* + * The value to add to the new reading to account for power management, + * and shifts to trigger the overflow interrupt. 
+ */ + u64 correction; +}; + +struct mpam_msc_ris { + u8 ris_idx; + u64 idr; + struct mpam_props props; + bool in_reset_state; + + cpumask_t affinity; + + /* member of mpam_component:ris */ + struct list_head comp_list; + + /* member of mpam_msc:ris */ + struct list_head msc_list; + + /* parents: */ + struct mpam_msc *msc; + struct mpam_component *comp; + + /* msmon mbwu configuration is preserved over reset */ + struct msmon_mbwu_state *mbwu_state; +}; + +struct mpam_resctrl_dom { + struct mpam_component *comp; + struct rdt_domain resctrl_dom; + + u32 mbm_local_evt_cfg; +}; + +struct mpam_resctrl_res { + struct mpam_class *class; + struct rdt_resource resctrl_res; +}; + +static inline int mpam_alloc_csu_mon(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return -EOPNOTSUPP; + + return ida_alloc_range(&class->ida_csu_mon, 0, cprops->num_csu_mon - 1, + GFP_KERNEL); +} + +static inline void mpam_free_csu_mon(struct mpam_class *class, int csu_mon) +{ + ida_free(&class->ida_csu_mon, csu_mon); +} + +static inline int mpam_alloc_mbwu_mon(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return -EOPNOTSUPP; + + return ida_alloc_range(&class->ida_mbwu_mon, 0, + cprops->num_mbwu_mon - 1, GFP_KERNEL); +} + +static inline void mpam_free_mbwu_mon(struct mpam_class *class, int mbwu_mon) +{ + ida_free(&class->ida_mbwu_mon, mbwu_mon); +} + +/* List of all classes */ +extern struct list_head mpam_classes; +extern struct srcu_struct mpam_srcu; + +/* System wide partid/pmg values */ +extern u16 mpam_partid_max; +extern u8 mpam_pmg_max; + +/* Scheduled work callback to enable mpam once all MSC have been probed */ +void mpam_enable(struct work_struct *work); +void mpam_disable(struct work_struct *work); + +void mpam_reset_class(struct mpam_class *class); + +int mpam_apply_config(struct mpam_component *comp, u16 partid, + 
struct mpam_config *cfg); + +int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, + enum mpam_device_features, u64 *val); +void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx); +void mpam_msmon_reset_all_mbwu(struct mpam_component *comp); + +int mpam_resctrl_online_cpu(unsigned int cpu); +int mpam_resctrl_offline_cpu(unsigned int cpu); + +int mpam_resctrl_setup(void); +void mpam_resctrl_exit(void); + +u16 mpam_cpbm_wd_hisi_workaround(u16 cpbm_wd, enum mpam_device_features feat, u8 cache_level); + +/* + * MPAM MSCs have the following register layout. See: + * Arm Architecture Reference Manual Supplement - Memory System Resource + * Partitioning and Monitoring (MPAM), for Armv8-A. DDI 0598A.a + */ +#define MPAM_ARCHITECTURE_V1 0x10 + +/* Memory mapped control pages: */ +/* ID Register offsets in the memory mapped page */ +#define MPAMF_IDR 0x0000 /* features id register */ +#define MPAMF_MSMON_IDR 0x0080 /* performance monitoring features */ +#define MPAMF_IMPL_IDR 0x0028 /* imp-def partitioning */ +#define MPAMF_CPOR_IDR 0x0030 /* cache-portion partitioning */ +#define MPAMF_CCAP_IDR 0x0038 /* cache-capacity partitioning */ +#define MPAMF_MBW_IDR 0x0040 /* mem-bw partitioning */ +#define MPAMF_PRI_IDR 0x0048 /* priority partitioning */ +#define MPAMF_CSUMON_IDR 0x0088 /* cache-usage monitor */ +#define MPAMF_MBWUMON_IDR 0x0090 /* mem-bw usage monitor */ +#define MPAMF_PARTID_NRW_IDR 0x0050 /* partid-narrowing */ +#define MPAMF_IIDR 0x0018 /* implementer id register */ +#define MPAMF_AIDR 0x0020 /* architectural id register */ + +/* Configuration and Status Register offsets in the memory mapped page */ +#define MPAMCFG_PART_SEL 0x0100 /* partid to configure: */ +#define MPAMCFG_CPBM 0x1000 /* cache-portion config */ +#define MPAMCFG_CMAX 0x0108 /* cache-capacity config */ +#define MPAMCFG_CMIN 0x0110 /* cache-capacity min config */ +#define MPAMCFG_MBW_MIN 0x0200 /* min mem-bw config */ +#define MPAMCFG_MBW_MAX 0x0208 /* max 
mem-bw config */ +#define MPAMCFG_MBW_WINWD 0x0220 /* mem-bw accounting window config */ +#define MPAMCFG_MBW_PBM 0x2000 /* mem-bw portion bitmap config */ +#define MPAMCFG_PRI 0x0400 /* priority partitioning config */ +#define MPAMCFG_MBW_PROP 0x0500 /* mem-bw stride config */ +#define MPAMCFG_INTPARTID 0x0600 /* partid-narrowing config */ + +#define MSMON_CFG_MON_SEL 0x0800 /* monitor selector */ +#define MSMON_CFG_CSU_FLT 0x0810 /* cache-usage monitor filter */ +#define MSMON_CFG_CSU_CTL 0x0818 /* cache-usage monitor config */ +#define MSMON_CFG_MBWU_FLT 0x0820 /* mem-bw monitor filter */ +#define MSMON_CFG_MBWU_CTL 0x0828 /* mem-bw monitor config */ +#define MSMON_CSU 0x0840 /* current cache-usage */ +#define MSMON_CSU_CAPTURE 0x0848 /* last cache-usage value captured */ +#define MSMON_MBWU 0x0860 /* current mem-bw usage value */ +#define MSMON_MBWU_CAPTURE 0x0868 /* last mem-bw value captured */ +#define MSMON_MBWU_L 0x0880 /* current long mem-bw usage value */ +#define MSMON_MBWU_CAPTURE_L 0x0890 /* last long mem-bw value captured */ +#define MSMON_CAPT_EVNT 0x0808 /* signal a capture event */ +#define MPAMF_ESR 0x00F8 /* error status register */ +#define MPAMF_ECR 0x00F0 /* error control register */ + +/* MPAMF_IDR - MPAM features ID register */ +#define MPAMF_IDR_PARTID_MAX GENMASK(15, 0) +#define MPAMF_IDR_PMG_MAX GENMASK(23, 16) +#define MPAMF_IDR_HAS_CCAP_PART BIT(24) +#define MPAMF_IDR_HAS_CPOR_PART BIT(25) +#define MPAMF_IDR_HAS_MBW_PART BIT(26) +#define MPAMF_IDR_HAS_PRI_PART BIT(27) +#define MPAMF_IDR_HAS_EXT BIT(28) +#define MPAMF_IDR_HAS_IMPL_IDR BIT(29) +#define MPAMF_IDR_HAS_MSMON BIT(30) +#define MPAMF_IDR_HAS_PARTID_NRW BIT(31) +#define MPAMF_IDR_HAS_RIS BIT(32) +#define MPAMF_IDR_HAS_EXT_ESR BIT(38) +#define MPAMF_IDR_HAS_ESR BIT(39) +#define MPAMF_IDR_RIS_MAX GENMASK(59, 56) + + +/* MPAMF_MSMON_IDR - MPAM performance monitoring ID register */ +#define MPAMF_MSMON_IDR_MSMON_CSU BIT(16) +#define MPAMF_MSMON_IDR_MSMON_MBWU BIT(17) +#define 
MPAMF_MSMON_IDR_HAS_LOCAL_CAPT_EVNT BIT(31) + +/* MPAMF_CPOR_IDR - MPAM features cache portion partitioning ID register */ +#define MPAMF_CPOR_IDR_CPBM_WD GENMASK(15, 0) + +/* MPAMF_CCAP_IDR - MPAM features cache capacity partitioning ID register */ +#define MPAMF_CCAP_IDR_CMAX_WD GENMASK(5, 0) +#define MPAMF_CCAP_IDR_HAS_CMIN BIT(29) +#define MPAMF_CCAP_IDR_NO_CMAX BIT(30) +#define MPAMF_CCAP_IDR_HAS_CMAX_SOFTLIM BIT(31) + +/* MPAMF_MBW_IDR - MPAM features memory bandwidth partitioning ID register */ +#define MPAMF_MBW_IDR_BWA_WD GENMASK(5, 0) +#define MPAMF_MBW_IDR_HAS_MIN BIT(10) +#define MPAMF_MBW_IDR_HAS_MAX BIT(11) +#define MPAMF_MBW_IDR_HAS_PBM BIT(12) +#define MPAMF_MBW_IDR_HAS_PROP BIT(13) +#define MPAMF_MBW_IDR_WINDWR BIT(14) +#define MPAMF_MBW_IDR_BWPBM_WD GENMASK(28, 16) + +/* MPAMF_PRI_IDR - MPAM features priority partitioning ID register */ +#define MPAMF_PRI_IDR_HAS_INTPRI BIT(0) +#define MPAMF_PRI_IDR_INTPRI_0_IS_LOW BIT(1) +#define MPAMF_PRI_IDR_INTPRI_WD GENMASK(9, 4) +#define MPAMF_PRI_IDR_HAS_DSPRI BIT(16) +#define MPAMF_PRI_IDR_DSPRI_0_IS_LOW BIT(17) +#define MPAMF_PRI_IDR_DSPRI_WD GENMASK(25, 20) + +/* MPAMF_CSUMON_IDR - MPAM cache storage usage monitor ID register */ +#define MPAMF_CSUMON_IDR_NUM_MON GENMASK(15, 0) +#define MPAMF_CSUMON_IDR_HAS_CAPTURE BIT(31) + +/* MPAMF_MBWUMON_IDR - MPAM memory bandwidth usage monitor ID register */ +#define MPAMF_MBWUMON_IDR_NUM_MON GENMASK(15, 0) +#define MPAMF_MBWUMON_IDR_HAS_RWBW BIT(28) +#define MPAMF_MBWUMON_IDR_LWD BIT(29) +#define MPAMF_MBWUMON_IDR_HAS_LONG BIT(30) +#define MPAMF_MBWUMON_IDR_HAS_CAPTURE BIT(31) + +/* MPAMF_PARTID_NRW_IDR - MPAM PARTID narrowing ID register */ +#define MPAMF_PARTID_NRW_IDR_INTPARTID_MAX GENMASK(15, 0) + +/* MPAMF_IIDR - MPAM implementation ID register */ +#define MPAMF_IIDR_PRODUCTID GENMASK(31, 20) +#define MPAMF_IIDR_PRODUCTID_SHIFT 20 +#define MPAMF_IIDR_VARIANT GENMASK(19, 16) +#define MPAMF_IIDR_VARIANT_SHIFT 16 +#define MPAMF_IIDR_REVISON GENMASK(15, 12) 
+#define MPAMF_IIDR_REVISON_SHIFT 12 +#define MPAMF_IIDR_IMPLEMENTER GENMASK(11, 0) +#define MPAMF_IIDR_IMPLEMENTER_SHIFT 0 + +/* MPAMF_AIDR - MPAM architecture ID register */ +#define MPAMF_AIDR_ARCH_MAJOR_REV GENMASK(7, 4) +#define MPAMF_AIDR_ARCH_MINOR_REV GENMASK(3, 0) + +/* MPAMCFG_PART_SEL - MPAM partition configuration selection register */ +#define MPAMCFG_PART_SEL_PARTID_SEL GENMASK(15, 0) +#define MPAMCFG_PART_SEL_INTERNAL BIT(16) +#define MPAMCFG_PART_SEL_RIS GENMASK(27, 24) + +/* MPAMCFG_CMAX - MPAM cache maximum capacity partition configuration register */ +#define MPAMCFG_CMAX_CMAX GENMASK(15, 0) + +/* + * MPAMCFG_MBW_MIN - MPAM memory minimum bandwidth partitioning configuration + * register + */ +#define MPAMCFG_MBW_MIN_MIN GENMASK(15, 0) + +/* + * MPAMCFG_MBW_MAX - MPAM memory maximum bandwidth partitioning configuration + * register + */ +#define MPAMCFG_MBW_MAX_MAX GENMASK(15, 0) +#define MPAMCFG_MBW_MAX_HARDLIM BIT(31) + +/* + * MPAMCFG_MBW_WINWD - MPAM memory bandwidth partitioning window width + * register + */ +#define MPAMCFG_MBW_WINWD_US_FRAC GENMASK(7, 0) +#define MPAMCFG_MBW_WINWD_US_INT GENMASK(23, 8) + + +/* MPAMCFG_PRI - MPAM priority partitioning configuration register */ +#define MPAMCFG_PRI_INTPRI GENMASK(15, 0) +#define MPAMCFG_PRI_DSPRI GENMASK(31, 16) + +/* + * MPAMCFG_MBW_PROP - Memory bandwidth proportional stride partitioning + * configuration register + */ +#define MPAMCFG_MBW_PROP_STRIDEM1 GENMASK(15, 0) +#define MPAMCFG_MBW_PROP_EN BIT(31) + +/* + * MPAMCFG_INTPARTID - MPAM internal partition narrowing configuration register + */ +#define MPAMCFG_INTPARTID_INTPARTID GENMASK(15, 0) +#define MPAMCFG_INTPARTID_INTERNAL BIT(16) + +/* MSMON_CFG_MON_SEL - Memory system performance monitor selection register */ +#define MSMON_CFG_MON_SEL_MON_SEL GENMASK(7, 0) +#define MSMON_CFG_MON_SEL_RIS GENMASK(27, 24) + +/* MPAMF_ESR - MPAM Error Status Register */ +#define MPAMF_ESR_PARTID_OR_MON GENMASK(15, 0) +#define MPAMF_ESR_PMG 
GENMASK(23, 16) +#define MPAMF_ESR_ERRCODE GENMASK(27, 24) +#define MPAMF_ESR_OVRWR BIT(31) +#define MPAMF_ESR_RIS GENMASK(35, 32) + +/* MPAMF_ECR - MPAM Error Control Register */ +#define MPAMF_ECR_INTEN BIT(0) + +/* Error conditions in accessing memory mapped registers */ +#define MPAM_ERRCODE_NONE 0 +#define MPAM_ERRCODE_PARTID_SEL_RANGE 1 +#define MPAM_ERRCODE_REQ_PARTID_RANGE 2 +#define MPAM_ERRCODE_MSMONCFG_ID_RANGE 3 +#define MPAM_ERRCODE_REQ_PMG_RANGE 4 +#define MPAM_ERRCODE_MONITOR_RANGE 5 +#define MPAM_ERRCODE_INTPARTID_RANGE 6 +#define MPAM_ERRCODE_UNEXPECTED_INTERNAL 7 + +/* + * MSMON_CFG_CSU_FLT - Memory system performance monitor configure cache storage + * usage monitor filter register + */ +#define MSMON_CFG_CSU_FLT_PARTID GENMASK(15, 0) +#define MSMON_CFG_CSU_FLT_PMG GENMASK(23, 16) + +/* + * MSMON_CFG_CSU_CTL - Memory system performance monitor configure cache storage + * usage monitor control register + * MSMON_CFG_MBWU_CTL - Memory system performance monitor configure memory + * bandwidth usage monitor control register + */ +#define MSMON_CFG_x_CTL_TYPE GENMASK(7, 0) +#define MSMON_CFG_x_CTL_MATCH_PARTID BIT(16) +#define MSMON_CFG_x_CTL_MATCH_PMG BIT(17) +#define MSMON_CFG_x_CTL_SCLEN BIT(19) +#define MSMON_CFG_x_CTL_SUBTYPE GENMASK(23, 20) +#define MSMON_CFG_x_CTL_OFLOW_FRZ BIT(24) +#define MSMON_CFG_x_CTL_OFLOW_INTR BIT(25) +#define MSMON_CFG_x_CTL_OFLOW_STATUS BIT(26) +#define MSMON_CFG_x_CTL_CAPT_RESET BIT(27) +#define MSMON_CFG_x_CTL_CAPT_EVNT GENMASK(30, 28) +#define MSMON_CFG_x_CTL_EN BIT(31) + +#define MSMON_CFG_MBWU_CTL_TYPE_MBWU 0x42 +#define MSMON_CFG_MBWU_CTL_TYPE_CSU 0x43 + +#define MSMON_CFG_MBWU_CTL_SUBTYPE_NONE 0 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_READ 1 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_WRITE 2 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_BOTH 3 + +#define MSMON_CFG_MBWU_CTL_SUBTYPE_MAX 3 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_MASK 0x3 + +/* + * MSMON_CFG_MBWU_FLT - Memory system performance monitor configure memory + * bandwidth usage 
monitor filter register + */ +#define MSMON_CFG_MBWU_FLT_PARTID GENMASK(15, 0) +#define MSMON_CFG_MBWU_FLT_PMG GENMASK(23, 16) +#define MSMON_CFG_MBWU_FLT_RWBW GENMASK(31, 30) + +/* + * MSMON_CSU - Memory system performance monitor cache storage usage monitor + * register + * MSMON_CSU_CAPTURE - Memory system performance monitor cache storage usage + * capture register + * MSMON_MBWU - Memory system performance monitor memory bandwidth usage + * monitor register + * MSMON_MBWU_CAPTURE - Memory system performance monitor memory bandwidth usage + * capture register + */ +#define MSMON___VALUE GENMASK(30, 0) +#define MSMON___NRDY BIT(31) +#define MSMON___NRDY_L BIT(63) +#define MSMON___L_VALUE GENMASK(43, 0) +#define MSMON___LWD_VALUE GENMASK(62, 0) + +/* + * MSMON_CAPT_EVNT - Memory system performance monitoring capture event + * generation register + */ +#define MSMON_CAPT_EVNT_NOW BIT(0) + +int mpam_resctrl_prepare_offline(void); + +#endif /* MPAM_INTERNAL_H */ diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c new file mode 100644 index 0000000000000000000000000000000000000000..ce13a227a7dc80f1c08a32c3b63354fd75dca8f5 --- /dev/null +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -0,0 +1,1648 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#define pr_fmt(fmt) "mpam: resctrl: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "mpam_internal.h" + +u64 mpam_resctrl_default_group; + +DECLARE_WAIT_QUEUE_HEAD(resctrl_mon_ctx_waiters); + +/* + * The classes we've picked to map to resctrl resources. + * Class pointer may be NULL. 
+ */ +static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES]; + +static bool exposed_alloc_capable; +static bool exposed_mon_capable; +static struct mpam_class *mbm_local_class; +static struct mpam_class *mbm_total_class; +static struct mpam_class *mbm_core_class; + +/* + * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM1_EL1. + * This applies globally to all traffic the CPU generates. + */ +static bool cdp_enabled; + +/* + * If resctrl_init() succeeded, resctrl_exit() can be used to remove support + * for the filesystem in the event of an error. + */ +static bool resctrl_enabled; + +/* + * mpam_resctrl_pick_caches() needs to know the size of the caches. cacheinfo + * populates this from a device_initcall(). mpam_resctrl_setup() must wait. + */ +static bool cacheinfo_ready; +static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready); + +/* A dummy mon context to use when the monitors were allocated up front */ +u32 __mon_is_rmid_idx = USE_RMID_IDX; + +bool resctrl_arch_alloc_capable(void) +{ + return exposed_alloc_capable; +} + +bool resctrl_arch_mon_capable(void) +{ + return exposed_mon_capable; +} + +bool resctrl_arch_is_mbm_local_enabled(void) +{ + return mbm_local_class; +} + +bool resctrl_arch_is_mbm_total_enabled(void) +{ + return mbm_total_class; +} + +bool resctrl_arch_is_mbm_core_enabled(void) +{ + return mbm_core_class; +} + +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) +{ + switch (rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + return cdp_enabled; + case RDT_RESOURCE_MBA: + default: + /* + * x86's MBA control doesn't support CDP, so user-space doesn't + * expect it. 
+ */ + return false; + } +} + +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) +{ + u64 regval; + struct rdt_resource *r; + u32 i, partid, partid_i, partid_d; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (r->mon_capable) { + r->num_rmid = resctrl_arch_system_num_rmid_idx(); + if (enable) + r->num_rmid >>= 1; + } + } + + cdp_enabled = enable; + + partid = RESCTRL_RESERVED_CLOSID; + + if (enable) { + partid_d = resctrl_get_config_index(partid, CDP_DATA); + partid_i = resctrl_get_config_index(partid, CDP_CODE); + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d) | + FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + + } else { + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid) | + FIELD_PREP(MPAM_SYSREG_PARTID_I, partid); + } + + WRITE_ONCE(mpam_resctrl_default_group, regval); + + return 0; +} + +bool resctrl_arch_hide_cdp(enum resctrl_res_level rid) +{ + return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); +} + +/* + * MSC may raise an error interrupt if it sees an out of range partid/pmg, + * and go on to truncate the value. Regardless of what the hardware supports, + * only the system-wide safe value is safe to use. 
+ */ +u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) +{ + return mpam_partid_max + 1; +} + +u32 resctrl_arch_system_num_rmid_idx(void) +{ + u8 closid_shift = fls(mpam_pmg_max); + u32 num_partid = resctrl_arch_get_num_closid(NULL); + + return num_partid << closid_shift; +} + +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) +{ + u8 closid_shift = fls(mpam_pmg_max); + + WARN_ON_ONCE(closid_shift > 8); + + return (closid << closid_shift) | rmid; +} + +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) +{ + u8 closid_shift = fls(mpam_pmg_max); + u32 pmg_mask = ~(~0 << closid_shift); + + WARN_ON_ONCE(closid_shift > 8); + + *closid = idx >> closid_shift; + *rmid = idx & pmg_mask; +} + +void resctrl_sched_in(struct task_struct *tsk) +{ + lockdep_assert_preemption_disabled(); + + mpam_thread_switch(tsk); +} + +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg) +{ + WARN_ON_ONCE(closid > U16_MAX); + WARN_ON_ONCE(pmg > U8_MAX); + + if (!cdp_enabled) { + mpam_set_cpu_defaults(cpu, closid, closid, pmg, pmg); + } else { + /* + * When CDP is enabled, resctrl halves the closid range and we + * use odd/even partid for one closid. 
+ */ + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_cpu_defaults(cpu, partid_d, partid_i, pmg, pmg); + } +} + +void resctrl_arch_sync_cpu_defaults(void *info) +{ + struct resctrl_cpu_sync *r = info; + + lockdep_assert_preemption_disabled(); + + if (r) { + resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(), + r->closid, r->rmid); + } + + resctrl_sched_in(current); +} + +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + + + WARN_ON_ONCE(closid > U16_MAX); + WARN_ON_ONCE(rmid > U8_MAX); + + if (!cdp_enabled) { + mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); + } else { + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); + } +} + +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM_SYSREG_PARTID_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return tsk_closid == closid; +} + +/* The task's pmg is not unique, the partid must be considered too */ +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM_SYSREG_PARTID_D, regval); + u32 tsk_rmid = FIELD_GET(MPAM_SYSREG_PMG_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return (tsk_closid == closid) && (tsk_rmid == rmid); +} + +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +{ + if (l >= RDT_NUM_RESOURCES) + return NULL; + + return &mpam_resctrl_exports[l].resctrl_res; +} + +static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r, + int evtid) +{ + u32 *ret = kmalloc(sizeof(*ret), GFP_KERNEL); + + if (!ret) + return ERR_PTR(-ENOMEM); + + *ret = __mon_is_rmid_idx; + return ret; +} + +void 
*resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid) +{ + DEFINE_WAIT(wait); + void *ret; + + might_sleep(); + + do { + prepare_to_wait(&resctrl_mon_ctx_waiters, &wait, + TASK_INTERRUPTIBLE); + ret = resctrl_arch_mon_ctx_alloc_no_wait(r, evtid); + if (PTR_ERR(ret) == -ENOSPC) + schedule(); + } while (PTR_ERR(ret) == -ENOSPC && !signal_pending(current)); + finish_wait(&resctrl_mon_ctx_waiters, &wait); + + return ret; +} + +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, + void *arch_mon_ctx) +{ + kfree(arch_mon_ctx); +} + +static enum mon_filter_options resctrl_evt_config_to_mpam(u32 local_evt_cfg) +{ + switch (local_evt_cfg) { + case READS_TO_LOCAL_MEM: + return COUNT_READ; + case NON_TEMP_WRITE_TO_LOCAL_MEM: + return COUNT_WRITE; + default: + return COUNT_BOTH; + } +} + +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid, + u64 *val, void *arch_mon_ctx) +{ + int err; + u64 cdp_val; + u16 num_mon; + struct mon_cfg cfg; + struct mpam_resctrl_dom *dom; + struct mpam_resctrl_res *res; + enum mpam_device_features type; + + resctrl_arch_rmid_read_context_check(); + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + switch (eventid) { + case QOS_L3_OCCUP_EVENT_ID: + case QOS_L2_OCCUP_EVENT_ID: + type = mpam_feat_msmon_csu; + break; + case QOS_L3_MBM_LOCAL_EVENT_ID: + case QOS_L3_MBM_TOTAL_EVENT_ID: + case QOS_L2_MBM_CORE_EVENT_ID: + type = mpam_feat_msmon_mbwu; + break; + default: + return -EINVAL; + } + + /* + * The number of mbwu monitors can't support free run mode, + * adapt the remainder of rmid to the num_mon as compromise. 
+ */ + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + if (type == mpam_feat_msmon_mbwu) + num_mon = res->class->props.num_mbwu_mon; + else + num_mon = res->class->props.num_csu_mon; + + cfg.match_pmg = true; + cfg.pmg = rmid; + cfg.opts = resctrl_evt_config_to_mpam(dom->mbm_local_evt_cfg); + + if (cdp_enabled) { + cfg.partid = resctrl_get_config_index(closid, CDP_DATA); + cfg.mon = cfg.partid % num_mon; + err = mpam_msmon_read(dom->comp, &cfg, type, val); + if (err) + return err; + + cfg.partid = resctrl_get_config_index(closid, CDP_CODE); + cfg.mon = cfg.partid % num_mon; + err = mpam_msmon_read(dom->comp, &cfg, type, &cdp_val); + if (!err) { + pr_debug("read monitor rmid %u %s:%u CODE/DATA: %lld/%lld\n", + resctrl_arch_rmid_idx_encode(closid, rmid), + r->name, dom->comp->comp_id, cdp_val, *val); + *val += cdp_val; + } + } else { + cfg.partid = closid; + cfg.mon = cfg.partid % num_mon; + err = mpam_msmon_read(dom->comp, &cfg, type, val); + } + + return err; +} + +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid) +{ + u16 num_mbwu_mon; + struct mon_cfg cfg; + struct mpam_resctrl_dom *dom; + struct mpam_resctrl_res *res; + + if (eventid != QOS_L3_MBM_LOCAL_EVENT_ID) + return; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + num_mbwu_mon = res->class->props.num_mbwu_mon; + cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid) % num_mbwu_mon; + cfg.match_pmg = true; + cfg.pmg = rmid; + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + if (cdp_enabled) { + cfg.partid = closid << 1; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + + cfg.partid += 1; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + } else { + cfg.partid = closid; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + } +} + +/* + * The rmid realloc threshold should be for the smallest cache exposed to + * resctrl. 
+ */ +static void update_rmid_limits(unsigned int size) +{ + u32 num_unique_pmg = resctrl_arch_system_num_rmid_idx(); + + if (WARN_ON_ONCE(!size)) + return; + + if (resctrl_rmid_realloc_limit && size > resctrl_rmid_realloc_limit) + return; + + resctrl_rmid_realloc_limit = size; + resctrl_rmid_realloc_threshold = size / num_unique_pmg; +} + +static bool cache_has_usable_cpor(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_cpor_part, cprops)) + return false; + + /* TODO: Scaling is not yet supported */ + return (class->props.cpbm_wd <= RESCTRL_MAX_CBM); +} + +static bool cache_has_usable_cmax(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + return mpam_has_feature(mpam_feat_ccap_part, cprops); +} + +static bool cache_has_usable_cmin(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + return mpam_has_feature(mpam_feat_cmin, cprops); +} + +static bool cache_has_usable_csu(struct mpam_class *class) +{ + struct mpam_props *cprops; + + if (!class) + return false; + + cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return false; + + /* + * CSU counters settle on the value, so we can get away with + * having only one. 
+ */ + if (!cprops->num_csu_mon) + return false; + + return (mpam_partid_max > 1) || (mpam_pmg_max != 0); +} + +bool resctrl_arch_is_llc_occupancy_enabled(void) +{ + return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); +} + +bool resctrl_arch_is_l2c_occupancy_enabled(void) +{ + return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L2].class); +} + +static bool class_has_usable_mbwu(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return false; + + return (mpam_partid_max > 1) || (mpam_pmg_max != 0); +} + +static bool mba_class_use_mbw_part(struct mpam_props *cprops) +{ + /* TODO: Scaling is not yet supported */ + return (mpam_has_feature(mpam_feat_mbw_part, cprops) && + cprops->mbw_pbm_bits < MAX_MBA_BW); +} + +static bool class_has_usable_mba(struct mpam_props *cprops) +{ + if (mba_class_use_mbw_part(cprops) || + mpam_has_feature(mpam_feat_mbw_max, cprops)) + return true; + + return false; +} + +static bool class_has_usable_mbw_min(struct mpam_props *cprops) +{ + if (mpam_has_feature(mpam_feat_mbw_min, cprops)) + return true; + + return false; +} + +static bool class_has_usable_intpri(struct mpam_props *cprops) +{ + if (mpam_has_feature(mpam_feat_intpri_part, cprops)) + return true; + + return false; +} + +static bool class_has_usable_max_limit(struct mpam_props *cprops) +{ + if (mpam_has_feature(mpam_feat_max_limit, cprops)) + return true; + + return false; +} + +/* + * Calculate the percentage change from each implemented bit in the control + * This can return 0 when BWA_WD is greater than 6. (100 / (1<<7) == 0) + */ +static u32 get_mba_granularity(struct mpam_props *cprops) +{ + if (mba_class_use_mbw_part(cprops)) { + return MAX_MBA_BW / cprops->mbw_pbm_bits; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + /* + * bwa_wd is the number of bits implemented in the 0.xxx + * fixed point fraction. 1 bit is 50%, 2 is 25% etc. 
+ */ + return MAX_MBA_BW / (1 << cprops->bwa_wd); + } + + return 0; +} + +static u32 mbw_pbm_to_percent(unsigned long mbw_pbm, struct mpam_props *cprops) +{ + u32 bit, result = 0, granularity = get_mba_granularity(cprops); + + for_each_set_bit(bit, &mbw_pbm, cprops->mbw_pbm_bits % 32) { + result += granularity; + } + + return result; +} + +static int get_wd_precision(u8 wd) +{ + int ret = (1 << wd) / MAX_MBA_BW; + + if (!ret) + return 1; + + return ret; +} + +static u32 mbw_max_to_percent(u16 mbw_max, u8 wd) +{ + u8 bit; + u32 divisor = 2, value = 0, precision = get_wd_precision(wd); + + if (mbw_max == GENMASK(15, 15 - wd + 1)) + return MAX_MBA_BW; + + for (bit = 15; bit; bit--) { + if (mbw_max & BIT(bit)) + value += MAX_MBA_BW * precision / divisor; + divisor <<= 1; + } + + return DIV_ROUND_UP(value, precision); +} + +static u32 percent_to_mbw_pbm(u32 pc, struct mpam_props *cprops) +{ + u32 granularity = get_mba_granularity(cprops); + u8 num_bits = pc / granularity; + + if (!num_bits) + return 0; + + /* TODO: pick bits at random to avoid contention */ + return (1 << num_bits) - 1; +} + +static u16 percent_to_mbw_max(u32 pc, u8 wd) +{ + u8 bit; + u32 divisor = 2, value = 0, precision = get_wd_precision(wd); + + if (WARN_ON_ONCE(wd > 15)) + return MAX_MBA_BW; + + if (pc == MAX_MBA_BW) + return GENMASK(15, 15 - wd + 1); + + pc *= precision; + + for (bit = 15; bit; bit--) { + if (pc >= MAX_MBA_BW * precision / divisor) { + pc -= MAX_MBA_BW * precision / divisor; + value |= BIT(bit); + } + divisor <<= 1; + + if (!pc || !(MAX_MBA_BW * precision / divisor)) + break; + } + + value &= GENMASK(15, 15 - wd + 1); + + return value; +} + +static u16 percent_to_ca_max(u32 pc, u8 wd) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + u32 valid_max; + + if (read_cpuid_implementor() != ARM_CPU_IMP_HISI) + return percent_to_mbw_max(pc, wd); + + valid_max = mpam_cpbm_wd_hisi_workaround(l3->cache.cbm_len, + mpam_feat_ccap_part, + l3->cache_level); + + if (pc 
>= MAX_MBA_BW) + return valid_max << (16 - wd); + + return ((pc * valid_max + 50) / 100) << (16 - wd); +} + +static u16 ca_max_to_percent(u16 ca_max, u8 wd) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + u32 valid_max; + + if (read_cpuid_implementor() != ARM_CPU_IMP_HISI) + return mbw_max_to_percent(ca_max, wd); + + valid_max = mpam_cpbm_wd_hisi_workaround(l3->cache.cbm_len, + mpam_feat_ccap_part, + l3->cache_level); + + ca_max = ca_max >> (16 - wd); + if (ca_max >= valid_max) + return MAX_MBA_BW; + + return (ca_max * 100 + valid_max / 2) / valid_max; +} + +/* Test whether we can export MPAM_CLASS_CACHE:{2,3}? */ +static void mpam_resctrl_pick_caches(void) +{ + int idx; + unsigned int cache_size; + struct mpam_class *class; + struct mpam_resctrl_res *res; + bool has_cpor, has_cmax, has_cmin, has_intpri; + + lockdep_assert_cpus_held(); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + struct mpam_props *cprops = &class->props; + + has_cpor = cache_has_usable_cpor(class); + has_cmax = cache_has_usable_cmax(class); + has_cmin = cache_has_usable_cmin(class); + has_intpri = class_has_usable_intpri(cprops); + + if (class->type != MPAM_CLASS_CACHE) { + pr_debug("pick_caches: Class is not a cache\n"); + continue; + } + + if (class->level != 2 && class->level != 3) { + pr_debug("pick_caches: not L2 or L3\n"); + continue; + } + + if (class->level == 2 && !has_cpor) { + pr_debug("pick_caches: L2 missing CPOR\n"); + continue; + } else if (!has_cpor && !cache_has_usable_csu(class)) { + pr_debug("pick_caches: Cache misses CPOR and CSU\n"); + continue; + } + + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { + pr_debug("pick_caches: Class has missing CPUs\n"); + continue; + } + + /* Assume cache levels are the same size for all CPUs... 
*/ + cache_size = get_cpu_cacheinfo_size(smp_processor_id(), class->level); + if (!cache_size) { + pr_debug("pick_caches: Could not read cache size\n"); + continue; + } + + if (mpam_has_feature(mpam_feat_msmon_csu, cprops)) { + if (class->level == 3) + update_rmid_limits(cache_size); + } + + if (has_cpor) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; + res->resctrl_res.name = "L2"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3]; + res->resctrl_res.name = "L3"; + } + res->class = class; + } + + if (has_cmax) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2_MAX]; + res->resctrl_res.name = "L2MAX"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3_MAX]; + res->resctrl_res.name = "L3MAX"; + } + res->class = class; + } + + if (has_cmin) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2_MIN]; + res->resctrl_res.name = "L2MIN"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3_MIN]; + res->resctrl_res.name = "L3MIN"; + } + res->class = class; + } + + if (has_intpri) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2_PRI]; + res->resctrl_res.name = "L2PRI"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3_PRI]; + res->resctrl_res.name = "L3PRI"; + } + res->class = class; + } + } + srcu_read_unlock(&mpam_srcu, idx); +} + +static void mpam_resctrl_pick_mba(void) +{ + bool has_mba, has_mbw_min, has_intpri, has_limit; + struct mpam_resctrl_res *res; + struct mpam_class *class; + int idx; + + lockdep_assert_cpus_held(); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + struct mpam_props *cprops = &class->props; + + has_mba = class_has_usable_mba(cprops); + has_mbw_min = class_has_usable_mbw_min(cprops); + has_intpri = class_has_usable_intpri(cprops); + has_limit = class_has_usable_max_limit(cprops); + + if (class->level < 3) + continue; + + if (!cpumask_equal(&class->affinity, 
cpu_possible_mask)) + continue; + + if (has_mba) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MBA]; + res->class = class; + res->resctrl_res.name = "MB"; + } + + if (has_mbw_min) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MB_MIN]; + res->class = class; + res->resctrl_res.name = "MBMIN"; + } + + if (has_intpri) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MB_PRI]; + res->class = class; + res->resctrl_res.name = "MBPRI"; + } + + if (has_limit) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MB_HDL]; + res->class = class; + res->resctrl_res.name = "MBHDL"; + } + } + srcu_read_unlock(&mpam_srcu, idx); +} + +bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) +{ + struct mpam_props *cprops; + + switch (evt) { + case QOS_L3_MBM_LOCAL_EVENT_ID: + if (!mbm_local_class) + return false; + cprops = &mbm_local_class->props; + + return mpam_has_feature(mpam_feat_msmon_mbwu_rwbw, cprops); + default: + return false; + } +} + +void resctrl_arch_mon_event_config_read(void *info) +{ + struct mpam_resctrl_dom *dom; + struct resctrl_mon_config_info *mon_info = info; + + dom = container_of(mon_info->d, struct mpam_resctrl_dom, resctrl_dom); + mon_info->mon_config = dom->mbm_local_evt_cfg & MAX_EVT_CONFIG_BITS; +} + +void resctrl_arch_mon_event_config_write(void *info) +{ + struct mpam_resctrl_dom *dom; + struct resctrl_mon_config_info *mon_info = info; + + if (mon_info->mon_config & ~MPAM_RESTRL_EVT_CONFIG_VALID) { + mon_info->err = -EOPNOTSUPP; + return; + } + + dom = container_of(mon_info->d, struct mpam_resctrl_dom, resctrl_dom); + dom->mbm_local_evt_cfg = mon_info->mon_config & MPAM_RESTRL_EVT_CONFIG_VALID; +} + +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) +{ + struct mpam_resctrl_dom *dom; + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + dom->mbm_local_evt_cfg = MPAM_RESTRL_EVT_CONFIG_VALID; + mpam_msmon_reset_all_mbwu(dom->comp); +} + +static void mpam_llc_gran_hisi_workaround(struct rdt_resource *r) +{ + int 
cbm_len; + + if (read_cpuid_implementor() != ARM_CPU_IMP_HISI) + return; + + if (r->fflags != RFTYPE_RES_CACHE || r->cache_level != 3) + return; + + cbm_len = resctrl_arch_get_resource(RDT_RESOURCE_L3)->cache.cbm_len; + r->membw.bw_gran = max(100 / cbm_len, 1); +} + +static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) +{ + struct mpam_class *class = res->class; + struct mpam_props *cprops = &class->props; + struct rdt_resource *r = &res->resctrl_res; + bool has_mbwu = class_has_usable_mbwu(class); + bool has_csu = cache_has_usable_csu(class); + + /* Is this one of the two well-known caches? */ + switch (res->resctrl_res.rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + /* TODO: Scaling is not yet supported */ + r->cache.cbm_len = class->props.cpbm_wd; + r->cache.arch_has_sparse_bitmasks = true; + + /* mpam_devices will reject empty bitmaps */ + r->cache.min_cbm_bits = 1; + + /* TODO: kill these properties off as they are derivatives */ + r->format_str = "%d=%0*x"; + r->schema_fmt = RESCTRL_SCHEMA_BITMAP; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = BIT_MASK(class->props.cpbm_wd) - 1; + r->data_width = (class->props.cpbm_wd + 3) / 4; + r->cache_level = class->level; + + /* + * Which bits are shared with other ...things... + * Unknown devices use partid-0 which uses all the bitmap + * fields. Until we configure the SMMU and GIC not to do this, + * 'all the bits' is the correct answer here. + */ + r->cache.shareable_bits = r->default_ctrl; + + if (mpam_has_feature(mpam_feat_cpor_part, cprops)) { + r->alloc_capable = true; + exposed_alloc_capable = true; + } + + /* + * MBWU counters may be 'local' or 'total' depending on where + * they are in the topology. Counters on caches are assumed to + * be local. If it's on the memory controller, it's assumed to + * be global. 
+ */ + if (has_mbwu) { + if (class->level == 3) { + mbm_local_class = class; + r->mon_capable = true; + + } else if (class->level == 2) { + mbm_core_class = class; + r->mon_capable = true; + } + } + + /* + * CSU counters only make sense on a cache. The file is called + * llc_occupancy, but it's expected to be on the L3. + */ + if (has_csu && class->type == MPAM_CLASS_CACHE) + r->mon_capable = true; + + /* + * The power domain of the L2 cache msc is shared with the + * core's, which will cause information of the L2 msc to + * be lost when the core enters a power-down state. + */ + if (class->level <= 2) + r->is_volatile = true; + break; + + case RDT_RESOURCE_MBA: + /* TODO: kill these properties off as they are derivatives */ + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = MAX_MBA_BW; + r->membw.max_bw = MAX_MBA_BW; + r->data_width = 3; + + r->membw.delay_linear = true; + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; + r->membw.min_bw = 1; + r->membw.bw_gran = get_mba_granularity(cprops); + + /* Round up to at least 1% */ + if (!r->membw.bw_gran) + r->membw.bw_gran = 1; + + if (class_has_usable_mba(cprops)) { + r->alloc_capable = true; + exposed_alloc_capable = true; + } + + if (has_mbwu && class->type == MPAM_CLASS_MEMORY) { + mbm_total_class = class; + r->mon_capable = true; + } + break; + + case RDT_RESOURCE_L3_MAX: + case RDT_RESOURCE_L2_MAX: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = MAX_MBA_BW; + r->membw.max_bw = MAX_MBA_BW; + r->data_width = 3; + r->cache_level = class->level; + + if (cache_has_usable_cmax(class)) + r->alloc_capable = true; + + r->membw.min_bw = 1; + r->membw.bw_gran = max(100 / (1 << cprops->cmax_wd), 1); + mpam_llc_gran_hisi_workaround(r); + break; + + case RDT_RESOURCE_L3_MIN: + case RDT_RESOURCE_L2_MIN: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = 
RFTYPE_RES_CACHE; + r->default_ctrl = 0; + r->membw.max_bw = MAX_MBA_BW; + r->data_width = 3; + r->cache_level = class->level; + + if (cache_has_usable_cmin(class)) + r->alloc_capable = true; + + r->membw.min_bw = 0; + r->membw.bw_gran = max(100 / (1 << cprops->cmax_wd), 1); + mpam_llc_gran_hisi_workaround(r); + break; + + case RDT_RESOURCE_MB_MIN: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = 0; + r->membw.max_bw = MAX_MBA_BW; + r->data_width = 3; + + r->membw.delay_linear = true; + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; + r->membw.bw_gran = get_mba_granularity(cprops); + + /* Round up to at least 1% */ + if (!r->membw.bw_gran) + r->membw.bw_gran = 1; + + if (class_has_usable_mbw_min(cprops)) + r->alloc_capable = true; + break; + + case RDT_RESOURCE_L3_PRI: + case RDT_RESOURCE_L2_PRI: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = 0; + r->membw.max_bw = GENMASK(cprops->intpri_wd - 1, 0); + r->data_width = 3; + r->cache_level = class->level; + + if (class_has_usable_intpri(cprops)) + r->alloc_capable = true; + + r->membw.min_bw = 0; + r->membw.bw_gran = 1; + break; + + case RDT_RESOURCE_MB_PRI: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = 3; + r->membw.max_bw = GENMASK(cprops->intpri_wd - 1, 0); + r->data_width = 3; + + r->membw.bw_gran = 1; + + if (class_has_usable_intpri(cprops)) + r->alloc_capable = true; + break; + + case RDT_RESOURCE_MB_HDL: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = 1; + r->membw.max_bw = 1; + r->data_width = 1; + + r->membw.bw_gran = 1; + + if (class_has_usable_max_limit(cprops)) + r->alloc_capable = true; + break; + + default: + break; + } + + if (r->mon_capable) { + exposed_mon_capable = true; + + /* + * Unfortunately, num_rmid doesn't 
mean anything for + * mpam, and it's exposed to user-space! + * num_rmid is supposed to mean the number of groups + * that can be created, both control or monitor groups. + * For mpam, each control group has its own pmg/rmid + * space. + */ + r->num_rmid = resctrl_arch_system_num_rmid_idx(); + } + + return 0; +} + +int mpam_resctrl_setup(void) +{ + int err = 0; + struct mpam_resctrl_res *res; + enum resctrl_res_level i; + + wait_event(wait_cacheinfo_ready, cacheinfo_ready); + + cpus_read_lock(); + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + INIT_LIST_HEAD(&res->resctrl_res.domains); + INIT_LIST_HEAD(&res->resctrl_res.evt_list); + res->resctrl_res.rid = i; + } + + mpam_resctrl_pick_caches(); + mpam_resctrl_pick_mba(); + /* TODO: mpam_resctrl_pick_counters(); */ + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + if (!res->class) + continue; // dummy resource + + err = mpam_resctrl_resource_init(res); + if (err) + break; + } + cpus_read_unlock(); + + if (!err && !exposed_alloc_capable && !exposed_mon_capable) + err = -EOPNOTSUPP; + + if (!err) { + if (!is_power_of_2(mpam_pmg_max + 1)) { + /* + * If not all the partid*pmg values are valid indexes, + * resctrl may allocate pmg that don't exist. This + * should cause an error interrupt. + */ + pr_warn("Number of PMG is not a power of 2! 
resctrl may misbehave\n"); + } + + err = resctrl_init(); + if (!err) + WRITE_ONCE(resctrl_enabled, true); + } + + return err; +} + +void mpam_resctrl_exit(void) +{ + if (!READ_ONCE(resctrl_enabled)) + return; + + WRITE_ONCE(resctrl_enabled, false); + resctrl_exit(); +} + +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type type) +{ + u32 partid; + struct mpam_config *cfg; + struct mpam_props *cprops; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + enum mpam_device_features configured_by; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return r->default_ctrl; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + cprops = &res->class->props; + + /* + * When CDP is enabled, but the resource doesn't support it, we + * need to get the configuration from the CDP_CODE resctrl_conf_type + * which is the same as the CDP_DATA one. + */ + if (resctrl_arch_hide_cdp(r->rid)) + partid = resctrl_get_config_index(closid, CDP_CODE); + else + partid = resctrl_get_config_index(closid, type); + + cfg = &dom->comp->cfg[partid]; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + configured_by = mpam_feat_cpor_part; + break; + case RDT_RESOURCE_L2_MAX: + case RDT_RESOURCE_L3_MAX: + configured_by = mpam_feat_ccap_part; + break; + case RDT_RESOURCE_L2_MIN: + case RDT_RESOURCE_L3_MIN: + configured_by = mpam_feat_cmin; + break; + case RDT_RESOURCE_L2_PRI: + case RDT_RESOURCE_L3_PRI: + case RDT_RESOURCE_MB_PRI: + configured_by = mpam_feat_intpri_part; + break; + + case RDT_RESOURCE_MBA: + if (mba_class_use_mbw_part(cprops)) { + configured_by = mpam_feat_mbw_part; + break; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + configured_by = mpam_feat_mbw_max; + break; + } + return -EINVAL; + + case RDT_RESOURCE_MB_MIN: + configured_by = mpam_feat_mbw_min; + break; + + case RDT_RESOURCE_MB_HDL: + 
configured_by = mpam_feat_max_limit; + break; + + default: + return -EINVAL; + } + + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r) || + !mpam_has_feature(configured_by, cfg)) + return r->default_ctrl; + + switch (configured_by) { + case mpam_feat_cpor_part: + /* TODO: Scaling is not yet supported */ + return cfg->cpbm; + case mpam_feat_ccap_part: + return ca_max_to_percent(cfg->cmax, cprops->cmax_wd); + case mpam_feat_cmin: + return ca_max_to_percent(cfg->cmin, cprops->cmax_wd); + case mpam_feat_intpri_part: + return cfg->intpri; + case mpam_feat_mbw_part: + /* TODO: Scaling is not yet supported */ + return mbw_pbm_to_percent(cfg->mbw_pbm, cprops); + case mpam_feat_mbw_max: + return mbw_max_to_percent(cfg->mbw_max, cprops->bwa_wd); + case mpam_feat_mbw_min: + return mbw_max_to_percent(cfg->mbw_min, cprops->bwa_wd); + case mpam_feat_max_limit: + return cfg->max_limit; + default: + return -EINVAL; + } +} + +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type t, u32 cfg_val) +{ + u32 partid; + struct mpam_config cfg; + struct mpam_props *cprops; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + lockdep_assert_irqs_enabled(); + + /* NOTE: don't check the CPU as mpam_apply_config() doesn't care, + * and resctrl_arch_update_domains() depends on this. 
+ */ + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + cprops = &res->class->props; + + partid = resctrl_get_config_index(closid, t); + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) + return -EINVAL; + + cfg = dom->comp->cfg[partid]; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + /* TODO: Scaling is not yet supported */ + cfg.cpbm = cfg_val; + mpam_set_feature(mpam_feat_cpor_part, &cfg); + break; + case RDT_RESOURCE_L2_MAX: + case RDT_RESOURCE_L3_MAX: + cfg.cmax = percent_to_ca_max(cfg_val, cprops->cmax_wd); + mpam_set_feature(mpam_feat_ccap_part, &cfg); + break; + case RDT_RESOURCE_L2_MIN: + case RDT_RESOURCE_L3_MIN: + cfg.cmin = percent_to_ca_max(cfg_val, cprops->cmax_wd); + mpam_set_feature(mpam_feat_cmin, &cfg); + break; + case RDT_RESOURCE_L2_PRI: + case RDT_RESOURCE_L3_PRI: + case RDT_RESOURCE_MB_PRI: + cfg.intpri = cfg_val; + mpam_set_feature(mpam_feat_intpri_part, &cfg); + break; + case RDT_RESOURCE_MBA: + if (mba_class_use_mbw_part(cprops)) { + cfg.mbw_pbm = percent_to_mbw_pbm(cfg_val, cprops); + mpam_set_feature(mpam_feat_mbw_part, &cfg); + break; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops->bwa_wd); + mpam_set_feature(mpam_feat_mbw_max, &cfg); + break; + } + return -EINVAL; + case RDT_RESOURCE_MB_MIN: + cfg.mbw_min = percent_to_mbw_max(cfg_val, cprops->bwa_wd); + mpam_set_feature(mpam_feat_mbw_min, &cfg); + break; + case RDT_RESOURCE_MB_HDL: + cfg.max_limit = cfg_val; + mpam_set_feature(mpam_feat_max_limit, &cfg); + break; + default: + return -EINVAL; + } + + return mpam_apply_config(dom->comp, partid, &cfg); +} + +/* TODO: this is IPI heavy */ +int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) +{ + int err = 0; + struct rdt_domain *d; + enum resctrl_conf_type t; + struct resctrl_staged_config *cfg; + + lockdep_assert_cpus_held(); + 
lockdep_assert_irqs_enabled(); + + list_for_each_entry(d, &r->domains, list) { + for (t = 0; t < CDP_NUM_TYPES; t++) { + cfg = &d->staged_config[t]; + if (!cfg->have_new_ctrl) + continue; + + err = resctrl_arch_update_one(r, d, closid, t, + cfg->new_ctrl); + if (err) + return err; + } + } + + return err; +} + +void resctrl_arch_reset_resources(void) +{ + int i, idx; + struct mpam_class *class; + struct mpam_resctrl_res *res; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy resource + + if (!res->resctrl_res.alloc_capable) + continue; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) + mpam_reset_class(class); + srcu_read_unlock(&mpam_srcu, idx); + } +} + +static struct mpam_resctrl_dom * +mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) +{ + struct mpam_resctrl_dom *dom; + struct mpam_class *class = res->class; + struct mpam_component *comp_iter, *comp; + + comp = NULL; + list_for_each_entry(comp_iter, &class->components, class_list) { + if (cpumask_test_cpu(cpu, &comp_iter->affinity)) { + comp = comp_iter; + break; + } + } + + /* cpu with unknown exported component? 
*/ + if (WARN_ON_ONCE(!comp)) + return ERR_PTR(-EINVAL); + + dom = kzalloc_node(sizeof(*dom), GFP_KERNEL, cpu_to_node(cpu)); + if (!dom) + return ERR_PTR(-ENOMEM); + + dom->comp = comp; + INIT_LIST_HEAD(&dom->resctrl_dom.list); + dom->resctrl_dom.id = comp->comp_id; + dom->mbm_local_evt_cfg = MPAM_RESTRL_EVT_CONFIG_VALID; + cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); + + /* TODO: this list should be sorted */ + list_add_tail(&dom->resctrl_dom.list, &res->resctrl_res.domains); + + return dom; +} + +/* Like resctrl_get_domain_from_cpu(), but for offline CPUs */ +static struct mpam_resctrl_dom * +mpam_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res) +{ + struct rdt_domain *d; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &res->resctrl_res.domains, list) { + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + if (cpumask_test_cpu(cpu, &dom->comp->affinity)) + return dom; + } + + return NULL; +} + +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) +{ + struct rdt_domain *d; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &r->domains, list) { + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + if (dom->comp->comp_id == id) + return &dom->resctrl_dom; + } + + return NULL; +} + +int mpam_resctrl_online_cpu(unsigned int cpu) +{ + int i, err; + struct mpam_resctrl_dom *dom; + struct mpam_resctrl_res *res; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy_resource; + + dom = mpam_get_domain_from_cpu(cpu, res); + if (dom) { + cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); + continue; + } + + dom = mpam_resctrl_alloc_domain(cpu, res); + if (IS_ERR(dom)) + return PTR_ERR(dom); + err = resctrl_online_domain(&res->resctrl_res, &dom->resctrl_dom); + if (err) + return err; + } + + resctrl_online_cpu(cpu); + return 0; +} + +int 
mpam_resctrl_prepare_offline(void) +{ + struct mpam_resctrl_res *res; + int i; + + if (resctrl_mounted) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (res->resctrl_res.is_volatile && + !res->resctrl_res.invisible) { + pr_info("%s is working, umount /sys/fs/resctrl first.\n", + res->resctrl_res.name); + return -EBUSY; + } + } + } + + return 0; +} + +int mpam_resctrl_offline_cpu(unsigned int cpu) +{ + int i; + struct rdt_domain *d; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + + resctrl_offline_cpu(cpu); + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy resource + + d = resctrl_get_domain_from_cpu(cpu, &res->resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + /* The last one standing was ahead of us... */ + if (WARN_ON_ONCE(!d)) + continue; + + cpumask_clear_cpu(cpu, &d->cpu_mask); + + if (!cpumask_empty(&d->cpu_mask)) + continue; + + resctrl_offline_domain(&res->resctrl_res, &dom->resctrl_dom); + list_del(&d->list); + kfree(dom); + } + + return 0; +} + +static struct mon_evt llc_occupancy_event = { + .name = "llc_occupancy", + .evtid = QOS_L3_OCCUP_EVENT_ID, +}; + +static struct mon_evt l2c_occupancy_event = { + .name = "l2c_occupancy", + .evtid = QOS_L2_OCCUP_EVENT_ID, +}; + +static struct mon_evt mbm_total_event = { + .name = "mbm_total_bytes", + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, +}; + +static struct mon_evt mbm_local_event = { + .name = "mbm_local_bytes", + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, +}; + +static struct mon_evt mbm_core_event = { + .name = "mbm_core_bytes", + .evtid = QOS_L2_MBM_CORE_EVENT_ID, +}; + +/* + * Initialize the event list for the resource. + * + * Note that MBM events are also part of RDT_RESOURCE_L3 resource + * because as per the SDM the total and local memory bandwidth + * are enumerated as part of L3 monitoring. 
+ */ +static void l3_mon_evt_init(struct rdt_resource *r) +{ + INIT_LIST_HEAD(&r->evt_list); + + if (!r->mon_capable) + return; + + if (r->rid == RDT_RESOURCE_L3) { + if (resctrl_arch_is_llc_occupancy_enabled()) + list_add_tail(&llc_occupancy_event.list, &r->evt_list); + + if (resctrl_arch_is_mbm_local_enabled()) + list_add_tail(&mbm_local_event.list, &r->evt_list); + } + + if (r->rid == RDT_RESOURCE_L2) { + if (resctrl_arch_is_l2c_occupancy_enabled()) + list_add_tail(&l2c_occupancy_event.list, &r->evt_list); + + if (resctrl_arch_is_mbm_core_enabled()) + list_add_tail(&mbm_core_event.list, &r->evt_list); + } + + if ((r->rid == RDT_RESOURCE_MBA) && + resctrl_arch_is_mbm_total_enabled()) + list_add_tail(&mbm_total_event.list, &r->evt_list); +} + +int resctrl_arch_mon_resource_init(void) +{ + l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_L3)); + l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_L2)); + l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_MBA)); + + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { + mbm_total_event.configurable = true; + mbm_config_rftype_init("mbm_total_bytes_config"); + } + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { + mbm_local_event.configurable = true; + mbm_config_rftype_init("mbm_local_bytes_config"); + } + + return 0; +} + +static int __init __cacheinfo_ready(void) +{ + cacheinfo_ready = true; + wake_up(&wait_cacheinfo_ready); + + return 0; +} +device_initcall_sync(__cacheinfo_ready); diff --git a/fs/Kconfig b/fs/Kconfig index 8d41bc425e455c66e95b34e7953a58bf7f87bfbd..53c2f5152b75c11d04748ca34ce60d64a6b8c447 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -337,6 +337,7 @@ source "fs/omfs/Kconfig" source "fs/hpfs/Kconfig" source "fs/qnx4/Kconfig" source "fs/qnx6/Kconfig" +source "fs/resctrl/Kconfig" source "fs/romfs/Kconfig" source "fs/pstore/Kconfig" source "fs/sysv/Kconfig" diff --git a/fs/Makefile b/fs/Makefile index 
777a161d6bab5b7be7c8f74bb16a17cb3eeb703f..1975dcd748d34c3d97df89ccf3295a859094b827 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -131,3 +131,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-$(CONFIG_RESCTRL_FS) += resctrl/ diff --git a/fs/resctrl/Kconfig b/fs/resctrl/Kconfig new file mode 100644 index 0000000000000000000000000000000000000000..540da76e118917d88319a6a1d60ac4e614184559 --- /dev/null +++ b/fs/resctrl/Kconfig @@ -0,0 +1,24 @@ +config RESCTRL_FS + bool "CPU Resource Control Filesystem (resctrl)" + depends on ARCH_HAS_CPU_RESCTRL + depends on ARM_CPU_RESCTRL + select KERNFS + select PROC_CPU_RESCTRL if PROC_FS + help + Resctrl is a filesystem interface + to control allocation and + monitoring of system resources + used by the CPUs. + +config RESCTRL_FS_PSEUDO_LOCK + bool + help + Software mechanism to try and pin data in a cache portion using + micro-architecture tricks. + +config RESCTRL_RMID_DEPENDS_ON_CLOSID + bool + help + Enabled by the architecture when the RMID values depend on the CLOSID. + This causes the closid allocator to search for CLOSID with clean + RMID. diff --git a/fs/resctrl/Makefile b/fs/resctrl/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..4c32e5e914b620b0a34ff8d775a057ccc1a92d26 --- /dev/null +++ b/fs/resctrl/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_RESCTRL_FS) += rdtgroup.o ctrlmondata.o monitor.o psuedo_lock.o diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c new file mode 100644 index 0000000000000000000000000000000000000000..061d2e42f8210fc903bf39f6b0e7eb821e1c1edc --- /dev/null +++ b/fs/resctrl/ctrlmondata.c @@ -0,0 +1,564 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Resource Director Technology(RDT) + * - Cache Allocation code. 
+ * + * Copyright (C) 2016 Intel Corporation + * + * Authors: + * Fenghua Yu + * Tony Luck + * + * More information about RDT can be found in the Intel (R) x86 Architecture + * Software Developer Manual June 2016, volume 3, section 17.17. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include "internal.h" + +struct rdt_parse_data { + struct rdtgroup *rdtgrp; + char *buf; +}; + +typedef int (ctrlval_parser_t)(struct rdt_parse_data *data, + struct resctrl_schema *s, + struct rdt_domain *d); + +/* + * Check whether MBA bandwidth percentage value is correct. The value is + * checked against the minimum and max bandwidth values specified by the + * hardware. The allocated bandwidth percentage is rounded to the next + * control step available on the hardware. + */ +static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) +{ + int ret; + u32 bw; + + /* + * Only linear delay values are supported for current Intel SKUs. + */ + if (!r->membw.delay_linear && r->membw.arch_needs_linear) { + rdt_last_cmd_printf("No support for non-linear %s domains\n", + r->name); + return false; + } + + ret = kstrtou32(buf, 10, &bw); + if (ret) { + rdt_last_cmd_printf("Non-decimal digit in %s value %s\n", + r->name, buf); + return false; + } + + /* Nothing else to do if software controller is enabled. 
*/ + if (is_mba_sc(r)) { + *data = bw; + return true; + } + + if (bw < r->membw.min_bw || bw > r->membw.max_bw) { + rdt_last_cmd_printf("%s value %d out of range [%d,%d]\n", + r->name, bw, r->membw.min_bw, + r->membw.max_bw); + return false; + } + + *data = roundup(bw, (unsigned long)r->membw.bw_gran); + return true; +} + +static int parse_bw_conf_type(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d, enum resctrl_conf_type conf_type) +{ + struct resctrl_staged_config *cfg; + u32 closid = data->rdtgrp->closid; + struct rdt_resource *r = s->res; + u32 bw_val; + + cfg = &d->staged_config[conf_type]; + if (cfg->have_new_ctrl) { + rdt_last_cmd_printf("Duplicate domain %d\n", d->id); + return -EINVAL; + } + + if (!bw_validate(data->buf, &bw_val, r)) + return -EINVAL; + + if (is_mba_sc(r)) { + d->mbps_val[closid] = bw_val; + return 0; + } + + cfg->new_ctrl = bw_val; + cfg->have_new_ctrl = true; + + return 0; +} + +static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) +{ + struct rdt_resource *r = s->res; + int err; + + /* + * When CDP is enabled, but the resource doesn't support it, we + * need to apply the same configuration to both of the CDP_CODE + * and CDP_DATA resctrl_conf_type. + */ + if (resctrl_arch_hide_cdp(r->rid)) { + err = parse_bw_conf_type(data, s, d, CDP_CODE); + if (err) + return err; + + return parse_bw_conf_type(data, s, d, CDP_DATA); + } + + return parse_bw_conf_type(data, s, d, s->conf_type); +} + +/* + * Check whether a cache bit mask is valid. + * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: + * - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1 + * - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1 + * + * Haswell does not support a non-contiguous 1s value and additionally + * requires at least two bits set. + * AMD allows non-contiguous bitmasks. 
+ */ +static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) +{ + unsigned long first_bit, zero_bit, val; + unsigned int cbm_len = r->cache.cbm_len; + int ret; + + ret = kstrtoul(buf, 16, &val); + if (ret) { + rdt_last_cmd_printf("Non-hex character in the mask %s\n", buf); + return false; + } + + if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) { + rdt_last_cmd_puts("Mask out of range\n"); + return false; + } + + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); + + /* Are non-contiguous bitmasks allowed? */ + if (!r->cache.arch_has_sparse_bitmasks && + (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) { + rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val); + return false; + } + + if ((zero_bit - first_bit) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("Need at least %d bits in the mask\n", + r->cache.min_cbm_bits); + return false; + } + + *data = val; + return true; +} + +/* + * Read one cache bit mask (hex). Check that it is valid for the current + * resource type. + */ +static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) +{ + struct rdtgroup *rdtgrp = data->rdtgrp; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + u32 cbm_val; + + cfg = &d->staged_config[s->conf_type]; + if (cfg->have_new_ctrl) { + rdt_last_cmd_printf("Duplicate domain %d\n", d->id); + return -EINVAL; + } + + /* + * Cannot set up more than one pseudo-locked region in a cache + * hierarchy. 
+ */ + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && + rdtgroup_pseudo_locked_in_hierarchy(d)) { + rdt_last_cmd_puts("Pseudo-locked region in hierarchy\n"); + return -EINVAL; + } + + if (!cbm_validate(data->buf, &cbm_val, r)) + return -EINVAL; + + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + (rdtgrp->mode == RDT_MODE_EXCLUSIVE || + rdtgrp->mode == RDT_MODE_SHAREABLE) && + rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { + rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); + return -EINVAL; + } + + /* + * The CBM may not overlap with the CBM of another closid if + * either is exclusive. + */ + if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) { + rdt_last_cmd_puts("Overlaps with exclusive group\n"); + return -EINVAL; + } + + if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) { + if (rdtgrp->mode == RDT_MODE_EXCLUSIVE || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + rdt_last_cmd_puts("Overlaps with other group\n"); + return -EINVAL; + } + } + + cfg->new_ctrl = cbm_val; + cfg->have_new_ctrl = true; + + return 0; +} + +static ctrlval_parser_t *get_parser(struct rdt_resource *res) +{ + if (res->schema_fmt == RESCTRL_SCHEMA_BITMAP) + return &parse_cbm; + else + return &parse_bw; +} + +/* + * For each domain in this resource we expect to find a series of: + * id=mask + * separated by ";". The "id" is in decimal, and must match one of + * the "id"s for this resource. 
+ */ +static int parse_line(char *line, struct resctrl_schema *s, + struct rdtgroup *rdtgrp) +{ + ctrlval_parser_t *parse_ctrlval = get_parser(s->res); + enum resctrl_conf_type t = s->conf_type; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + struct rdt_parse_data data; + char *dom = NULL, *id; + struct rdt_domain *d; + unsigned long dom_id; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && + (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) { + rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n"); + return -EINVAL; + } + +next: + if (!line || line[0] == '\0') + return 0; + dom = strsep(&line, ";"); + id = strsep(&dom, "="); + if (!dom || kstrtoul(id, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain\n"); + return -EINVAL; + } + dom = strim(dom); + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { + data.buf = dom; + data.rdtgrp = rdtgrp; + if (parse_ctrlval(&data, s, d)) + return -EINVAL; + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + cfg = &d->staged_config[t]; + /* + * In pseudo-locking setup mode and just + * parsed a valid CBM that should be + * pseudo-locked. Only one locked region per + * resource group and domain so just do + * the required initialization for single + * region and return. 
+ */ + rdtgrp->plr->s = s; + rdtgrp->plr->d = d; + rdtgrp->plr->cbm = cfg->new_ctrl; + d->plr = rdtgrp->plr; + return 0; + } + goto next; + } + } + return -EINVAL; +} + +static int rdtgroup_parse_resource(char *resname, char *tok, + struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + + list_for_each_entry(s, &resctrl_schema_all, list) { + if (s->res->invisible) + continue; + + if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid) + return parse_line(tok, s, rdtgrp); + } + rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", resname); + return -EINVAL; +} + +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct resctrl_schema *s; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + char *tok, *resname; + int ret = 0; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); + + /* + * No changes to pseudo-locked region allowed. It has to be removed + * and re-created instead. + */ + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = -EINVAL; + rdt_last_cmd_puts("Resource group is pseudo-locked\n"); + goto out; + } + + rdt_staged_configs_clear(); + + while ((tok = strsep(&buf, "\n")) != NULL) { + resname = strim(strsep(&tok, ":")); + if (!tok) { + rdt_last_cmd_puts("Missing ':'\n"); + ret = -EINVAL; + goto out; + } + if (tok[0] == '\0') { + rdt_last_cmd_printf("Missing '%s' value\n", resname); + ret = -EINVAL; + goto out; + } + ret = rdtgroup_parse_resource(resname, tok, rdtgrp); + if (ret) + goto out; + } + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + + /* + * Writes to mba_sc resources update the software controller, + * not the control MSR. 
+ */ + if (is_mba_sc(r)) + continue; + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret) + goto out; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * If pseudo-locking fails we keep the resource group in + * mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service + * active and updated for just the domain the pseudo-locked + * region was requested for. + */ + ret = rdtgroup_pseudo_lock_create(rdtgrp); + } + +out: + rdt_staged_configs_clear(); + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + +static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) +{ + struct rdt_resource *r = schema->res; + struct rdt_domain *dom; + bool sep = false; + u32 ctrl_val; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_puts(s, ";"); + + if (is_mba_sc(r)) + ctrl_val = dom->mbps_val[closid]; + else + ctrl_val = resctrl_arch_get_config(r, dom, closid, + schema->conf_type); + + seq_printf(s, r->format_str, dom->id, max_data_width, + ctrl_val); + sep = true; + } + seq_puts(s, "\n"); +} + +int rdtgroup_schemata_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + struct rdtgroup *rdtgrp; + int ret = 0; + u32 closid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + list_for_each_entry(schema, &resctrl_schema_all, list) { + seq_printf(s, "%s:uninitialized\n", schema->name); + } + } else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%s:%d=%x\n", + rdtgrp->plr->s->res->name, + rdtgrp->plr->d->id, + rdtgrp->plr->cbm); + } + } else { + closid = rdtgrp->closid; + list_for_each_entry(schema, &resctrl_schema_all, 
list) { + if (schema->res->invisible) + continue; + + if (closid < schema->num_closid) + show_doms(s, schema, closid); + } + } + } else { + ret = -ENOENT; + } + rdtgroup_kn_unlock(of->kn); + return ret; +} + +static int smp_mon_event_count(void *arg) +{ + mon_event_count(arg); + + return 0; +} + +void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, + struct rdt_domain *d, struct rdtgroup *rdtgrp, + int evtid, int first) +{ + int cpu; + + /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + /* + * Setup the parameters to pass to mon_event_count() to read the data. + */ + rr->rgrp = rdtgrp; + rr->evtid = evtid; + rr->r = r; + rr->d = d; + rr->val = 0; + rr->first = first; + rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); + if (IS_ERR(rr->arch_mon_ctx)) { + rr->err = -EINVAL; + return; + } + + cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); + + /* + * cpumask_any_housekeeping() prefers housekeeping CPUs, but + * are all the CPUs nohz_full? If yes, pick a CPU to IPI. + * MPAM's resctrl_arch_rmid_read() is unable to read the + * counters on some platforms if its called in IRQ context. 
+ */ + if (tick_nohz_full_cpu(cpu) && !IS_ENABLED(CONFIG_ARM64_MPAM)) + smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); + else + smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); + + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); +} + +int rdtgroup_mondata_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + u32 resid, evtid, domid; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + union mon_data_bits md; + struct rdt_domain *d; + struct rmid_read rr; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto out; + } + + md.priv = of->kn->priv; + resid = md.u.rid; + domid = md.u.domid; + evtid = md.u.evtid; + + r = resctrl_arch_get_resource(resid); + d = resctrl_arch_find_domain(r, domid); + if (IS_ERR_OR_NULL(d)) { + ret = -ENOENT; + goto out; + } + + mon_event_read(&rr, r, d, rdtgrp, evtid, false); + + if (rr.err == -EIO) + seq_puts(m, "Error\n"); + else if (rr.err == -EINVAL) + seq_puts(m, "Unavailable\n"); + else + seq_printf(m, "%llu\n", rr.val); + +out: + rdtgroup_kn_unlock(of->kn); + return ret; +} diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h new file mode 100644 index 0000000000000000000000000000000000000000..d8498ed1480be0ed6ae60cb90e3ebace8c96534e --- /dev/null +++ b/fs/resctrl/internal.h @@ -0,0 +1,296 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _FS_RESCTRL_INTERNAL_H +#define _FS_RESCTRL_INTERNAL_H + +#include +#include +#include +#include +#include +#include + +#include + +/** + * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that + * aren't marked nohz_full + * @mask: The mask to pick a CPU from. + * @exclude_cpu:The CPU to avoid picking. + * + * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping + * CPUs that don't use nohz_full, these are preferred. Pass + * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs. 
+ * + * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available. + */ +static inline unsigned int +cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu) +{ + unsigned int cpu, hk_cpu; + + if (exclude_cpu == RESCTRL_PICK_ANY_CPU) + cpu = cpumask_any(mask); + else + cpu = cpumask_any_but(mask, exclude_cpu); + + /* Only continue if tick_nohz_full_mask has been initialized. */ + if (!tick_nohz_full_enabled()) + return cpu; + + /* If the CPU picked isn't marked nohz_full nothing more needs doing. */ + if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu)) + return cpu; + + /* Try to find a CPU that isn't nohz_full to use in preference */ + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); + if (hk_cpu == exclude_cpu) + hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask); + + if (hk_cpu < nr_cpu_ids) + cpu = hk_cpu; + + return cpu; +} + +struct rdt_fs_context { + struct kernfs_fs_context kfc; + bool enable_cdpl2; + bool enable_cdpl3; + bool enable_mba_mbps; + bool enable_debug; + bool enable_l2; +}; + +static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) +{ + struct kernfs_fs_context *kfc = fc->fs_private; + + return container_of(kfc, struct rdt_fs_context, kfc); +} + +/** + * union mon_data_bits - Monitoring details for each event file + * @priv: Used to store monitoring event data in @u + * as kernfs private data + * @rid: Resource id associated with the event file + * @evtid: Event id associated with the event file + * @domid: The domain to which the event file belongs + * @u: Name of the bit fields struct + */ +union mon_data_bits { + void *priv; + struct { + unsigned int rid : 10; + enum resctrl_event_id evtid : 8; + unsigned int domid : 14; + } u; +}; + +struct rmid_read { + struct rdtgroup *rgrp; + struct rdt_resource *r; + struct rdt_domain *d; + enum resctrl_event_id evtid; + bool first; + int err; + u64 val; + void *arch_mon_ctx; +}; + +extern struct list_head resctrl_schema_all; + +enum 
rdt_group_type { + RDTCTRL_GROUP = 0, + RDTMON_GROUP, + RDT_NUM_GROUP, +}; + +/** + * enum rdtgrp_mode - Mode of a RDT resource group + * @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations + * @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed + * @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking + * @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations + * allowed AND the allocations are Cache Pseudo-Locked + * @RDT_NUM_MODES: Total number of modes + * + * The mode of a resource group enables control over the allowed overlap + * between allocations associated with different resource groups (classes + * of service). User is able to modify the mode of a resource group by + * writing to the "mode" resctrl file associated with the resource group. + * + * The "shareable", "exclusive", and "pseudo-locksetup" modes are set by + * writing the appropriate text to the "mode" file. A resource group enters + * "pseudo-locked" mode after the schemata is written while the resource + * group is in "pseudo-locksetup" mode. + */ +enum rdtgrp_mode { + RDT_MODE_SHAREABLE = 0, + RDT_MODE_EXCLUSIVE, + RDT_MODE_PSEUDO_LOCKSETUP, + RDT_MODE_PSEUDO_LOCKED, + + /* Must be last */ + RDT_NUM_MODES, +}; + +/** + * struct mongroup - store mon group's data in resctrl fs. + * @mon_data_kn: kernfs node for the mon_data directory + * @parent: parent rdtgrp + * @crdtgrp_list: child rdtgroup node list + * @rmid: rmid for this rdtgroup + */ +struct mongroup { + struct kernfs_node *mon_data_kn; + struct rdtgroup *parent; + struct list_head crdtgrp_list; + u32 rmid; +}; + +/** + * struct rdtgroup - store rdtgroup's data in resctrl file system. 
+ * @kn: kernfs node + * @rdtgroup_list: linked list for all rdtgroups + * @closid: closid for this rdtgroup + * @cpu_mask: CPUs assigned to this rdtgroup + * @flags: status bits + * @waitcount: how many cpus expect to find this + * group when they acquire rdtgroup_mutex + * @type: indicates type of this rdtgroup - either + * monitor only or ctrl_mon group + * @mon: mongroup related data + * @mode: mode of resource group + * @plr: pseudo-locked region + */ +struct rdtgroup { + struct kernfs_node *kn; + struct list_head rdtgroup_list; + u32 closid; + struct cpumask cpu_mask; + int flags; + atomic_t waitcount; + enum rdt_group_type type; + struct mongroup mon; + enum rdtgrp_mode mode; + struct pseudo_lock_region *plr; +}; + +/* List of all resource groups */ +extern struct list_head rdt_all_groups; + +extern int max_name_width, max_data_width; + +/** + * struct rftype - describe each file in the resctrl file system + * @name: File name + * @mode: Access mode + * @kf_ops: File operations + * @flags: File specific RFTYPE_FLAGS_* flags + * @fflags: File specific RFTYPE_* flags + * @seq_show: Show content of the file + * @write: Write to the file + */ +struct rftype { + char *name; + umode_t mode; + const struct kernfs_ops *kf_ops; + unsigned long flags; + unsigned long fflags; + + int (*seq_show)(struct kernfs_open_file *of, + struct seq_file *sf, void *v); + /* + * write() is the generic write callback which maps directly to + * kernfs write operation and overrides all other operations. + * Maximum write size is determined by ->max_write_len. 
+ */ + ssize_t (*write)(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off); +}; + +/** + * struct mbm_state - status for each MBM counter in each domain + * @prev_bw_bytes: Previous bytes value read for bandwidth calculation + * @prev_bw: The most recent bandwidth in MBps + */ +struct mbm_state { + u64 prev_bw_bytes; + u32 prev_bw; +}; + +static inline bool is_mba_sc(struct rdt_resource *r) +{ + if (!r) + r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + /* + * The software controller support is only applicable to MBA resource. + * Make sure to check for resource type. + */ + if (r->rid != RDT_RESOURCE_MBA) + return false; + + return r->membw.mba_sc; +} + +extern struct mutex rdtgroup_mutex; +extern struct rdtgroup rdtgroup_default; +extern struct dentry *debugfs_resctrl; + +void rdt_last_cmd_clear(void); +void rdt_last_cmd_puts(const char *s); +__printf(1, 2) +void rdt_last_cmd_printf(const char *fmt, ...); + +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); +void rdtgroup_kn_unlock(struct kernfs_node *kn); +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask); +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off); +int rdtgroup_schemata_show(struct kernfs_open_file *of, + struct seq_file *s, void *v); +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, + unsigned long cbm, int closid, bool exclusive); +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d, + unsigned long cbm); +enum rdtgrp_mode rdtgroup_mode_by_closid(int closid); +int rdtgroup_tasks_assigned(struct rdtgroup *r); +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp); +int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp); +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm); +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain 
*d); +int rdt_pseudo_lock_init(void); +void rdt_pseudo_lock_release(void); +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); +int closids_supported(void); +bool closid_allocated(unsigned int closid); +bool resctrl_closid_is_dirty(u32 closid); +void closid_free(int closid); +int alloc_rmid(u32 closid); +void free_rmid(u32 closid, u32 rmid); +void resctrl_mon_resource_exit(void); +void mon_event_count(void *info); +int rdtgroup_mondata_show(struct seq_file *m, void *arg); +void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, + struct rdt_domain *d, struct rdtgroup *rdtgrp, + int evtid, int first); +int resctrl_mon_resource_init(void); +void mbm_setup_overflow_handler(struct rdt_domain *dom, + unsigned long delay_ms, + int exclude_cpu); +void mbm_handle_overflow(struct work_struct *work); +void setup_default_ctrlval(struct rdt_resource *r, u32 *dc); +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu); +void cqm_handle_limbo(struct work_struct *work); +bool has_busy_rmid(struct rdt_domain *d); +void __check_limbo(struct rdt_domain *d, bool force_free); +void rdt_staged_configs_clear(void); +int resctrl_find_cleanest_closid(void); +struct rmid_entry *resctrl_find_free_rmid(u32 closid); + +#endif /* _FS_RESCTRL_INTERNAL_H */ diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c new file mode 100644 index 0000000000000000000000000000000000000000..4f55614bbf8c67f6bb7698c7dacfe5fef3299710 --- /dev/null +++ b/fs/resctrl/monitor.c @@ -0,0 +1,814 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Resource Director Technology(RDT) + * - Monitoring code + * + * Copyright (C) 2017 Intel Corporation + * + * Author: + * Vikas Shivappa + * + * This replaces the cqm.c based on perf but we reuse a lot of + * code and datastructures originally from Peter Zijlstra and Matt Fleming. 
+ * + * More information about RDT be found in the Intel (R) x86 Architecture + * Software Developer Manual June 2016, volume 3, section 17.17. + */ + +#include +#include +#include +#include +#include "internal.h" + +/* + * struct rmid_entry - dirty tracking for all RMID. + * @closid: The CLOSID for this entry. + * @rmid: The RMID for this entry. + * @busy: The number of domains with cached data using this RMID. + * @list: Member of the rmid_free_lru list when busy == 0. + * + * Depending on the architecture the correct monitor is accessed using + * both @closid and @rmid, or @rmid only. + * + * Take the rdtgroup_mutex when accessing. + */ +struct rmid_entry { + u32 closid; + u32 rmid; + int busy; + struct list_head list; +}; + +/* + * @rmid_free_lru - A least recently used list of free RMIDs + * These RMIDs are guaranteed to have an occupancy less than the + * threshold occupancy + */ +static LIST_HEAD(rmid_free_lru); + +/** + * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has. + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. + * Indexed by CLOSID. Protected by rdtgroup_mutex. + */ +static u32 *closid_num_dirty_rmid; + +/* + * @rmid_limbo_count - count of currently unused but (potentially) + * dirty RMIDs. + * This counts RMIDs that no one is currently using but that + * may have a occupancy value > resctrl_rmid_realloc_threshold. User can + * change the threshold occupancy value. + */ +static unsigned int rmid_limbo_count; + +/* + * @rmid_entry - The entry in the limbo and free lists. + */ +static struct rmid_entry *rmid_ptrs; + +/* + * This is the threshold cache occupancy in bytes at which we will consider an + * RMID available for re-allocation. + */ +unsigned int resctrl_rmid_realloc_threshold; + +/* + * This is the maximum value for the reallocation threshold, in bytes. + */ +unsigned int resctrl_rmid_realloc_limit; + +/* + * x86 and arm64 differ in their handling of monitoring. 
+ * x86's RMID are independent numbers, there is only one source of traffic + * with an RMID value of '1'. + * arm64's PMG extends the PARTID/CLOSID space, there are multiple sources of + * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID + * value is no longer unique. + * To account for this, resctrl uses an index. On x86 this is just the RMID, + * on arm64 it encodes the CLOSID and RMID. This gives a unique number. + * + * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code + * must accept an attempt to read every index. + */ +static inline struct rmid_entry *__rmid_entry(u32 idx) +{ + struct rmid_entry *entry; + u32 closid, rmid; + + entry = &rmid_ptrs[idx]; + resctrl_arch_rmid_idx_decode(idx, &closid, &rmid); + + WARN_ON_ONCE(entry->closid != closid); + WARN_ON_ONCE(entry->rmid != rmid); + + return entry; +} + +static void limbo_release_entry(struct rmid_entry *entry) +{ + lockdep_assert_held(&rdtgroup_mutex); + + rmid_limbo_count--; + list_add_tail(&entry->list, &rmid_free_lru); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]--; +} + +/* + * Check the RMIDs that are marked as busy for this domain. If the + * reported LLC occupancy is below the threshold clear the busy bit and + * decrement the count. 
If the busy count gets to zero on an RMID, we + * free the RMID + */ +void __check_limbo(struct rdt_domain *d, bool force_free) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + struct rmid_entry *entry; + u32 idx, cur_idx = 1; + void *arch_mon_ctx; + bool rmid_dirty; + u64 val = 0; + + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID); + if (IS_ERR(arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(arch_mon_ctx)); + return; + } + + /* + * Skip RMID 0 and start from RMID 1 and check all the RMIDs that + * are marked as busy for occupancy < threshold. If the occupancy + * is less than the threshold decrement the busy counter of the + * RMID and move it to the free list when the counter reaches 0. + */ + for (;;) { + idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx); + if (idx >= idx_limit) + break; + + entry = __rmid_entry(idx); + if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, + QOS_L3_OCCUP_EVENT_ID, &val, + arch_mon_ctx)) { + rmid_dirty = true; + } else { + rmid_dirty = (val >= resctrl_rmid_realloc_threshold); + } + + if (force_free || !rmid_dirty) { + clear_bit(idx, d->rmid_busy_llc); + if (!--entry->busy) + limbo_release_entry(entry); + } + cur_idx = idx + 1; + } + + resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); +} + +bool has_busy_rmid(struct rdt_domain *d) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + + return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; +} + +struct rmid_entry *resctrl_find_free_rmid(u32 closid) +{ + struct rmid_entry *itr; + u32 itr_idx, cmp_idx; + + if (list_empty(&rmid_free_lru)) + return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC); + + list_for_each_entry(itr, &rmid_free_lru, list) { + /* + * Get the index of this free RMID, and the index it would need + * to be if it were used with this CLOSID. 
+ * If the CLOSID is irrelevant on this architecture, the two + * index values are always the same on every entry and thus the + * very first entry will be returned. + */ + itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); + cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); + + if (itr_idx == cmp_idx) + return itr; + } + + return ERR_PTR(-ENOSPC); +} + +/** + * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated + * RMID are clean, or the CLOSID that has + * the most clean RMID. + * + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID + * may not be able to allocate clean RMID. To avoid this the allocator will + * choose the CLOSID with the most clean RMID. + * + * When the CLOSID and RMID are independent numbers, the first free CLOSID will + * be returned. + */ +int resctrl_find_cleanest_closid(void) +{ + u32 cleanest_closid = ~0; + int i = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + return -EIO; + + for (i = 0; i < closids_supported(); i++) { + int num_dirty; + + if (closid_allocated(i)) + continue; + + num_dirty = closid_num_dirty_rmid[i]; + if (num_dirty == 0) + return i; + + if (cleanest_closid == ~0) + cleanest_closid = i; + + if (num_dirty < closid_num_dirty_rmid[cleanest_closid]) + cleanest_closid = i; + } + + if (cleanest_closid == ~0) + return -ENOSPC; + + return cleanest_closid; +} + +/* + * For MPAM the RMID value is not unique, and has to be considered with + * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which + * allows all domains to be managed by a single free list. + * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler. 
+ */ +int alloc_rmid(u32 closid) +{ + struct rmid_entry *entry; + + lockdep_assert_held(&rdtgroup_mutex); + + entry = resctrl_find_free_rmid(closid); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + list_del(&entry->list); + return entry->rmid; +} + +static void add_rmid_to_limbo(struct rmid_entry *entry) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_domain *d; + u32 idx; + + lockdep_assert_held(&rdtgroup_mutex); + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); + + entry->busy = 0; + list_for_each_entry(d, &r->domains, list) { + /* + * For the first limbo RMID in the domain, + * set up the limbo worker. + */ + if (!has_busy_rmid(d)) + cqm_setup_limbo_handler(d, 0, + RESCTRL_PICK_ANY_CPU); + set_bit(idx, d->rmid_busy_llc); + entry->busy++; + } + + rmid_limbo_count++; + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]++; +} + +void free_rmid(u32 closid, u32 rmid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct rmid_entry *entry; + + lockdep_assert_held(&rdtgroup_mutex); + + /* + * Do not allow the default rmid to be free'd. Comparing by index + * allows architectures that ignore the closid parameter to avoid an + * unnecessary check. 
+ */ + if (!resctrl_arch_mon_capable() || + idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID)) + return; + + entry = __rmid_entry(idx); + + if (resctrl_arch_is_llc_occupancy_enabled()) + add_rmid_to_limbo(entry); + else + list_add_tail(&entry->list, &rmid_free_lru); +} + +static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, + u32 rmid, enum resctrl_event_id evtid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + + switch (evtid) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return &d->mbm_total[idx]; + case QOS_L3_MBM_LOCAL_EVENT_ID: + return &d->mbm_local[idx]; + default: + return NULL; + } +} + +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) +{ + struct mbm_state *m; + u64 tval = 0; + + if (rr->first) { + resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid); + m = get_mbm_state(rr->d, closid, rmid, rr->evtid); + if (m) + memset(m, 0, sizeof(struct mbm_state)); + return 0; + } + + rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid, + &tval, rr->arch_mon_ctx); + if (rr->err) + return rr->err; + + rr->val += tval; + + return 0; +} + +/* + * mbm_bw_count() - Update bw count from values previously read by + * __mon_event_count(). + * @closid: The closid used to identify the cached mbm_state. + * @rmid: The rmid used to identify the cached mbm_state. + * @rr: The struct rmid_read populated by __mon_event_count(). + * + * Supporting function to calculate the memory bandwidth + * and delta bandwidth in MBps. The chunks value previously read by + * __mon_event_count() is compared with the chunks value from the previous + * invocation. This must be called once per second to maintain values in MBps. 
+ */ +static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct mbm_state *m = &rr->d->mbm_local[idx]; + u64 cur_bw, bytes, cur_bytes; + + cur_bytes = rr->val; + bytes = cur_bytes - m->prev_bw_bytes; + m->prev_bw_bytes = cur_bytes; + + cur_bw = bytes / SZ_1M; + + m->prev_bw = cur_bw; +} + +/* + * This is scheduled by mon_event_read() to read the CQM/MBM counters + * on a domain. + */ +void mon_event_count(void *info) +{ + struct rdtgroup *rdtgrp, *entry; + struct rmid_read *rr = info; + struct list_head *head; + int ret; + + rdtgrp = rr->rgrp; + + ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr); + + /* + * For Ctrl groups read data from child monitor groups and + * add them together. Count events which are read successfully. + * Discard the rmid_read's reporting errors. + */ + head = &rdtgrp->mon.crdtgrp_list; + + if (rdtgrp->type == RDTCTRL_GROUP) { + list_for_each_entry(entry, head, mon.crdtgrp_list) { + if (__mon_event_count(entry->closid, entry->mon.rmid, + rr) == 0) + ret = 0; + } + } + + /* + * __mon_event_count() calls for newly created monitor groups may + * report -EINVAL/Unavailable if the monitor hasn't seen any traffic. + * Discard error if any of the monitor event reads succeeded. + */ + if (ret == 0) + rr->err = 0; +} + +/* + * Feedback loop for MBA software controller (mba_sc) + * + * mba_sc is a feedback loop where we periodically read MBM counters and + * adjust the bandwidth percentage values via the IA32_MBA_THRTL_MSRs so + * that: + * + * current bandwidth(cur_bw) < user specified bandwidth(user_bw) + * + * This uses the MBM counters to measure the bandwidth and MBA throttle + * MSRs to control the bandwidth for a particular rdtgrp. It builds on the + * fact that resctrl rdtgroups have both monitoring and control. + * + * The frequency of the checks is 1s and we just tag along the MBM overflow + * timer. 
Having a 1s interval makes the calculation of bandwidth simpler. + * + * Although MBA's goal is to restrict the bandwidth to a maximum, there may + * be a need to increase the bandwidth to avoid unnecessarily restricting + * the L2 <-> L3 traffic. + * + * Since MBA controls the L2 external bandwidth whereas MBM measures the + * L3 external bandwidth, the following sequence could lead to such a + * situation. + * + * Consider an rdtgroup which had high L3 <-> memory traffic in initial + * phases -> mba_sc kicks in and reduced bandwidth percentage values -> but + * after some time rdtgroup has mostly L2 <-> L3 traffic. + * + * In this case we may restrict the rdtgroup's L2 <-> L3 traffic as its + * throttle MSRs already have low percentage values. To avoid + * unnecessarily restricting such rdtgroups, we also increase the bandwidth. + */ +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) +{ + u32 closid, rmid, cur_msr_val, new_msr_val; + struct mbm_state *pmbm_data, *cmbm_data; + struct rdt_resource *r_mba; + struct rdt_domain *dom_mba; + u32 cur_bw, user_bw, idx; + struct list_head *head; + struct rdtgroup *entry; + + if (!resctrl_arch_is_mbm_local_enabled()) + return; + + r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + closid = rgrp->closid; + rmid = rgrp->mon.rmid; + idx = resctrl_arch_rmid_idx_encode(closid, rmid); + pmbm_data = &dom_mbm->mbm_local[idx]; + + dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba); + if (!dom_mba) { + pr_warn_once("Failure to get domain for MBA update\n"); + return; + } + + cur_bw = pmbm_data->prev_bw; + user_bw = dom_mba->mbps_val[closid]; + + /* MBA resource doesn't support CDP */ + cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE); + + /* + * For Ctrl groups read data from child monitor groups. 
+ */ + head = &rgrp->mon.crdtgrp_list; + list_for_each_entry(entry, head, mon.crdtgrp_list) { + cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; + cur_bw += cmbm_data->prev_bw; + } + + /* + * Scale up/down the bandwidth linearly for the ctrl group. The + * bandwidth step is the bandwidth granularity specified by the + * hardware. + * Always increase throttling if current bandwidth is above the + * target set by user. + * But avoid thrashing up and down on every poll by checking + * whether a decrease in throttling is likely to push the group + * back over target. E.g. if currently throttling to 30% of bandwidth + * on a system with 10% granularity steps, check whether moving to + * 40% would go past the limit by multiplying current bandwidth by + * "(30 + 10) / 30". + */ + if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) { + new_msr_val = cur_msr_val - r_mba->membw.bw_gran; + } else if (cur_msr_val < MAX_MBA_BW && + (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) { + new_msr_val = cur_msr_val + r_mba->membw.bw_gran; + } else { + return; + } + + resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val); +} + +static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid) +{ + struct rmid_read rr; + + rr.first = false; + rr.r = r; + rr.d = d; + + /* + * This is protected from concurrent reads from user + * as both the user and we hold the global mutex. 
+ */ + if (resctrl_arch_is_mbm_total_enabled()) { + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } + if (resctrl_arch_is_mbm_local_enabled()) { + rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + /* + * Call the MBA software controller only for the + * control groups and when user has enabled + * the software controller explicitly. + */ + if (is_mba_sc(NULL)) + mbm_bw_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } + if (resctrl_arch_is_mbm_core_enabled()) { + rr.evtid = QOS_L2_MBM_CORE_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } +} + +/* + * Handler to scan the limbo list and move the RMIDs + * to free list whose occupancy < threshold_occupancy. 
+ */ +void cqm_handle_limbo(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); + struct rdt_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + d = container_of(work, struct rdt_domain, cqm_limbo.work); + + __check_limbo(d, false); + + if (has_busy_rmid(d)) { + d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, + delay); + } + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this + * domain. + * @dom: The domain the limbo handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); + dom->cqm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); +} + +void mbm_handle_overflow(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + struct rdt_resource *r; + struct rdt_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * If the filesystem has been unmounted this work no longer needs to + * run. 
+ */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + goto out_unlock; + + r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + d = container_of(work, struct rdt_domain, mbm_over.work); + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) + mbm_update(r, d, crgrp->closid, crgrp->mon.rmid); + + if (is_mba_sc(NULL)) + update_mba_bw(prgrp, d); + } + + /* + * Re-check for housekeeping CPUs. This allows the overflow handler to + * move off a nohz_full CPU quickly. + */ + d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this + * domain. + * @dom: The domain the overflow handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + /* + * When a domain comes online there is no guarantee the filesystem is + * mounted. If not, there is no need to catch counter overflow. 
+ */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + return; + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); + dom->mbm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->mbm_over, delay); +} + +static int dom_data_init(struct rdt_resource *r) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rmid_entry *entry = NULL; + int err = 0, i; + u32 idx; + + mutex_lock(&rdtgroup_mutex); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + u32 *tmp; + + /* + * If the architecture hasn't provided a sanitised value here, + * this may result in larger arrays than necessary. Resctrl will + * use a smaller system wide value based on the resources in + * use. + */ + tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); + if (!tmp) { + err = -ENOMEM; + goto out_unlock; + } + + closid_num_dirty_rmid = tmp; + } + + rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); + if (!rmid_ptrs) { + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + err = -ENOMEM; + goto out_unlock; + } + + for (i = 0; i < idx_limit; i++) { + entry = &rmid_ptrs[i]; + INIT_LIST_HEAD(&entry->list); + + resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); + list_add_tail(&entry->list, &rmid_free_lru); + } + + /* + * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and + * are always allocated. These are used for the rdtgroup_default + * control group, which will be setup later in rdtgroup_init(). 
+ */ + idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID); + entry = __rmid_entry(idx); + list_del(&entry->list); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +static void dom_data_exit(void) +{ + mutex_lock(&rdtgroup_mutex); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + + kfree(rmid_ptrs); + rmid_ptrs = NULL; + + mutex_unlock(&rdtgroup_mutex); +} + +int resctrl_mon_resource_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + int ret; + + if (!r->mon_capable) + return 0; + + ret = dom_data_init(r); + if (ret) + return ret; + + return resctrl_arch_mon_resource_init(); +} + +void resctrl_mon_resource_exit(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + + if (!r->mon_capable) + return; + + dom_data_exit(); +} diff --git a/fs/resctrl/psuedo_lock.c b/fs/resctrl/psuedo_lock.c new file mode 100644 index 0000000000000000000000000000000000000000..f2737fd562cf52c5bbf86d57fcccde93a5e2e6a2 --- /dev/null +++ b/fs/resctrl/psuedo_lock.c @@ -0,0 +1,1134 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resource Director Technology (RDT) + * + * Pseudo-locking support built on top of Cache Allocation Technology (CAT) + * + * Copyright (C) 2018 Intel Corporation + * + * Author: Reinette Chatre + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "internal.h" + +/* + * Major number assigned to and shared by all devices exposing + * pseudo-locked regions. 
+ */ +static unsigned int pseudo_lock_major; +static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0); + +static char *pseudo_lock_devnode(const struct device *dev, umode_t *mode) +{ + const struct rdtgroup *rdtgrp; + + rdtgrp = dev_get_drvdata(dev); + if (mode) + *mode = 0600; + return kasprintf(GFP_KERNEL, "pseudo_lock/%s", rdtgrp->kn->name); +} + +static const struct class pseudo_lock_class = { + .name = "pseudo_lock", + .devnode = pseudo_lock_devnode, +}; + +/** + * pseudo_lock_minor_get - Obtain available minor number + * @minor: Pointer to where new minor number will be stored + * + * A bitmask is used to track available minor numbers. Here the next free + * minor number is marked as unavailable and returned. + * + * Return: 0 on success, <0 on failure. + */ +static int pseudo_lock_minor_get(unsigned int *minor) +{ + unsigned long first_bit; + + first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS); + + if (first_bit == MINORBITS) + return -ENOSPC; + + __clear_bit(first_bit, &pseudo_lock_minor_avail); + *minor = first_bit; + + return 0; +} + +/** + * pseudo_lock_minor_release - Return minor number to available + * @minor: The minor number made available + */ +static void pseudo_lock_minor_release(unsigned int minor) +{ + __set_bit(minor, &pseudo_lock_minor_avail); +} + +/** + * region_find_by_minor - Locate a pseudo-lock region by inode minor number + * @minor: The minor number of the device representing pseudo-locked region + * + * When the character device is accessed we need to determine which + * pseudo-locked region it belongs to. This is done by matching the minor + * number of the device to the pseudo-locked region it belongs. + * + * Minor numbers are assigned at the time a pseudo-locked region is associated + * with a cache instance. + * + * Return: On success return pointer to resource group owning the pseudo-locked + * region, NULL on failure. 
+ */ +static struct rdtgroup *region_find_by_minor(unsigned int minor) +{ + struct rdtgroup *rdtgrp, *rdtgrp_match = NULL; + + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (rdtgrp->plr && rdtgrp->plr->minor == minor) { + rdtgrp_match = rdtgrp; + break; + } + } + return rdtgrp_match; +} + +/** + * struct pseudo_lock_pm_req - A power management QoS request list entry + * @list: Entry within the @pm_reqs list for a pseudo-locked region + * @req: PM QoS request + */ +struct pseudo_lock_pm_req { + struct list_head list; + struct dev_pm_qos_request req; +}; + +static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req, *next; + + list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) { + dev_pm_qos_remove_request(&pm_req->req); + list_del(&pm_req->list); + kfree(pm_req); + } +} + +/** + * pseudo_lock_cstates_constrain - Restrict cores from entering C6 + * @plr: Pseudo-locked region + * + * To prevent the cache from being affected by power management entering + * C6 has to be avoided. This is accomplished by requesting a latency + * requirement lower than lowest C6 exit latency of all supported + * platforms as found in the cpuidle state tables in the intel_idle driver. + * At this time it is possible to do so with a single latency requirement + * for all supported platforms. + * + * Since Goldmont is supported, which is affected by X86_BUG_MONITOR, + * the ACPI latencies need to be considered while keeping in mind that C2 + * may be set to map to deeper sleep states. In this case the latency + * requirement needs to prevent entering C2 also. 
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req; + int cpu; + int ret; + + for_each_cpu(cpu, &plr->d->cpu_mask) { + pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); + if (!pm_req) { + rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); + ret = -ENOMEM; + goto out_err; + } + ret = dev_pm_qos_add_request(get_cpu_device(cpu), + &pm_req->req, + DEV_PM_QOS_RESUME_LATENCY, + 30); + if (ret < 0) { + rdt_last_cmd_printf("Failed to add latency req CPU%d\n", + cpu); + kfree(pm_req); + ret = -1; + goto out_err; + } + list_add(&pm_req->list, &plr->pm_reqs); + } + + return 0; + +out_err: + pseudo_lock_cstates_relax(plr); + return ret; +} + +/** + * pseudo_lock_region_clear - Reset pseudo-lock region data + * @plr: pseudo-lock region + * + * All content of the pseudo-locked region is reset - any memory allocated + * is freed. + * + * Return: void + */ +static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) +{ + plr->size = 0; + plr->line_size = 0; + kfree(plr->kmem); + plr->kmem = NULL; + plr->s = NULL; + if (plr->d) + plr->d->plr = NULL; + plr->d = NULL; + plr->cbm = 0; + plr->debugfs_dir = NULL; +} + +/** + * pseudo_lock_region_init - Initialize pseudo-lock region information + * @plr: pseudo-lock region + * + * Called after user provided a schemata to be pseudo-locked. From the + * schemata the &struct pseudo_lock_region is on entry already initialized + * with the resource, domain, and capacity bitmask. Here the information + * required for pseudo-locking is deduced from this data and &struct + * pseudo_lock_region initialized further. 
This information includes: + * - size in bytes of the region to be pseudo-locked + * - cache line size to know the stride with which data needs to be accessed + * to be pseudo-locked + * - a cpu associated with the cache instance on which the pseudo-locking + * flow can be executed + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_init(struct pseudo_lock_region *plr) +{ + struct cpu_cacheinfo *ci; + int ret; + int i; + + /* Pick the first cpu we find that is associated with the cache. */ + plr->cpu = cpumask_first(&plr->d->cpu_mask); + + if (!cpu_online(plr->cpu)) { + rdt_last_cmd_printf("CPU %u associated with cache not online\n", + plr->cpu); + ret = -ENODEV; + goto out_region; + } + + ci = get_cpu_cacheinfo(plr->cpu); + + plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == plr->s->res->cache_level) { + plr->line_size = ci->info_list[i].coherency_line_size; + return 0; + } + } + + ret = -1; + rdt_last_cmd_puts("Unable to determine cache line size\n"); +out_region: + pseudo_lock_region_clear(plr); + return ret; +} + +/** + * pseudo_lock_init - Initialize a pseudo-lock region + * @rdtgrp: resource group to which new pseudo-locked region will belong + * + * A pseudo-locked region is associated with a resource group. When this + * association is created the pseudo-locked region is initialized. The + * details of the pseudo-locked region are not known at this time so only + * allocation is done and association established. 
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_init(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr; + + plr = kzalloc(sizeof(*plr), GFP_KERNEL); + if (!plr) + return -ENOMEM; + + init_waitqueue_head(&plr->lock_thread_wq); + INIT_LIST_HEAD(&plr->pm_reqs); + rdtgrp->plr = plr; + return 0; +} + +/** + * pseudo_lock_region_alloc - Allocate kernel memory that will be pseudo-locked + * @plr: pseudo-lock region + * + * Initialize the details required to set up the pseudo-locked region and + * allocate the contiguous memory that will be pseudo-locked to the cache. + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr) +{ + int ret; + + ret = pseudo_lock_region_init(plr); + if (ret < 0) + return ret; + + /* + * We do not yet support contiguous regions larger than + * KMALLOC_MAX_SIZE. + */ + if (plr->size > KMALLOC_MAX_SIZE) { + rdt_last_cmd_puts("Requested region exceeds maximum size\n"); + ret = -E2BIG; + goto out_region; + } + + plr->kmem = kzalloc(plr->size, GFP_KERNEL); + if (!plr->kmem) { + rdt_last_cmd_puts("Unable to allocate memory\n"); + ret = -ENOMEM; + goto out_region; + } + + ret = 0; + goto out; +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * pseudo_lock_free - Free a pseudo-locked region + * @rdtgrp: resource group to which pseudo-locked region belonged + * + * The pseudo-locked region's resources have already been released, or not + * yet created at this point. Now it can be freed and disassociated from the + * resource group. 
+ * + * Return: void + */ +static void pseudo_lock_free(struct rdtgroup *rdtgrp) +{ + pseudo_lock_region_clear(rdtgrp->plr); + kfree(rdtgrp->plr); + rdtgrp->plr = NULL; +} + +/** + * rdtgroup_monitor_in_progress - Test if monitoring in progress + * @rdtgrp: resource group being queried + * + * Return: 1 if monitor groups have been created for this resource + * group, 0 otherwise. + */ +static int rdtgroup_monitor_in_progress(struct rdtgroup *rdtgrp) +{ + return !list_empty(&rdtgrp->mon.crdtgrp_list); +} + +/** + * rdtgroup_locksetup_user_restrict - Restrict user access to group + * @rdtgrp: resource group needing access restricted + * + * A resource group used for cache pseudo-locking cannot have cpus or tasks + * assigned to it. This is communicated to the user by restricting access + * to all the files that can be used to make such changes. + * + * Permissions restored with rdtgroup_locksetup_user_restore() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restriction of access an attempt will be made to restore permissions but + * the state of the mode of these files will be uncertain when a failure + * occurs. 
+ */ +static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); +err_cpus: + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); +err_tasks: + rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); +out: + return ret; +} + +/** + * rdtgroup_locksetup_user_restore - Restore user access to group + * @rdtgrp: resource group needing access restored + * + * Restore all file access previously removed using + * rdtgroup_locksetup_user_restrict() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restoration of access an attempt will be made to restrict permissions + * again but the state of the mode of these files will be uncertain when + * a failure occurs. 
+ */ +static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); +err_cpus: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); +err_tasks: + rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); +out: + return ret; +} + +/** + * rdtgroup_locksetup_enter - Resource group enters locksetup mode + * @rdtgrp: resource group requested to enter locksetup mode + * + * A resource group enters locksetup mode to reflect that it would be used + * to represent a pseudo-locked region and is in the process of being set + * up to do so. A resource group used for a pseudo-locked region would + * lose the closid associated with it so we cannot allow it to have any + * tasks or cpus assigned nor permit tasks or cpus to be assigned in the + * future. Monitoring of a pseudo-locked region is not allowed either. + * + * The above and more restrictions on a pseudo-locked region are checked + * for and enforced before the resource group enters the locksetup mode. + * + * Returns: 0 if the resource group successfully entered locksetup mode, <0 + * on failure. On failure the last_cmd_status buffer is updated with text to + * communicate details of failure to the user. + */ +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) +{ + int ret; + + /* + * The default resource group can neither be removed nor lose the + * default closid associated with it. 
+ */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Cannot pseudo-lock default group\n"); + return -EINVAL; + } + + /* + * Cache Pseudo-locking not supported when CDP is enabled. + * + * Some things to consider if you would like to enable this + * support (using L3 CDP as example): + * - When CDP is enabled two separate resources are exposed, + * L3DATA and L3CODE, but they are actually on the same cache. + * The implication for pseudo-locking is that if a + * pseudo-locked region is created on a domain of one + * resource (eg. L3CODE), then a pseudo-locked region cannot + * be created on that same domain of the other resource + * (eg. L3DATA). This is because the creation of a + * pseudo-locked region involves a call to wbinvd that will + * affect all cache allocations on particular domain. + * - Considering the previous, it may be possible to only + * expose one of the CDP resources to pseudo-locking and + * hide the other. For example, we could consider to only + * expose L3DATA and since the L3 cache is unified it is + * still possible to place instructions there and execute them. + * - If only one region is exposed to pseudo-locking we should + * still keep in mind that availability of a portion of cache + * for pseudo-locking should take into account both resources. + * Similarly, if a pseudo-locked region is created in one + * resource, the portion of cache used by it should be made + * unavailable to all future allocations from both resources. + */ + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3) || + resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) { + rdt_last_cmd_puts("CDP enabled\n"); + return -EINVAL; + } + + /* + * Not knowing the bits to disable prefetching implies that this + * platform does not support Cache Pseudo-Locking. 
+ */ + if (resctrl_arch_get_prefetch_disable_bits() == 0) { + rdt_last_cmd_puts("Pseudo-locking not supported\n"); + return -EINVAL; + } + + if (rdtgroup_monitor_in_progress(rdtgrp)) { + rdt_last_cmd_puts("Monitoring in progress\n"); + return -EINVAL; + } + + if (rdtgroup_tasks_assigned(rdtgrp)) { + rdt_last_cmd_puts("Tasks assigned to resource group\n"); + return -EINVAL; + } + + if (!cpumask_empty(&rdtgrp->cpu_mask)) { + rdt_last_cmd_puts("CPUs assigned to resource group\n"); + return -EINVAL; + } + + if (rdtgroup_locksetup_user_restrict(rdtgrp)) { + rdt_last_cmd_puts("Unable to modify resctrl permissions\n"); + return -EIO; + } + + ret = pseudo_lock_init(rdtgrp); + if (ret) { + rdt_last_cmd_puts("Unable to init pseudo-lock region\n"); + goto out_release; + } + + /* + * If this system is capable of monitoring a rmid would have been + * allocated when the control group was created. This is not needed + * anymore when this group would be used for pseudo-locking. This + * is safe to call on platforms not capable of monitoring. + */ + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + ret = 0; + goto out; + +out_release: + rdtgroup_locksetup_user_restore(rdtgrp); +out: + return ret; +} + +/** + * rdtgroup_locksetup_exit - resource group exits locksetup mode + * @rdtgrp: resource group + * + * When a resource group exits locksetup mode the earlier restrictions are + * lifted. 
+ * + * Return: 0 on success, <0 on failure + */ +int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + + if (resctrl_arch_mon_capable()) { + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + } + + ret = rdtgroup_locksetup_user_restore(rdtgrp); + if (ret) { + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + pseudo_lock_free(rdtgrp); + return 0; +} + +/** + * rdtgroup_cbm_overlaps_pseudo_locked - Test if CBM or portion is pseudo-locked + * @d: RDT domain + * @cbm: CBM to test + * + * @d represents a cache instance and @cbm a capacity bitmask that is + * considered for it. Determine if @cbm overlaps with any existing + * pseudo-locked region on @d. + * + * @cbm is unsigned long, even if only 32 bits are used, to make the + * bitmap functions work correctly. + * + * Return: true if @cbm overlaps with pseudo-locked region on @d, false + * otherwise. + */ +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm) +{ + unsigned int cbm_len; + unsigned long cbm_b; + + if (d->plr) { + cbm_len = d->plr->s->res->cache.cbm_len; + cbm_b = d->plr->cbm; + if (bitmap_intersects(&cbm, &cbm_b, cbm_len)) + return true; + } + return false; +} + +/** + * rdtgroup_pseudo_locked_in_hierarchy - Pseudo-locked region in cache hierarchy + * @d: RDT domain under test + * + * The setup of a pseudo-locked region affects all cache instances within + * the hierarchy of the region. It is thus essential to know if any + * pseudo-locked regions exist within a cache hierarchy to prevent any + * attempts to create new pseudo-locked regions in the same hierarchy. + * + * Return: true if a pseudo-locked region exists in the hierarchy of @d or + * if it is not possible to test due to memory allocation issue, + * false otherwise. 
+ */ +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) +{ + cpumask_var_t cpu_with_psl; + enum resctrl_res_level i; + struct rdt_resource *r; + struct rdt_domain *d_i; + bool ret = false; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return false; + + if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) + return true; + + /* + * First determine which cpus have pseudo-locked regions + * associated with them. + */ + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + list_for_each_entry(d_i, &r->domains, list) { + if (d_i->plr) + cpumask_or(cpu_with_psl, cpu_with_psl, + &d_i->cpu_mask); + } + } + + /* + * Next test if new pseudo-locked region would intersect with + * existing region. + */ + if (cpumask_intersects(&d->cpu_mask, cpu_with_psl)) + ret = true; + + free_cpumask_var(cpu_with_psl); + return ret; +} + +/** + * pseudo_lock_measure_cycles - Trigger latency measure to pseudo-locked region + * @rdtgrp: Resource group to which the pseudo-locked region belongs. + * @sel: Selector of which measurement to perform on a pseudo-locked region. + * + * The measurement of latency to access a pseudo-locked region should be + * done from a cpu that is associated with that pseudo-locked region. + * Determine which cpu is associated with this region and start a thread on + * that cpu to perform the measurement, wait for that thread to complete. 
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int cpu; + int ret = -1; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out; + } + + if (!plr->d) { + ret = -ENODEV; + goto out; + } + + plr->thread_done = 0; + cpu = cpumask_first(&plr->d->cpu_mask); + if (!cpu_online(cpu)) { + ret = -ENODEV; + goto out; + } + + plr->cpu = cpu; + + if (sel == 1) + thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 2) + thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 3) + thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else + goto out; + + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + goto out; + } + kthread_bind(thread, cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) + goto out; + + ret = 0; + +out: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +static ssize_t pseudo_lock_measure_trigger(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct rdtgroup *rdtgrp = file->private_data; + size_t buf_size; + char buf[32]; + int ret; + int sel; + + buf_size = min(count, (sizeof(buf) - 1)); + if (copy_from_user(buf, user_buf, buf_size)) + return -EFAULT; + + buf[buf_size] = '\0'; + ret = kstrtoint(buf, 10, &sel); + if (ret == 0) { + if (sel != 1 && sel != 2 && sel != 3) + return -EINVAL; + ret = debugfs_file_get(file->f_path.dentry); + if (ret) + return ret; + ret = pseudo_lock_measure_cycles(rdtgrp, sel); + if (ret == 0) + ret = count; + 
debugfs_file_put(file->f_path.dentry); + } + + return ret; +} + +static const struct file_operations pseudo_measure_fops = { + .write = pseudo_lock_measure_trigger, + .open = simple_open, + .llseek = default_llseek, +}; + +/** + * rdtgroup_pseudo_lock_create - Create a pseudo-locked region + * @rdtgrp: resource group to which pseudo-lock region belongs + * + * Called when a resource group in the pseudo-locksetup mode receives a + * valid schemata that should be pseudo-locked. Since the resource group is + * in pseudo-locksetup mode the &struct pseudo_lock_region has already been + * allocated and initialized with the essential information. If a failure + * occurs the resource group remains in the pseudo-locksetup mode with the + * &struct pseudo_lock_region associated with it, but cleared from all + * information and ready for the user to re-attempt pseudo-locking by + * writing the schemata again. + * + * Return: 0 if the pseudo-locked region was successfully pseudo-locked, <0 + * on failure. Descriptive error will be written to last_cmd_status buffer. 
+ */ +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int new_minor; + struct device *dev; + int ret; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + + ret = pseudo_lock_region_alloc(plr); + if (ret < 0) + return ret; + + ret = pseudo_lock_cstates_constrain(plr); + if (ret < 0) { + ret = -EINVAL; + goto out_region; + } + + plr->thread_done = 0; + + plr->closid = rdtgrp->closid; + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, + cpu_to_node(plr->cpu), + "pseudo_lock/%u", plr->cpu); + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + rdt_last_cmd_printf("Locking thread returned error %d\n", ret); + goto out_cstates; + } + + kthread_bind(thread, plr->cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) { + /* + * If the thread does not get on the CPU for whatever + * reason and the process which sets up the region is + * interrupted then this will leave the thread in runnable + * state and once it gets on the CPU it will dereference + * the cleared, but not freed, plr struct resulting in an + * empty pseudo-locking loop. + */ + rdt_last_cmd_puts("Locking thread interrupted\n"); + goto out_cstates; + } + + ret = pseudo_lock_minor_get(&new_minor); + if (ret < 0) { + rdt_last_cmd_puts("Unable to obtain a new minor number\n"); + goto out_cstates; + } + + /* + * Unlock access but do not release the reference. The + * pseudo-locked region will still be here on return. + * + * The mutex has to be released temporarily to avoid a potential + * deadlock with the mm->mmap_lock which is obtained in the + * device_create() and debugfs_create_dir() callpath below as well as + * before the mmap() callback is called. 
+ */ + mutex_unlock(&rdtgroup_mutex); + + if (!IS_ERR_OR_NULL(debugfs_resctrl)) { + plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name, + debugfs_resctrl); + if (!IS_ERR_OR_NULL(plr->debugfs_dir)) + debugfs_create_file("pseudo_lock_measure", 0200, + plr->debugfs_dir, rdtgrp, + &pseudo_measure_fops); + } + + dev = device_create(&pseudo_lock_class, NULL, + MKDEV(pseudo_lock_major, new_minor), + rdtgrp, "%s", rdtgrp->kn->name); + + mutex_lock(&rdtgroup_mutex); + + if (IS_ERR(dev)) { + ret = PTR_ERR(dev); + rdt_last_cmd_printf("Failed to create character device: %d\n", + ret); + goto out_debugfs; + } + + /* We released the mutex - check if group was removed while we did so */ + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out_device; + } + + plr->minor = new_minor; + + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKED; + closid_free(rdtgrp->closid); + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0444); + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0444); + + ret = 0; + goto out; + +out_device: + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, new_minor)); +out_debugfs: + debugfs_remove_recursive(plr->debugfs_dir); + pseudo_lock_minor_release(new_minor); +out_cstates: + pseudo_lock_cstates_relax(plr); +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * rdtgroup_pseudo_lock_remove - Remove a pseudo-locked region + * @rdtgrp: resource group to which the pseudo-locked region belongs + * + * The removal of a pseudo-locked region can be initiated when the resource + * group is removed from user space via a "rmdir" from userspace or the + * unmount of the resctrl filesystem. On removal the resource group does + * not go back to pseudo-locksetup mode before it is removed, instead it is + * removed directly. There is thus asymmetry with the creation where the + * &struct pseudo_lock_region is removed here while it was not created in + * rdtgroup_pseudo_lock_create(). 
+ * + * Return: void + */ +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * Default group cannot be a pseudo-locked region so we can + * free closid here. + */ + closid_free(rdtgrp->closid); + goto free; + } + + pseudo_lock_cstates_relax(plr); + debugfs_remove_recursive(rdtgrp->plr->debugfs_dir); + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor)); + pseudo_lock_minor_release(plr->minor); + +free: + pseudo_lock_free(rdtgrp); +} + +static int pseudo_lock_dev_open(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = region_find_by_minor(iminor(inode)); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + filp->private_data = rdtgrp; + atomic_inc(&rdtgrp->waitcount); + /* Perform a non-seekable open - llseek is not supported */ + filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); + + mutex_unlock(&rdtgroup_mutex); + + return 0; +} + +static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + filp->private_data = NULL; + atomic_dec(&rdtgrp->waitcount); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int pseudo_lock_dev_mremap(struct vm_area_struct *area) +{ + /* Not supported */ + return -EINVAL; +} + +static const struct vm_operations_struct pseudo_mmap_ops = { + .mremap = pseudo_lock_dev_mremap, +}; + +static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long vsize = vma->vm_end - vma->vm_start; + unsigned long off = vma->vm_pgoff << PAGE_SHIFT; + struct pseudo_lock_region *plr; + struct rdtgroup *rdtgrp; + unsigned long 
physical; + unsigned long psize; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + plr = rdtgrp->plr; + + if (!plr->d) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + /* + * Task is required to run with affinity to the cpus associated + * with the pseudo-locked region. If this is not the case the task + * may be scheduled elsewhere and invalidate entries in the + * pseudo-locked region. + */ + if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + physical = __pa(plr->kmem) >> PAGE_SHIFT; + psize = plr->size - off; + + if (off > plr->size) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + /* + * Ensure changes are carried directly to the memory being mapped, + * do not allow copy-on-write mapping. + */ + if (!(vma->vm_flags & VM_SHARED)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + if (vsize > psize) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + memset(plr->kmem + off, 0, vsize); + + if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff, + vsize, vma->vm_page_prot)) { + mutex_unlock(&rdtgroup_mutex); + return -EAGAIN; + } + vma->vm_ops = &pseudo_mmap_ops; + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static const struct file_operations pseudo_lock_dev_fops = { + .owner = THIS_MODULE, + .llseek = no_llseek, + .read = NULL, + .write = NULL, + .open = pseudo_lock_dev_open, + .release = pseudo_lock_dev_release, + .mmap = pseudo_lock_dev_mmap, +}; + +int rdt_pseudo_lock_init(void) +{ + int ret; + + ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops); + if (ret < 0) + return ret; + + pseudo_lock_major = ret; + + ret = class_register(&pseudo_lock_class); + if (ret) { + unregister_chrdev(pseudo_lock_major, "pseudo_lock"); + return ret; + } + + return 0; +} + +void rdt_pseudo_lock_release(void) +{ + 
class_unregister(&pseudo_lock_class); + unregister_chrdev(pseudo_lock_major, "pseudo_lock"); + pseudo_lock_major = 0; +} diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c new file mode 100644 index 0000000000000000000000000000000000000000..bc8ade25deb7d8e97be95006a5cf73213b9d72c5 --- /dev/null +++ b/fs/resctrl/rdtgroup.c @@ -0,0 +1,4084 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * User interface for Resource Allocation in Resource Director Technology(RDT) + * + * Copyright (C) 2016 Intel Corporation + * + * Author: Fenghua Yu + * + * More information about RDT can be found in the Intel (R) x86 Architecture + * Software Developer Manual. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include "internal.h" + +/* Mutex to protect rdtgroup access. */ +DEFINE_MUTEX(rdtgroup_mutex); + +static struct kernfs_root *rdt_root; +struct rdtgroup rdtgroup_default; +LIST_HEAD(rdt_all_groups); + +/* list of entries for the schemata file */ +LIST_HEAD(resctrl_schema_all); + +/* The filesystem can only be mounted once. 
*/ +bool resctrl_mounted; + +/* Kernel fs node for "info" directory under root */ +static struct kernfs_node *kn_info; + +/* Kernel fs node for "mon_groups" directory under root */ +static struct kernfs_node *kn_mongrp; + +/* Kernel fs node for "mon_data" directory under root */ +static struct kernfs_node *kn_mondata; + +/* + * Used to store the max resource name width and max resource data width + * to display the schemata in a tabular format + */ +int max_name_width, max_data_width; + +static struct seq_buf last_cmd_status; +static char last_cmd_status_buf[512]; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx); +static void rdtgroup_destroy_root(void); + +struct dentry *debugfs_resctrl; + +static bool resctrl_debug; + +void rdt_last_cmd_clear(void) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_clear(&last_cmd_status); +} + +void rdt_last_cmd_puts(const char *s) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_puts(&last_cmd_status, s); +} + +void rdt_last_cmd_printf(const char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_vprintf(&last_cmd_status, fmt, ap); + va_end(ap); +} + +void rdt_staged_configs_clear(void) +{ + struct rdt_resource *r; + struct rdt_domain *dom; + int i; + + lockdep_assert_held(&rdtgroup_mutex); + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + list_for_each_entry(dom, &r->domains, list) + memset(dom->staged_config, 0, sizeof(dom->staged_config)); + } +} + +static bool resctrl_is_mbm_enabled(void) +{ + return (resctrl_arch_is_mbm_total_enabled() || + resctrl_arch_is_mbm_local_enabled() || + resctrl_arch_is_mbm_core_enabled()); +} + +static bool resctrl_is_mbm_event(int e) +{ + return (e == QOS_L3_MBM_TOTAL_EVENT_ID || + e == QOS_L3_MBM_LOCAL_EVENT_ID || + e == QOS_L2_MBM_CORE_EVENT_ID); +} + +/* + * Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap + * of free CLOSIDs. 
+ * + * Using a global CLOSID across all resources has some advantages and + * some drawbacks: + * + We can simply set current's closid to assign a task to a resource + * group. + * + Context switch code can avoid extra memory references deciding which + * CLOSID to load into the PQR_ASSOC MSR + * - We give up some options in configuring resource groups across multi-socket + * systems. + * - Our choices on how to configure each resource become progressively more + * limited as the number of resources grows. + */ +static unsigned long *closid_free_map; +static int closid_free_map_len; + +int closids_supported(void) +{ + return closid_free_map_len; +} + +static int closid_init(void) +{ + struct resctrl_schema *s; + u32 rdt_min_closid = ~0; + + /* Compute rdt_min_closid across all resources */ + list_for_each_entry(s, &resctrl_schema_all, list) + rdt_min_closid = min(rdt_min_closid, s->num_closid); + + if (rdt_min_closid == ~0) + return -EOPNOTSUPP; + + closid_free_map = bitmap_alloc(rdt_min_closid, GFP_KERNEL); + if (!closid_free_map) + return -ENOMEM; + + bitmap_fill(closid_free_map, rdt_min_closid); + + /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ + __clear_bit(RESCTRL_RESERVED_CLOSID, closid_free_map); + closid_free_map_len = rdt_min_closid; + + return 0; +} + +static void closid_exit(void) +{ + bitmap_free(closid_free_map); +} + +static int closid_alloc(void) +{ + int cleanest_closid; + u32 closid; + + lockdep_assert_held(&rdtgroup_mutex); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + cleanest_closid = resctrl_find_cleanest_closid(); + if (cleanest_closid < 0) + return cleanest_closid; + closid = cleanest_closid; + } else { + closid = find_first_bit(closid_free_map, closid_free_map_len); + if (closid == closid_free_map_len) + return -ENOSPC; + } + + __clear_bit(closid, closid_free_map); + + return closid; +} + +void closid_free(int closid) +{ + lockdep_assert_held(&rdtgroup_mutex); + + __set_bit(closid, closid_free_map); 
+} + +/** + * closid_allocated - test if provided closid is in use + * @closid: closid to be tested + * + * Return: true if @closid is currently associated with a resource group, + * false if @closid is free + */ +bool closid_allocated(unsigned int closid) +{ + lockdep_assert_held(&rdtgroup_mutex); + + return !test_bit(closid, closid_free_map); +} + +/** + * rdtgroup_mode_by_closid - Return mode of resource group with closid + * @closid: closid of the resource group + * + * Each resource group is associated with a @closid. Here the mode + * of a resource group can be queried by searching for it using its closid. + * + * Return: mode as &enum rdtgrp_mode of resource group with closid @closid + */ +enum rdtgrp_mode rdtgroup_mode_by_closid(int closid) +{ + struct rdtgroup *rdtgrp; + + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (rdtgrp->closid == closid) + return rdtgrp->mode; + } + + return RDT_NUM_MODES; +} + +static const char * const rdt_mode_str[] = { + [RDT_MODE_SHAREABLE] = "shareable", + [RDT_MODE_EXCLUSIVE] = "exclusive", + [RDT_MODE_PSEUDO_LOCKSETUP] = "pseudo-locksetup", + [RDT_MODE_PSEUDO_LOCKED] = "pseudo-locked", +}; + +/** + * rdtgroup_mode_str - Return the string representation of mode + * @mode: the resource group mode as &enum rdtgrp_mode + * + * Return: string representation of valid mode, "unknown" otherwise + */ +static const char *rdtgroup_mode_str(enum rdtgrp_mode mode) +{ + if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES) + return "unknown"; + + return rdt_mode_str[mode]; +} + +/* set uid and gid of rdtgroup dirs and files to that of the creator */ +static int rdtgroup_kn_set_ugid(struct kernfs_node *kn) +{ + struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, + .ia_uid = current_fsuid(), + .ia_gid = current_fsgid(), }; + + if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) && + gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID)) + return 0; + + return kernfs_setattr(kn, &iattr); +} + +static int rdtgroup_add_file(struct 
kernfs_node *parent_kn, struct rftype *rft) +{ + struct kernfs_node *kn; + int ret; + + kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, + 0, rft->kf_ops, rft, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + + return 0; +} + +static int rdtgroup_seqfile_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + struct rftype *rft = of->kn->priv; + + if (rft->seq_show) + return rft->seq_show(of, m, arg); + return 0; +} + +static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct rftype *rft = of->kn->priv; + + if (rft->write) + return rft->write(of, buf, nbytes, off); + + return -EINVAL; +} + +static const struct kernfs_ops rdtgroup_kf_single_ops = { + .atomic_write_len = PAGE_SIZE, + .write = rdtgroup_file_write, + .seq_show = rdtgroup_seqfile_show, +}; + +static const struct kernfs_ops kf_mondata_ops = { + .atomic_write_len = PAGE_SIZE, + .seq_show = rdtgroup_mondata_show, +}; + +static bool is_cpu_list(struct kernfs_open_file *of) +{ + struct rftype *rft = of->kn->priv; + + return rft->flags & RFTYPE_FLAGS_CPUS_LIST; +} + +static int rdtgroup_cpus_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + struct cpumask *mask; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + + if (rdtgrp) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + mask = &rdtgrp->plr->d->cpu_mask; + seq_printf(s, is_cpu_list(of) ? + "%*pbl\n" : "%*pb\n", + cpumask_pr_args(mask)); + } + } else { + seq_printf(s, is_cpu_list(of) ? 
"%*pbl\n" : "%*pb\n", + cpumask_pr_args(&rdtgrp->cpu_mask)); + } + } else { + ret = -ENOENT; + } + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +/* + * Update the PQR_ASSOC MSR on all cpus in @cpu_mask. + * + * Per task closids/rmids must have been set up before calling this function. + * @r may be NULL. + */ +static void +update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) +{ + struct resctrl_cpu_sync defaults; + struct resctrl_cpu_sync *defaults_p = NULL; + + if (r) { + defaults.closid = r->closid; + defaults.rmid = r->mon.rmid; + defaults_p = &defaults; + } + + on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, + 1); +} + +static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask) +{ + struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; + struct list_head *head; + + /* Check whether cpus belong to parent ctrl group */ + cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + rdt_last_cmd_puts("Can only add CPUs to mongroup that belong to parent\n"); + return -EINVAL; + } + + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Give any dropped cpus to parent rdtgroup */ + cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); + update_closid_rmid(tmpmask, prgrp); + } + + /* + * If we added cpus, remove them from previous group that owned them + * and update per-cpu rmid + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + if (crgrp == rdtgrp) + continue; + cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, + tmpmask); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); + + return 0; +} + +static void 
cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) +{ + struct rdtgroup *crgrp; + + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); + /* update the child mon group masks as well*/ + list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) + cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); +} + +static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask, cpumask_var_t tmpmask1) +{ + struct rdtgroup *r, *crgrp; + struct list_head *head; + + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Can't drop from default group */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Can't drop CPUs from default group\n"); + return -EINVAL; + } + + /* Give any dropped cpus to rdtgroup_default */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, tmpmask); + update_closid_rmid(tmpmask, &rdtgroup_default); + } + + /* + * If we added cpus, remove them from previous group and + * the prev group's child groups that owned them + * and update per-cpu closid/rmid. + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { + if (r == rdtgrp) + continue; + cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); + if (!cpumask_empty(tmpmask1)) + cpumask_rdtgrp_clear(r, tmpmask1); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); + + /* + * Clear child mon group masks since there is a new parent mask + * now and update the rmid for the cpus the child lost. 
+ */ + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); + update_closid_rmid(tmpmask, rdtgrp); + cpumask_clear(&crgrp->cpu_mask); + } + + return 0; +} + +static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + cpumask_var_t tmpmask, newmask, tmpmask1; + struct rdtgroup *rdtgrp; + int ret; + + if (!buf) + return -EINVAL; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + return -ENOMEM; + } + if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + return -ENOMEM; + } + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto unlock; + } + + rdt_last_cmd_clear(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + if (is_cpu_list(of)) + ret = cpulist_parse(buf, newmask); + else + ret = cpumask_parse(buf, newmask); + + if (ret) { + rdt_last_cmd_puts("Bad CPU list/mask\n"); + goto unlock; + } + + /* check that user didn't specify any offline cpus */ + cpumask_andnot(tmpmask, newmask, cpu_online_mask); + if (!cpumask_empty(tmpmask)) { + ret = -EINVAL; + rdt_last_cmd_puts("Can only assign online CPUs\n"); + goto unlock; + } + + if (rdtgrp->type == RDTCTRL_GROUP) + ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); + else if (rdtgrp->type == RDTMON_GROUP) + ret = cpus_mon_write(rdtgrp, newmask, tmpmask); + else + ret = -EINVAL; + +unlock: + rdtgroup_kn_unlock(of->kn); + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + free_cpumask_var(tmpmask1); + + return ret ?: nbytes; +} + +/** + * rdtgroup_remove - the helper to remove resource group safely + * @rdtgrp: resource group to remove + * + * On 
resource group creation via a mkdir, an extra kernfs_node reference is + * taken to ensure that the rdtgroup structure remains accessible for the + * rdtgroup_kn_unlock() calls where it is removed. + * + * Drop the extra reference here, then free the rdtgroup structure. + * + * Return: void + */ +static void rdtgroup_remove(struct rdtgroup *rdtgrp) +{ + kernfs_put(rdtgrp->kn); + kfree(rdtgrp); +} + +static void _update_task_closid_rmid(void *task) +{ + /* + * If the task is still current on this CPU, update PQR_ASSOC MSR. + * Otherwise, the MSR is updated when the task is scheduled in. + */ + if (task == current) + resctrl_sched_in(task); +} + +static void update_task_closid_rmid(struct task_struct *t) +{ + if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) + smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); + else + _update_task_closid_rmid(t); +} + +static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) +{ + u32 closid, rmid = rdtgrp->mon.rmid; + + if (rdtgrp->type == RDTCTRL_GROUP) + closid = rdtgrp->closid; + else if (rdtgrp->type == RDTMON_GROUP) + closid = rdtgrp->mon.parent->closid; + else + return false; + + return resctrl_arch_match_closid(tsk, closid) && + resctrl_arch_match_rmid(tsk, closid, rmid); +} + +static int __rdtgroup_move_task(struct task_struct *tsk, + struct rdtgroup *rdtgrp) +{ + /* If the task is already in rdtgrp, no need to move the task. */ + if (task_in_rdtgroup(tsk, rdtgrp)) + return 0; + + /* + * Set the task's closid/rmid before the PQR_ASSOC MSR can be + * updated by them. + * + * For ctrl_mon groups, move both closid and rmid. + * For monitor groups, can move the tasks only from + * their parent CTRL group. 
+ */ + if (rdtgrp->type == RDTMON_GROUP && + !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { + rdt_last_cmd_puts("Can't move task to different control group\n"); + return -EINVAL; + } + + if (rdtgrp->type == RDTMON_GROUP) + resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, + rdtgrp->mon.rmid); + else + resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, + rdtgrp->mon.rmid); + + /* + * Ensure the task's closid and rmid are written before determining if + * the task is current that will decide if it will be interrupted. + * This pairs with the full barrier between the rq->curr update and + * resctrl_sched_in() during context switch. + */ + smp_mb(); + + /* + * By now, the task's closid and rmid are set. If the task is current + * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource + * group go into effect. If the task is not current, the MSR will be + * updated when the task is scheduled in. + */ + update_task_closid_rmid(tsk); + + return 0; +} + +static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && + resctrl_arch_match_closid(t, r->closid)); +} + +static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && + resctrl_arch_match_rmid(t, r->mon.parent->closid, + r->mon.rmid)); +} + +/** + * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group + * @r: Resource group + * + * Return: 1 if tasks have been assigned to @r, 0 otherwise + */ +int rdtgroup_tasks_assigned(struct rdtgroup *r) +{ + struct task_struct *p, *t; + int ret = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + ret = 1; + break; + } + } + rcu_read_unlock(); + + return ret; +} + +static int rdtgroup_task_write_permission(struct task_struct *task, + struct 
kernfs_open_file *of) +{ + const struct cred *tcred = get_task_cred(task); + const struct cred *cred = current_cred(); + int ret = 0; + + /* + * Even if we're attaching all tasks in the thread group, we only + * need to check permissions on one of them. + */ + if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && + !uid_eq(cred->euid, tcred->uid) && + !uid_eq(cred->euid, tcred->suid)) { + rdt_last_cmd_printf("No permission to move task %d\n", task->pid); + ret = -EPERM; + } + + put_cred(tcred); + return ret; +} + +static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp, + struct kernfs_open_file *of) +{ + struct task_struct *tsk; + int ret; + + rcu_read_lock(); + if (pid) { + tsk = find_task_by_vpid(pid); + if (!tsk) { + rcu_read_unlock(); + rdt_last_cmd_printf("No task %d\n", pid); + return -ESRCH; + } + } else { + tsk = current; + } + + get_task_struct(tsk); + rcu_read_unlock(); + + ret = rdtgroup_task_write_permission(tsk, of); + if (!ret) + ret = __rdtgroup_move_task(tsk, rdtgrp); + + put_task_struct(tsk); + return ret; +} + +static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + char *pid_str; + int ret = 0; + pid_t pid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + while (buf && buf[0] != '\0' && buf[0] != '\n') { + pid_str = strim(strsep(&buf, ",")); + + if (kstrtoint(pid_str, 0, &pid)) { + rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); + ret = -EINVAL; + break; + } + + if (pid < 0) { + rdt_last_cmd_printf("Invalid pid %d\n", pid); + ret = -EINVAL; + break; + } + + ret = rdtgroup_move_task(pid, rdtgrp, of); + if (ret) { + rdt_last_cmd_printf("Error while processing task %d\n", pid); + 
break; + } + } + +unlock: + rdtgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + +static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s) +{ + struct task_struct *p, *t; + pid_t pid; + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + pid = task_pid_vnr(t); + if (pid) + seq_printf(s, "%d\n", pid); + } + } + rcu_read_unlock(); +} + +static int rdtgroup_tasks_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + show_rdt_tasks(rdtgrp, s); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_closid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->closid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_rmid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->mon.rmid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +#ifdef CONFIG_PROC_CPU_RESCTRL + +/* + * A task can only be part of one resctrl control group and of one monitor + * group which is associated to that control group. + * + * 1) res: + * mon: + * + * resctrl is not available. + * + * 2) res:/ + * mon: + * + * Task is part of the root resctrl control group, and it is not associated + * to any monitor group. + * + * 3) res:/ + * mon:mon0 + * + * Task is part of the root resctrl control group and monitor group mon0. + * + * 4) res:group0 + * mon: + * + * Task is part of resctrl control group group0, and it is not associated + * to any monitor group. 
+ * + * 5) res:group0 + * mon:mon1 + * + * Task is part of resctrl control group group0 and monitor group mon1. + */ +int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, + struct pid *pid, struct task_struct *tsk) +{ + struct rdtgroup *rdtg; + int ret = 0; + + mutex_lock(&rdtgroup_mutex); + + /* Return empty if resctrl has not been mounted. */ + if (!resctrl_mounted) { + seq_puts(s, "res:\nmon:\n"); + goto unlock; + } + + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) { + struct rdtgroup *crg; + + /* + * Task information is only relevant for shareable + * and exclusive groups. + */ + if (rdtg->mode != RDT_MODE_SHAREABLE && + rdtg->mode != RDT_MODE_EXCLUSIVE) + continue; + + if (!resctrl_arch_match_closid(tsk, rdtg->closid)) + continue; + + seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", + rdtg->kn->name); + seq_puts(s, "mon:"); + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, + mon.crdtgrp_list) { + if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, + crg->mon.rmid)) + continue; + seq_printf(s, "%s", crg->kn->name); + break; + } + seq_putc(s, '\n'); + goto unlock; + } + /* + * The above search should succeed. Otherwise return + * with an error. 
+ */ + ret = -ENOENT; +unlock: + mutex_unlock(&rdtgroup_mutex); + + return ret; +} +#endif + +static int rdt_last_cmd_status_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + int len; + + mutex_lock(&rdtgroup_mutex); + len = seq_buf_used(&last_cmd_status); + if (len) + seq_printf(seq, "%.*s", len, last_cmd_status_buf); + else + seq_puts(seq, "ok\n"); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int rdt_num_closids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + + seq_printf(seq, "%u\n", s->num_closid); + return 0; +} + +static int rdt_default_ctrl_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->default_ctrl); + return 0; +} + +static int rdt_min_cbm_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->cache.min_cbm_bits); + return 0; +} + +static int rdt_shareable_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->cache.shareable_bits); + return 0; +} + +/* + * rdt_bit_usage_show - Display current usage of resources + * + * A domain is a shared resource that can now be allocated differently. Here + * we display the current regions of the domain as an annotated bitmask. 
+ * For each domain of this resource its allocation bitmask + * is annotated as below to indicate the current usage of the corresponding bit: + * 0 - currently unused + * X - currently available for sharing and used by software and hardware + * H - currently used by hardware only but available for software use + * S - currently used and shareable by software only + * E - currently used exclusively by one resource group + * P - currently pseudo-locked by one resource group + */ +static int rdt_bit_usage_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + /* + * Use unsigned long even though only 32 bits are used to ensure + * test_bit() is used safely. + */ + unsigned long sw_shareable = 0, hw_shareable = 0; + unsigned long exclusive = 0, pseudo_locked = 0; + struct rdt_resource *r = s->res; + struct rdt_domain *dom; + int i, hwb, swb, excl, psl; + enum rdtgrp_mode mode; + bool sep = false; + u32 ctrl_val; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + hw_shareable = r->cache.shareable_bits; + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_putc(seq, ';'); + sw_shareable = 0; + exclusive = 0; + seq_printf(seq, "%d=", dom->id); + for (i = 0; i < closids_supported(); i++) { + if (!closid_allocated(i)) + continue; + ctrl_val = resctrl_arch_get_config(r, dom, i, + s->conf_type); + mode = rdtgroup_mode_by_closid(i); + switch (mode) { + case RDT_MODE_SHAREABLE: + sw_shareable |= ctrl_val; + break; + case RDT_MODE_EXCLUSIVE: + exclusive |= ctrl_val; + break; + case RDT_MODE_PSEUDO_LOCKSETUP: + /* + * RDT_MODE_PSEUDO_LOCKSETUP is possible + * here but not included since the CBM + * associated with this CLOSID in this mode + * is not initialized and no task or cpu can be + * assigned this CLOSID. 
+ */ + break; + case RDT_MODE_PSEUDO_LOCKED: + case RDT_NUM_MODES: + WARN(1, + "invalid mode for closid %d\n", i); + break; + } + } + for (i = r->cache.cbm_len - 1; i >= 0; i--) { + pseudo_locked = dom->plr ? dom->plr->cbm : 0; + hwb = test_bit(i, &hw_shareable); + swb = test_bit(i, &sw_shareable); + excl = test_bit(i, &exclusive); + psl = test_bit(i, &pseudo_locked); + if (hwb && swb) + seq_putc(seq, 'X'); + else if (hwb && !swb) + seq_putc(seq, 'H'); + else if (!hwb && swb) + seq_putc(seq, 'S'); + else if (excl) + seq_putc(seq, 'E'); + else if (psl) + seq_putc(seq, 'P'); + else /* Unused bits remain */ + seq_putc(seq, '0'); + } + sep = true; + } + seq_putc(seq, '\n'); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return 0; +} + +static int rdt_min_bw_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.min_bw); + return 0; +} + +static int rdt_num_rmids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + seq_printf(seq, "%d\n", r->num_rmid); + + return 0; +} + +static int rdt_mon_features_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + struct mon_evt *mevt; + + list_for_each_entry(mevt, &r->evt_list, list) { + seq_printf(seq, "%s\n", mevt->name); + if (mevt->configurable) + seq_printf(seq, "%s_config\n", mevt->name); + } + + return 0; +} + +static int rdt_bw_gran_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.bw_gran); + return 0; +} + +static int rdt_delay_linear_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + 
seq_printf(seq, "%u\n", r->membw.delay_linear); + return 0; +} + +static int max_threshold_occ_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + if (r->cache_level != 3) + return 0; + + seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); + + return 0; +} + +static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + if (r->membw.throttle_mode == THREAD_THROTTLE_PER_THREAD) + seq_puts(seq, "per-thread\n"); + else + seq_puts(seq, "max\n"); + + return 0; +} + +static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + unsigned int bytes; + int ret; + + if (r->cache_level != 3) + goto out; + + ret = kstrtouint(buf, 0, &bytes); + if (ret) + return ret; + + if (bytes > resctrl_rmid_realloc_limit) + return -EINVAL; + + resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes); + +out: + return nbytes; +} + +/* + * rdtgroup_mode_show - Display mode of this resource group + */ +static int rdtgroup_mode_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode)); + + rdtgroup_kn_unlock(of->kn); + return 0; +} + +static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) +{ + switch (my_type) { + case CDP_CODE: + return CDP_DATA; + case CDP_DATA: + return CDP_CODE; + default: + case CDP_NONE: + return CDP_NONE; + } +} + +static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, 
"%u\n", r->cache.arch_has_sparse_bitmasks); + + return 0; +} + +/** + * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other + * @r: Resource to which domain instance @d belongs. + * @d: The domain instance for which @closid is being tested. + * @cbm: Capacity bitmask being tested. + * @closid: Intended closid for @cbm. + * @type: CDP type of @r. + * @exclusive: Only check if overlaps with exclusive resource groups + * + * Checks if provided @cbm intended to be used for @closid on domain + * @d overlaps with any other closids or other hardware usage associated + * with this domain. If @exclusive is true then only overlaps with + * resource groups in exclusive mode will be considered. If @exclusive + * is false then overlaps with any resource group or hardware entities + * will be considered. + * + * @cbm is unsigned long, even if only 32 bits are used, to make the + * bitmap functions work correctly. + * + * Return: false if CBM does not overlap, true if it does. + */ +static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d, + unsigned long cbm, int closid, + enum resctrl_conf_type type, bool exclusive) +{ + enum rdtgrp_mode mode; + unsigned long ctrl_b; + int i; + + /* Check for any overlap with regions used by hardware directly */ + if (!exclusive) { + ctrl_b = r->cache.shareable_bits; + if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) + return true; + } + + /* Check for overlap with other resource groups */ + for (i = 0; i < closids_supported(); i++) { + ctrl_b = resctrl_arch_get_config(r, d, i, type); + mode = rdtgroup_mode_by_closid(i); + if (closid_allocated(i) && i != closid && + mode != RDT_MODE_PSEUDO_LOCKSETUP) { + if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) { + if (exclusive) { + if (mode == RDT_MODE_EXCLUSIVE) + return true; + continue; + } + return true; + } + } + } + + return false; +} + +/** + * rdtgroup_cbm_overlaps - Does CBM overlap with other use of hardware + * @s: Schema for the 
resource to which domain instance @d belongs. + * @d: The domain instance for which @closid is being tested. + * @cbm: Capacity bitmask being tested. + * @closid: Intended closid for @cbm. + * @exclusive: Only check if overlaps with exclusive resource groups + * + * Resources that can be allocated using a CBM can use the CBM to control + * the overlap of these allocations. rdtgroup_cbm_overlaps() is the test + * for overlap. Overlap test is not limited to the specific resource for + * which the CBM is intended though - when dealing with CDP resources that + * share the underlying hardware the overlap check should be performed on + * the CDP resource sharing the hardware also. + * + * Refer to description of __rdtgroup_cbm_overlaps() for the details of the + * overlap test. + * + * Return: true if CBM overlap detected, false if there is no overlap + */ +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, + unsigned long cbm, int closid, bool exclusive) +{ + enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); + struct rdt_resource *r = s->res; + + if (__rdtgroup_cbm_overlaps(r, d, cbm, closid, s->conf_type, + exclusive)) + return true; + + if (!resctrl_arch_get_cdp_enabled(r->rid)) + return false; + return __rdtgroup_cbm_overlaps(r, d, cbm, closid, peer_type, exclusive); +} + +/** + * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive + * @rdtgrp: Resource group identified through its closid. + * + * An exclusive resource group implies that there should be no sharing of + * its allocated resources. At the time this group is considered to be + * exclusive this test can determine if its current schemata supports this + * setting by testing for overlap with all other resource groups. + * + * Return: true if resource group can be exclusive, false if there is overlap + * with allocations of other resource groups and thus this resource group + * cannot be exclusive. 
+ */ +static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) +{ + int closid = rdtgrp->closid; + struct resctrl_schema *s; + struct rdt_resource *r; + bool has_cache = false; + struct rdt_domain *d; + u32 ctrl; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) + continue; + has_cache = true; + list_for_each_entry(d, &r->domains, list) { + ctrl = resctrl_arch_get_config(r, d, closid, + s->conf_type); + if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { + rdt_last_cmd_puts("Schemata overlaps\n"); + return false; + } + } + } + + if (!has_cache) { + rdt_last_cmd_puts("Cannot be exclusive without CAT/CDP\n"); + return false; + } + + return true; +} + +/* + * rdtgroup_mode_write - Modify the resource group's mode + */ +static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + enum rdtgrp_mode mode; + int ret = 0; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + rdt_last_cmd_clear(); + + mode = rdtgrp->mode; + + if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) || + (!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) || + (!strcmp(buf, "pseudo-locksetup") && + mode == RDT_MODE_PSEUDO_LOCKSETUP) || + (!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED)) + goto out; + + if (mode == RDT_MODE_PSEUDO_LOCKED) { + rdt_last_cmd_puts("Cannot change pseudo-locked group\n"); + ret = -EINVAL; + goto out; + } + + if (!strcmp(buf, "shareable")) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = 
RDT_MODE_SHAREABLE; + } else if (!strcmp(buf, "exclusive")) { + if (!rdtgroup_mode_test_exclusive(rdtgrp)) { + ret = -EINVAL; + goto out; + } + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_EXCLUSIVE; + } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + !strcmp(buf, "pseudo-locksetup")) { + ret = rdtgroup_locksetup_enter(rdtgrp); + if (ret) + goto out; + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP; + } else { + rdt_last_cmd_puts("Unknown or unsupported mode\n"); + ret = -EINVAL; + } + +out: + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + +/** + * rdtgroup_cbm_to_size - Translate CBM to size in bytes + * @r: RDT resource to which @d belongs. + * @d: RDT domain instance. + * @cbm: bitmask for which the size should be computed. + * + * The bitmask provided associated with the RDT domain instance @d will be + * translated into how many bytes it represents. The size in bytes is + * computed by first dividing the total cache size by the CBM length to + * determine how many bytes each bit in the bitmask represents. The result + * is multiplied by the number of bits set in the bitmask. + * + * @cbm is unsigned long, even if only 32 bits are used, to make the + * bitmap functions work correctly. 
+ */ +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, + struct rdt_domain *d, unsigned long cbm) +{ + struct cpu_cacheinfo *ci; + unsigned int size = 0; + int num_b, i; + + num_b = bitmap_weight(&cbm, r->cache.cbm_len); + ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask)); + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == r->cache_level) { + size = ci->info_list[i].size / r->cache.cbm_len * num_b; + break; + } + } + + return size; +} + +/* + * rdtgroup_size_show - Display size in bytes of allocated regions + * + * The "size" file mirrors the layout of the "schemata" file, printing the + * size in bytes of each region instead of the capacity bitmask. + */ +static int rdtgroup_size_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + enum resctrl_conf_type type; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + struct rdt_domain *d; + unsigned int size; + int ret = 0; + u32 closid; + bool sep; + u32 ctrl; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%*s:", max_name_width, + rdtgrp->plr->s->name); + size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, + rdtgrp->plr->d, + rdtgrp->plr->cbm); + seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size); + } + goto out; + } + + closid = rdtgrp->closid; + + list_for_each_entry(schema, &resctrl_schema_all, list) { + r = schema->res; + type = schema->conf_type; + sep = false; + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(d, &r->domains, list) { + if (sep) + seq_putc(s, ';'); + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + size = 0; + } else { + if (is_mba_sc(r)) + ctrl = d->mbps_val[closid]; + else + ctrl = resctrl_arch_get_config(r, d, + closid, + type); + if 
(r->rid == RDT_RESOURCE_L3 || + r->rid == RDT_RESOURCE_L2) + size = rdtgroup_cbm_to_size(r, d, ctrl); + else + size = ctrl; + } + seq_printf(s, "%d=%u", d->id, size); + sep = true; + } + seq_putc(s, '\n'); + } + +out: + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static void mondata_config_read(struct resctrl_mon_config_info *mon_info) +{ + smp_call_function_any(&mon_info->d->cpu_mask, + resctrl_arch_mon_event_config_read, mon_info, 1); +} + +static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) +{ + struct resctrl_mon_config_info mon_info = {0}; + struct rdt_domain *dom; + bool sep = false; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_puts(s, ";"); + + memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); + mon_info.r = r; + mon_info.d = dom; + mon_info.evtid = evtid; + mondata_config_read(&mon_info); + + seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); + sep = true; + } + seq_puts(s, "\n"); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return 0; +} + +static int mbm_total_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); + + return 0; +} + +static int mbm_local_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); + + return 0; +} + +static void mbm_config_write_domain(struct rdt_resource *r, + struct rdt_domain *d, u32 evtid, u32 val) +{ + struct resctrl_mon_config_info mon_info = {0}; + + /* + * Read the current config value first. If both are the same then + * no need to write it again. 
+ */ + mon_info.r = r; + mon_info.d = d; + mon_info.evtid = evtid; + mondata_config_read(&mon_info); + if (mon_info.mon_config == val) + return; + + mon_info.mon_config = val; + + /* + * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the + * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE + * are scoped at the domain level. Writing any of these MSRs + * on one CPU is observed by all the CPUs in the domain. + */ + smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, + &mon_info, 1); + if (mon_info.err) { + rdt_last_cmd_puts("Invalid event configuration\n"); + return; + } + + /* + * When an Event Configuration is changed, the bandwidth counters + * for all RMIDs and Events will be cleared by the hardware. The + * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for + * every RMID on the next read to any event for every RMID. + * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) + * cleared while it is tracked by the hardware. Clear the + * mbm_local and mbm_total counts for all the RMIDs. 
+ */ + resctrl_arch_reset_rmid_all(r, d); +} + +static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) +{ + char *dom_str = NULL, *id_str; + unsigned long dom_id, val; + struct rdt_domain *d; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + +next: + if (!tok || tok[0] == '\0') + return 0; + + /* Start processing the strings for each domain */ + dom_str = strim(strsep(&tok, ";")); + id_str = strsep(&dom_str, "="); + + if (!id_str || kstrtoul(id_str, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); + return -EINVAL; + } + + if (!dom_str || kstrtoul(dom_str, 16, &val)) { + rdt_last_cmd_puts("Non-numeric event configuration value\n"); + return -EINVAL; + } + + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { + mbm_config_write_domain(r, d, evtid, val); + goto next; + } + } + + return -EINVAL; +} + +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + 
return ret ?: nbytes; +} + +/* rdtgroup information files for one cache resource. */ +static struct rftype res_common_files[] = { + { + .name = "last_cmd_status", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_last_cmd_status_show, + .fflags = RFTYPE_TOP_INFO, + }, + { + .name = "num_closids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_closids_show, + .fflags = RFTYPE_CTRL_INFO, + }, + { + .name = "mon_features", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_features_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "num_rmids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_rmids_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "cbm_mask", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_default_ctrl_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_cbm_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_cbm_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "shareable_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_shareable_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "bit_usage", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bit_usage_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_bandwidth", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_bw_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "bandwidth_gran", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bw_gran_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "delay_linear", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_delay_linear_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + /* + * Platform specific which (if any) capabilities are 
provided by + * thread_throttle_mode. Defer "fflags" initialization to platform + * discovery. + */ + { + .name = "thread_throttle_mode", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_thread_throttle_mode_show, + }, + { + .name = "max_threshold_occupancy", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = max_threshold_occ_write, + .seq_show = max_threshold_occ_show, + .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "mbm_total_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_total_bytes_config_show, + .write = mbm_total_bytes_config_write, + }, + { + .name = "mbm_local_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_local_bytes_config_show, + .write = mbm_local_bytes_config_write, + }, + { + .name = "cpus", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "cpus_list", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_BASE, + }, + { + .name = "tasks", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_tasks_write, + .seq_show = rdtgroup_tasks_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "mon_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_rmid_show, + .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, + }, + { + .name = "schemata", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_schemata_write, + .seq_show = rdtgroup_schemata_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "mode", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_mode_write, + .seq_show = rdtgroup_mode_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "size", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = 
rdtgroup_size_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "sparse_masks", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_has_sparse_bitmasks_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "ctrl_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_closid_show, + .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, + }, + +}; + +static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) +{ + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + lockdep_assert_held(&rdtgroup_mutex); + + if (resctrl_debug) + fflags |= RFTYPE_DEBUG; + + for (rft = rfts; rft < rfts + len; rft++) { + if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { + ret = rdtgroup_add_file(kn, rft); + if (ret) + goto error; + } + } + + return 0; +error: + pr_warn("Failed to add %s, err=%d\n", rft->name, ret); + while (--rft >= rfts) { + if ((fflags & rft->fflags) == rft->fflags) + kernfs_remove_by_name(kn, rft->name); + } + return ret; +} + +static struct rftype *rdtgroup_get_rftype_by_name(const char *name) +{ + struct rftype *rfts, *rft; + int len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + return rft; + } + + return NULL; +} + +static void thread_throttle_mode_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + struct rftype *rft; + + if (!r->alloc_capable || + r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) + return; + + rft = rdtgroup_get_rftype_by_name("thread_throttle_mode"); + if (!rft) + return; + + rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; +} + +void mbm_config_rftype_init(const char *config) +{ + struct rftype *rft; + + rft = rdtgroup_get_rftype_by_name(config); + if (rft) + rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; +} + +/** + * rdtgroup_kn_mode_restrict - Restrict 
user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * + * The permissions of named resctrl file, directory, or link are modified + * to not allow read, write, or execute by any user. + * + * WARNING: This function is intended to communicate to the user that the + * resctrl file has been locked down - that it is not relevant to the + * particular state the system finds itself in. It should not be relied + * on to protect from user access because after the file's permissions + * are restricted the user can still change the permissions using chmod + * from the command line. + * + * Return: 0 on success, <0 on failure. + */ +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn; + int ret = 0; + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + iattr.ia_mode = S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode = S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode = S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +/** + * rdtgroup_kn_mode_restore - Restore user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * @mask: Mask of permissions that should be restored + * + * Restore the permissions of the named file. If @name is a directory the + * permissions of its parent will be used. + * + * Return: 0 on success, <0 on failure. 
+ */ +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn, *parent; + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + iattr.ia_mode = rft->mode & mask; + } + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + parent = kernfs_get_parent(kn); + if (parent) { + iattr.ia_mode |= parent->mode; + kernfs_put(parent); + } + iattr.ia_mode |= S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode |= S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode |= S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +static int rdtgroup_mkdir_info_resdir(void *priv, char *name, + unsigned long fflags) +{ + struct kernfs_node *kn_subdir; + int ret; + + kn_subdir = kernfs_create_dir(kn_info, name, + kn_info->mode, priv); + if (IS_ERR(kn_subdir)) + return PTR_ERR(kn_subdir); + + ret = rdtgroup_kn_set_ugid(kn_subdir); + if (ret) + return ret; + + ret = rdtgroup_add_files(kn_subdir, fflags); + if (!ret) + kernfs_activate(kn_subdir); + + return ret; +} + +static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) +{ + enum resctrl_res_level i; + struct resctrl_schema *s; + struct rdt_resource *r; + unsigned long fflags; + char name[32]; + int ret; + + /* create the directory */ + kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); + if (IS_ERR(kn_info)) + return PTR_ERR(kn_info); + + ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); + if (ret) + goto out_destroy; + + /* loop over enabled controls, these are all alloc_capable */ + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + + /* Not supported yet */ + if (r->rid > RDT_RESOURCE_SMBA) + continue; + + fflags = r->fflags | 
RFTYPE_CTRL_INFO; + ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); + if (ret) + goto out_destroy; + } + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + fflags = r->fflags | RFTYPE_MON_INFO; + sprintf(name, "%s_MON", r->name); + ret = rdtgroup_mkdir_info_resdir(r, name, fflags); + if (ret) + goto out_destroy; + } + + ret = rdtgroup_kn_set_ugid(kn_info); + if (ret) + goto out_destroy; + + kernfs_activate(kn_info); + + return 0; + +out_destroy: + kernfs_remove(kn_info); + return ret; +} + +static int +mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, + char *name, struct kernfs_node **dest_kn) +{ + struct kernfs_node *kn; + int ret; + + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + if (dest_kn) + *dest_kn = kn; + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + kernfs_activate(kn); + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +static inline bool is_mba_linear(void) +{ + return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; +} + +static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) +{ + u32 num_closid = resctrl_arch_get_num_closid(r); + int cpu = cpumask_any(&d->cpu_mask); + int i; + + d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), + GFP_KERNEL, cpu_to_node(cpu)); + if (!d->mbps_val) + return -ENOMEM; + + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + + return 0; +} + +static void mba_sc_domain_destroy(struct rdt_resource *r, + struct rdt_domain *d) +{ + kfree(d->mbps_val); + d->mbps_val = NULL; +} + +/* + * MBA software controller is supported only if + * MBM is supported and MBA is in linear scale. 
+ */ +static bool supports_mba_mbps(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + return (resctrl_arch_is_mbm_local_enabled() && + r->alloc_capable && is_mba_linear()); +} + +/* + * Enable or disable the MBA software controller + * which helps user specify bandwidth in MBps. + */ +static int set_mba_sc(bool mba_sc) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rdt_domain *d; + int i; + + if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) + return -EINVAL; + + r->membw.mba_sc = mba_sc; + + list_for_each_entry(d, &r->domains, list) { + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + } + + return 0; +} + +/* + * We don't allow rdtgroup directories to be created anywhere + * except the root directory. Thus when looking for the rdtgroup + * structure for a kernfs node we are either looking at a directory, + * in which case the rdtgroup structure is pointed at by the "priv" + * field, otherwise we have a file, and need only look to the parent + * to find the rdtgroup. + */ +static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) +{ + if (kernfs_type(kn) == KERNFS_DIR) { + /* + * All the resource directories use "kn->priv" + * to point to the "struct rdtgroup" for the + * resource. "info" and its subdirectories don't + * have rdtgroup structures, so return NULL here. 
+ */ + if (kn == kn_info || kn->parent == kn_info) + return NULL; + else + return kn->priv; + } else { + return kn->parent->priv; + } +} + +static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + atomic_inc(&rdtgrp->waitcount); + kernfs_break_active_protection(kn); +} + +static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + if (atomic_dec_and_test(&rdtgrp->waitcount) && + (rdtgrp->flags & RDT_DELETED)) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + kernfs_unbreak_active_protection(kn); + rdtgroup_remove(rdtgrp); + } else { + kernfs_unbreak_active_protection(kn); + } +} + +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return NULL; + + rdtgroup_kn_get(rdtgrp, kn); + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* Was this group deleted while we waited? 
*/ + if (rdtgrp->flags & RDT_DELETED) + return NULL; + + return rdtgrp; +} + +void rdtgroup_kn_unlock(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return; + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + rdtgroup_kn_put(rdtgrp, kn); +} + +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **mon_data_kn); + +static void rdt_disable_ctx(void) +{ + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); + set_mba_sc(false); + + resctrl_debug = false; +} + +static int rdt_enable_ctx(struct rdt_fs_context *ctx) +{ + struct rdt_resource *r; + int ret = 0; + + if (ctx->enable_cdpl2) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); + if (ret) + goto out_done; + } + + if (ctx->enable_cdpl3) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); + if (ret) + goto out_cdpl2; + } + + if (ctx->enable_mba_mbps) { + ret = set_mba_sc(true); + if (ret) + goto out_cdpl3; + } + + if (ctx->enable_debug) + resctrl_debug = true; + + r = resctrl_arch_get_resource(RDT_RESOURCE_L2); + /* Only arm64 arch hides L2 resource by default */ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && !ctx->enable_l2) + r->invisible = true; + else + r->invisible = false; + + return 0; + +out_cdpl3: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); +out_cdpl2: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); +out_done: + return ret; +} + +static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type) +{ + struct resctrl_schema *s; + const char *suffix = ""; + int ret, cl; + + s = kzalloc(sizeof(*s), GFP_KERNEL); + if (!s) + return -ENOMEM; + + s->res = r; + s->num_closid = resctrl_arch_get_num_closid(r); + if (resctrl_arch_get_cdp_enabled(r->rid)) + s->num_closid /= 2; + + s->conf_type = type; + switch (type) { + case CDP_CODE: + suffix = "CODE"; + break; + case CDP_DATA: + suffix = "DATA"; + 
break; + case CDP_NONE: + suffix = ""; + break; + } + + ret = snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix); + if (ret >= sizeof(s->name)) { + kfree(s); + return -EINVAL; + } + + cl = strlen(s->name); + + /* + * If CDP is supported by this resource, but not enabled, + * include the suffix. This ensures the tabular format of the + * schemata file does not change between mounts of the filesystem. + */ + if (r->cdp_capable && !resctrl_arch_get_cdp_enabled(r->rid)) + cl += 4; + + if (cl > max_name_width) + max_name_width = cl; + + /* + * Choose a width for the resource data based on the resource that has + * widest name and cbm. + */ + max_data_width = max(max_data_width, r->data_width); + + INIT_LIST_HEAD(&s->list); + list_add(&s->list, &resctrl_schema_all); + + return 0; +} + +static int schemata_list_create(void) +{ + enum resctrl_res_level i; + struct rdt_resource *r; + int ret = 0; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + if (resctrl_arch_get_cdp_enabled(r->rid)) { + ret = schemata_list_add(r, CDP_CODE); + if (ret) + break; + + ret = schemata_list_add(r, CDP_DATA); + } else { + ret = schemata_list_add(r, CDP_NONE); + } + + if (ret) + break; + } + + return ret; +} + +static void schemata_list_destroy(void) +{ + struct resctrl_schema *s, *tmp; + + list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) { + list_del(&s->list); + kfree(s); + } +} + +static int rdt_get_tree(struct fs_context *fc) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_fs_context *ctx = rdt_fc2context(fc); + unsigned long flags = RFTYPE_CTRL_BASE; + struct rdt_domain *dom; + int ret; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + /* + * resctrl file system can only be mounted once. 
+ */ + if (resctrl_mounted) { + ret = -EBUSY; + goto out; + } + + ret = rdtgroup_setup_root(ctx); + if (ret) + goto out; + + ret = rdt_enable_ctx(ctx); + if (ret) + goto out_root; + + ret = schemata_list_create(); + if (ret) + goto out_schemata_free; + + ret = closid_init(); + if (ret) + goto out_schemata_free; + + if (resctrl_arch_mon_capable()) + flags |= RFTYPE_MON; + + ret = rdtgroup_add_files(rdtgroup_default.kn, flags); + if (ret) + goto out_closid; + + kernfs_activate(rdtgroup_default.kn); + + ret = rdtgroup_create_info_dir(rdtgroup_default.kn); + if (ret < 0) + goto out_closid; + + if (resctrl_arch_mon_capable()) { + ret = mongroup_create_dir(rdtgroup_default.kn, + &rdtgroup_default, "mon_groups", + &kn_mongrp); + if (ret < 0) + goto out_info; + + ret = mkdir_mondata_all(rdtgroup_default.kn, + &rdtgroup_default, &kn_mondata); + if (ret < 0) + goto out_mongrp; + rdtgroup_default.mon.mon_data_kn = kn_mondata; + } + + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { + ret = rdt_pseudo_lock_init(); + if (ret) + goto out_mondata; + } + + ret = kernfs_get_tree(fc); + if (ret < 0) + goto out_psl; + + if (resctrl_arch_alloc_capable()) + resctrl_arch_enable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_enable_mon(); + + if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) + resctrl_mounted = true; + + if (resctrl_is_mbm_enabled() && resctrl_arch_would_mbm_overflow()) { + list_for_each_entry(dom, &l3->domains, list) + mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + goto out; + +out_psl: + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); +out_mondata: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mondata); +out_mongrp: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mongrp); +out_info: + kernfs_remove(kn_info); +out_closid: + closid_exit(); +out_schemata_free: + schemata_list_destroy(); + rdt_disable_ctx(); +out_root: + rdtgroup_destroy_root(); +out: + 
rdt_last_cmd_clear(); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +enum rdt_param { + Opt_cdp, + Opt_cdpl2, + Opt_mba_mbps, + Opt_debug, + Opt_l2, + nr__rdt_params +}; + +static const struct fs_parameter_spec rdt_fs_parameters[] = { + fsparam_flag("cdp", Opt_cdp), + fsparam_flag("cdpl2", Opt_cdpl2), + fsparam_flag("mba_MBps", Opt_mba_mbps), + fsparam_flag("debug", Opt_debug), + fsparam_flag("l2", Opt_l2), + {} +}; + +static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + struct fs_parse_result result; + int opt; + + opt = fs_parse(fc, rdt_fs_parameters, param, &result); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_cdp: + ctx->enable_cdpl3 = true; + return 0; + case Opt_cdpl2: + ctx->enable_cdpl2 = true; + return 0; + case Opt_mba_mbps: + if (!supports_mba_mbps()) + return -EINVAL; + ctx->enable_mba_mbps = true; + return 0; + case Opt_debug: + ctx->enable_debug = true; + return 0; + case Opt_l2: + ctx->enable_l2 = true; + return 0; + } + + return -EINVAL; +} + +static void rdt_fs_context_free(struct fs_context *fc) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + + kernfs_free_fs_context(fc); + kfree(ctx); +} + +static const struct fs_context_operations rdt_fs_context_ops = { + .free = rdt_fs_context_free, + .parse_param = rdt_parse_param, + .get_tree = rdt_get_tree, +}; + +static int rdt_init_fs_context(struct fs_context *fc) +{ + struct rdt_fs_context *ctx; + + ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; + fc->fs_private = &ctx->kfc; + fc->ops = &rdt_fs_context_ops; + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(&init_user_ns); + fc->global = true; + return 0; +} + +/* + * Move tasks from one to the other group. If @from is NULL, then all tasks + * in the systems are moved unconditionally (used for teardown). 
+ * + * If @mask is not NULL the cpus on which moved tasks are running are set + * in that mask so the update smp function call is restricted to affected + * cpus. + */ +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, + struct cpumask *mask) +{ + struct task_struct *p, *t; + + read_lock(&tasklist_lock); + for_each_process_thread(p, t) { + if (!from || is_closid_match(t, from) || + is_rmid_match(t, from)) { + resctrl_arch_set_closid_rmid(t, to->closid, + to->mon.rmid); + + /* + * Order the closid/rmid stores above before the loads + * in task_curr(). This pairs with the full barrier + * between the rq->curr update and resctrl_sched_in() + * during context switch. + */ + smp_mb(); + + /* + * If the task is on a CPU, set the CPU in the mask. + * The detection is inaccurate as tasks might move or + * schedule before the smp function call takes place. + * In such a case the function call is pointless, but + * there is no other side effect. + */ + if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) + cpumask_set_cpu(task_cpu(t), mask); + } + } + read_unlock(&tasklist_lock); +} + +static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) +{ + struct rdtgroup *sentry, *stmp; + struct list_head *head; + + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { + free_rmid(sentry->closid, sentry->mon.rmid); + list_del(&sentry->mon.crdtgrp_list); + + if (atomic_read(&sentry->waitcount) != 0) + sentry->flags = RDT_DELETED; + else + rdtgroup_remove(sentry); + } +} + +/* + * Forcibly remove all of subdirectories under root. 
+ */ +static void rmdir_all_sub(void) +{ + struct rdtgroup *rdtgrp, *tmp; + + /* Move all tasks to the default resource group */ + rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); + + list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { + /* Free any child rmids */ + free_all_child_rdtgrp(rdtgrp); + + /* Remove each rdtgroup other than root */ + if (rdtgrp == &rdtgroup_default) + continue; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + + /* + * Give any CPUs back to the default group. We cannot copy + * cpu_online_mask because a CPU might have executed the + * offline callback already, but is still marked online. + */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + kernfs_remove(rdtgrp->kn); + list_del(&rdtgrp->rdtgroup_list); + + if (atomic_read(&rdtgrp->waitcount) != 0) + rdtgrp->flags = RDT_DELETED; + else + rdtgroup_remove(rdtgrp); + } + /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ + update_closid_rmid(cpu_online_mask, &rdtgroup_default); + + kernfs_remove(kn_info); + kernfs_remove(kn_mongrp); + kernfs_remove(kn_mondata); +} + +static void rdt_kill_sb(struct super_block *sb) +{ + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_disable_ctx(); + + /* Put everything back to default values. 
*/ + resctrl_arch_reset_resources(); + + rmdir_all_sub(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); + rdtgroup_default.mode = RDT_MODE_SHAREABLE; + closid_exit(); + schemata_list_destroy(); + rdtgroup_destroy_root(); + if (resctrl_arch_alloc_capable()) + resctrl_arch_disable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_disable_mon(); + resctrl_mounted = false; + kernfs_kill_sb(sb); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +static struct file_system_type rdt_fs_type = { + .name = "resctrl", + .init_fs_context = rdt_init_fs_context, + .parameters = rdt_fs_parameters, + .kill_sb = rdt_kill_sb, +}; + +static int mon_addfile(struct kernfs_node *parent_kn, const char *name, + void *priv) +{ + struct kernfs_node *kn; + int ret = 0; + + kn = __kernfs_create_file(parent_kn, name, 0444, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, + &kf_mondata_ops, priv, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + + return ret; +} + +/* + * Remove all subdirectories of mon_data of ctrl_mon groups + * and monitor groups with given domain id. 
+ */ +static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + unsigned int dom_id) +{ + struct rdtgroup *prgrp, *crgrp; + char name[32]; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + sprintf(name, "mon_%s_%02d", r->name, dom_id); + kernfs_remove_by_name(prgrp->mon.mon_data_kn, name); + + list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) + kernfs_remove_by_name(crgrp->mon.mon_data_kn, name); + } +} + +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, + struct rdt_domain *d, + struct rdt_resource *r, struct rdtgroup *prgrp) +{ + union mon_data_bits priv; + struct kernfs_node *kn; + struct mon_evt *mevt; + struct rmid_read rr; + char name[32]; + int ret; + + sprintf(name, "mon_%s_%02d", r->name, d->id); + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + if (WARN_ON(list_empty(&r->evt_list))) { + ret = -EPERM; + goto out_destroy; + } + + priv.u.rid = r->rid; + priv.u.domid = d->id; + list_for_each_entry(mevt, &r->evt_list, list) { + priv.u.evtid = mevt->evtid; + ret = mon_addfile(kn, mevt->name, priv.priv); + if (ret) + goto out_destroy; + + if (resctrl_is_mbm_event(mevt->evtid)) + mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); + } + kernfs_activate(kn); + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/* + * Add all subdirectories of mon_data for "ctrl_mon" groups + * and "monitor" groups with given domain id. 
+ */ +static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + struct rdt_domain *d) +{ + struct kernfs_node *parent_kn; + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + parent_kn = prgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, prgrp); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + parent_kn = crgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, crgrp); + } + } +} + +static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, + struct rdt_resource *r, + struct rdtgroup *prgrp) +{ + struct rdt_domain *dom; + int ret; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(dom, &r->domains, list) { + ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); + if (ret) + return ret; + } + + return 0; +} + +/* + * This creates a directory mon_data which contains the monitored data. + * + * mon_data has one directory for each domain which are named + * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data + * with L3 domain looks as below: + * ./mon_data: + * mon_L3_00 + * mon_L3_01 + * mon_L3_02 + * ... + * + * Each domain directory has one file per event: + * ./mon_L3_00/: + * llc_occupancy + * + */ +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **dest_kn) +{ + enum resctrl_res_level i; + struct rdt_resource *r; + struct kernfs_node *kn; + int ret; + + /* + * Create the mon_data directory first. + */ + ret = mongroup_create_dir(parent_kn, prgrp, "mon_data", &kn); + if (ret) + return ret; + + if (dest_kn) + *dest_kn = kn; + + /* + * Create the subdirectories for each domain. 
Note that all events + * in a domain like L3 are grouped into a resource whose domain is L3 + */ + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + if (r->invisible) + continue; + + ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); + if (ret) + goto out_destroy; + } + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/** + * cbm_ensure_valid - Enforce validity on provided CBM + * @_val: Candidate CBM + * @r: RDT resource to which the CBM belongs + * + * The provided CBM represents all cache portions available for use. This + * may be represented by a bitmap that does not consist of contiguous ones + * and thus be an invalid CBM. + * Here the provided CBM is forced to be a valid CBM by only considering + * the first set of contiguous bits as valid and clearing all bits. + * The intention here is to provide a valid default CBM with which a new + * resource group is initialized. The user can follow this with a + * modification to the CBM if the default does not satisfy the + * requirements. + */ +static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r) +{ + unsigned int cbm_len = r->cache.cbm_len; + unsigned long first_bit, zero_bit; + unsigned long val = _val; + + if (!val) + return 0; + + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); + + /* Clear any remaining bits to ensure contiguous region */ + bitmap_clear(&val, zero_bit, cbm_len - zero_bit); + return (u32)val; +} + +/* + * Initialize cache resources per RDT domain + * + * Set the RDT domain up to start off with all usable allocations. That is, + * all shareable and unused bits. All-zero CBM is invalid. 
+ */ +static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, + u32 closid) +{ + enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); + enum resctrl_conf_type t = s->conf_type; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + u32 used_b = 0, unused_b = 0; + unsigned long tmp_cbm; + enum rdtgrp_mode mode; + u32 peer_ctl, ctrl_val; + int i; + + cfg = &d->staged_config[t]; + cfg->have_new_ctrl = false; + cfg->new_ctrl = r->cache.shareable_bits; + used_b = r->cache.shareable_bits; + for (i = 0; i < closids_supported(); i++) { + if (closid_allocated(i) && i != closid) { + mode = rdtgroup_mode_by_closid(i); + if (mode == RDT_MODE_PSEUDO_LOCKSETUP) + /* + * ctrl values for locksetup aren't relevant + * until the schemata is written, and the mode + * becomes RDT_MODE_PSEUDO_LOCKED. + */ + continue; + /* + * If CDP is active include peer domain's + * usage to ensure there is no overlap + * with an exclusive group. + */ + if (resctrl_arch_get_cdp_enabled(r->rid)) + peer_ctl = resctrl_arch_get_config(r, d, i, + peer_type); + else + peer_ctl = 0; + ctrl_val = resctrl_arch_get_config(r, d, i, + s->conf_type); + used_b |= ctrl_val | peer_ctl; + if (mode == RDT_MODE_SHAREABLE) + cfg->new_ctrl |= ctrl_val | peer_ctl; + } + } + if (d->plr && d->plr->cbm > 0) + used_b |= d->plr->cbm; + unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1); + unused_b &= BIT_MASK(r->cache.cbm_len) - 1; + cfg->new_ctrl |= unused_b; + /* + * Force the initial CBM to be valid, user can + * modify the CBM based on system availability. + */ + cfg->new_ctrl = cbm_ensure_valid(cfg->new_ctrl, r); + /* + * Assign the u32 CBM to an unsigned long to ensure that + * bitmap_weight() does not access out-of-bound memory. 
+ */ + tmp_cbm = cfg->new_ctrl; + if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id); + return -ENOSPC; + } + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * Initialize cache resources with default values. + * + * A new RDT group is being created on an allocation capable (CAT) + * supporting system. Set this group up to start off with all usable + * allocations. + * + * If there are no more shareable bits available on any domain then + * the entire allocation will fail. + */ +static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) +{ + struct rdt_domain *d; + int ret; + + list_for_each_entry(d, &s->res->domains, list) { + ret = __init_one_rdt_domain(d, s, closid); + if (ret < 0) + return ret; + } + + return 0; +} + +/* Initialize MBA resource with default values. */ +static void rdtgroup_init_res(struct rdt_resource *r, u32 closid) +{ + struct resctrl_staged_config *cfg; + struct rdt_domain *d; + + list_for_each_entry(d, &r->domains, list) { + if (is_mba_sc(r)) { + d->mbps_val[closid] = MBA_MAX_MBPS; + continue; + } + + cfg = &d->staged_config[CDP_NONE]; + cfg->new_ctrl = r->default_ctrl; + cfg->have_new_ctrl = true; + } +} + +/* Initialize the RDT group's allocations. 
*/ +static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + struct rdt_resource *r; + int ret = 0; + + rdt_staged_configs_clear(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_L2 || + r->rid == RDT_RESOURCE_L3) { + ret = rdtgroup_init_cat(s, rdtgrp->closid); + if (ret < 0) + goto out; + + } else { + rdtgroup_init_res(r, rdtgrp->closid); + if (is_mba_sc(r)) + continue; + } + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Failed to initialize allocations\n"); + goto out; + } + + } + + rdtgrp->mode = RDT_MODE_SHAREABLE; + +out: + rdt_staged_configs_clear(); + return ret; +} + +static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!resctrl_arch_mon_capable()) + return 0; + + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + + ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + return 0; +} + +static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) +{ + if (resctrl_arch_mon_capable()) + free_rmid(rgrp->closid, rgrp->mon.rmid); +} + +static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, + const char *name, umode_t mode, + enum rdt_group_type rtype, struct rdtgroup **r) +{ + struct rdtgroup *prdtgrp, *rdtgrp; + unsigned long files = 0; + struct kernfs_node *kn; + int ret; + + prdtgrp = rdtgroup_kn_lock_live(parent_kn); + if (!prdtgrp) { + ret = -ENODEV; + goto out_unlock; + } + + rdt_last_cmd_clear(); + + if (rtype == RDTMON_GROUP && + (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto out_unlock; + } + + /* allocate the rdtgroup. 
*/ + rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); + if (!rdtgrp) { + ret = -ENOSPC; + rdt_last_cmd_puts("Kernel out of memory\n"); + goto out_unlock; + } + *r = rdtgrp; + rdtgrp->mon.parent = prdtgrp; + rdtgrp->type = rtype; + INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); + + /* kernfs creates the directory for rdtgrp */ + kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); + if (IS_ERR(kn)) { + ret = PTR_ERR(kn); + rdt_last_cmd_puts("kernfs create error\n"); + goto out_free_rgrp; + } + rdtgrp->kn = kn; + + /* + * kernfs_remove() will drop the reference count on "kn" which + * will free it. But we still need it to stick around for the + * rdtgroup_kn_unlock(kn) call. Take one extra reference here, + * which will be dropped by kernfs_put() in rdtgroup_remove(). + */ + kernfs_get(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + rdt_last_cmd_puts("kernfs perm error\n"); + goto out_destroy; + } + + if (rtype == RDTCTRL_GROUP) { + files = RFTYPE_BASE | RFTYPE_CTRL; + if (resctrl_arch_mon_capable()) + files |= RFTYPE_MON; + } else { + files = RFTYPE_BASE | RFTYPE_MON; + } + + ret = rdtgroup_add_files(kn, files); + if (ret) { + rdt_last_cmd_puts("kernfs fill error\n"); + goto out_destroy; + } + + /* + * The caller unlocks the parent_kn upon success. + */ + return 0; + +out_destroy: + kernfs_put(rdtgrp->kn); + kernfs_remove(rdtgrp->kn); +out_free_rgrp: + kfree(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) +{ + kernfs_remove(rgrp->kn); + rdtgroup_remove(rgrp); +} + +/* + * Create a monitor group under "mon_groups" directory of a control + * and monitor group(ctrl_mon). This is a resource group + * to monitor a subset of tasks and cpus in its parent ctrl_mon group. 
+ */ +static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp, *prgrp; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); + if (ret) + return ret; + + prgrp = rdtgrp->mon.parent; + rdtgrp->closid = prgrp->closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) { + mkdir_rdt_prepare_clean(rdtgrp); + goto out_unlock; + } + + kernfs_activate(rdtgrp->kn); + + /* + * Add the rdtgrp to the list of rdtgrps the parent + * ctrl_mon group has to track. + */ + list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); + +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * These are rdtgroups created under the root directory. Can be used + * to allocate and monitor resources. + */ +static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp; + struct kernfs_node *kn; + u32 closid; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); + if (ret) + return ret; + + kn = rdtgrp->kn; + ret = closid_alloc(); + if (ret < 0) { + rdt_last_cmd_puts("Out of CLOSIDs\n"); + goto out_common_fail; + } + closid = ret; + ret = 0; + + rdtgrp->closid = closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) + goto out_closid_free; + + kernfs_activate(rdtgrp->kn); + + ret = rdtgroup_init_alloc(rdtgrp); + if (ret < 0) + goto out_rmid_free; + + list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); + + if (resctrl_arch_mon_capable()) { + /* + * Create an empty mon_groups directory to hold the subset + * of tasks and cpus to monitor. 
+ */ + ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + goto out_del_list; + } + } + + goto out_unlock; + +out_del_list: + list_del(&rdtgrp->rdtgroup_list); +out_rmid_free: + mkdir_rdt_prepare_rmid_free(rdtgrp); +out_closid_free: + closid_free(closid); +out_common_fail: + mkdir_rdt_prepare_clean(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * We allow creating mon groups only within a directory called "mon_groups" + * which is present in every ctrl_mon group. Check if this is a valid + * "mon_groups" directory. + * + * 1. The directory should be named "mon_groups". + * 2. The mon group itself should "not" be named "mon_groups". + * This makes sure "mon_groups" directory always has a ctrl_mon group + * as parent. + */ +static bool is_mon_groups(struct kernfs_node *kn, const char *name) +{ + return (!strcmp(kn->name, "mon_groups") && + strcmp(name, "mon_groups")); +} + +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, + umode_t mode) +{ + /* Do not accept '\n' to avoid an unparsable situation. */ + if (strchr(name, '\n')) + return -EINVAL; + + /* + * If the parent directory is the root directory and RDT + * allocation is supported, add a control and monitoring + * subdirectory + */ + if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn) + return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode); + + /* + * If RDT monitoring is supported and the parent directory is a valid + * "mon_groups" directory, add a monitoring subdirectory.
+ */ + if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name)) + return rdtgroup_mkdir_mon(parent_kn, name, mode); + + return -EPERM; +} + +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + int cpu; + + /* Give any tasks back to the parent group */ + rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); + + /* Update per cpu rmid of the moved CPUs first */ + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid, + prdtgrp->mon.rmid); + + /* + * Update the MSR on moved CPUs and CPUs which have moved + * task running on them. + */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + rdtgrp->flags = RDT_DELETED; + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + /* + * Remove the rdtgrp from the parent ctrl_mon group's list + */ + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_del(&rdtgrp->mon.crdtgrp_list); + + kernfs_remove(rdtgrp->kn); + + return 0; +} + +static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) +{ + rdtgrp->flags = RDT_DELETED; + list_del(&rdtgrp->rdtgroup_list); + + kernfs_remove(rdtgrp->kn); + return 0; +} + +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +{ + u32 closid, rmid; + int cpu; + + /* Give any tasks back to the default group */ + rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); + + /* Give any CPUs back to the default group */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + /* Update per cpu closid and rmid of the moved CPUs first */ + closid = rdtgroup_default.closid; + rmid = rdtgroup_default.mon.rmid; + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid); + + /* + * Update the MSR on moved CPUs and CPUs which have moved + * task running on them. 
+ */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + closid_free(rdtgrp->closid); + + rdtgroup_ctrl_remove(rdtgrp); + + /* + * Free all the child monitor group rmids. + */ + free_all_child_rdtgrp(rdtgrp); + + return 0; +} + +static int rdtgroup_rmdir(struct kernfs_node *kn) +{ + struct kernfs_node *parent_kn = kn->parent; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret = 0; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + + rdtgrp = rdtgroup_kn_lock_live(kn); + if (!rdtgrp) { + ret = -EPERM; + goto out; + } + + /* + * If the rdtgroup is a ctrl_mon group and parent directory + * is the root directory, remove the ctrl_mon group. + * + * If the rdtgroup is a mon group and parent directory + * is a valid "mon_groups" directory, remove the mon group. + */ + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn && + rdtgrp != &rdtgroup_default) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = rdtgroup_ctrl_remove(rdtgrp); + } else { + ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); + } + } else if (rdtgrp->type == RDTMON_GROUP && + is_mon_groups(parent_kn, kn->name)) { + ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); + } else { + ret = -EPERM; + } + +out: + rdtgroup_kn_unlock(kn); + free_cpumask_var(tmpmask); + return ret; +} + +/** + * mongrp_reparent() - replace parent CTRL_MON group of a MON group + * @rdtgrp: the MON group whose parent should be replaced + * @new_prdtgrp: replacement parent CTRL_MON group for @rdtgrp + * @cpus: cpumask provided by the caller for use during this call + * + * Replaces the parent CTRL_MON group for a MON group, resulting in all member + * tasks' CLOSID immediately changing to that of the new parent group. + * Monitoring data for the group is unaffected by this operation. 
+ */ +static void mongrp_reparent(struct rdtgroup *rdtgrp, + struct rdtgroup *new_prdtgrp, + cpumask_var_t cpus) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + + WARN_ON(rdtgrp->type != RDTMON_GROUP); + WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP); + + /* Nothing to do when simply renaming a MON group. */ + if (prdtgrp == new_prdtgrp) + return; + + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_move_tail(&rdtgrp->mon.crdtgrp_list, + &new_prdtgrp->mon.crdtgrp_list); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + rdtgrp->mon.rmid = alloc_rmid(new_prdtgrp->closid); + rdtgrp->mon.parent = new_prdtgrp; + rdtgrp->closid = new_prdtgrp->closid; + + /* Propagate updated closid to all tasks in this group. */ + rdt_move_group_tasks(rdtgrp, rdtgrp, cpus); + + update_closid_rmid(cpus, NULL); +} + +static int rdtgroup_rename(struct kernfs_node *kn, + struct kernfs_node *new_parent, const char *new_name) +{ + struct rdtgroup *new_prdtgrp; + struct rmid_entry *entry; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret; + + rdtgrp = kernfs_to_rdtgroup(kn); + new_prdtgrp = kernfs_to_rdtgroup(new_parent); + if (!rdtgrp || !new_prdtgrp) + return -ENOENT; + + /* Release both kernfs active_refs before obtaining rdtgroup mutex. */ + rdtgroup_kn_get(rdtgrp, kn); + rdtgroup_kn_get(new_prdtgrp, new_parent); + + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + /* + * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if + * either kernfs_node is a file. 
+ */ + if (kernfs_type(kn) != KERNFS_DIR || + kernfs_type(new_parent) != KERNFS_DIR) { + rdt_last_cmd_puts("Source and destination must be directories\n"); + ret = -EPERM; + goto out; + } + + if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) { + ret = -ENOENT; + goto out; + } + + if (rdtgrp->type != RDTMON_GROUP || !kn->parent || + !is_mon_groups(kn->parent, kn->name)) { + rdt_last_cmd_puts("Source must be a MON group\n"); + ret = -EPERM; + goto out; + } + + if (!is_mon_groups(new_parent, new_name)) { + rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n"); + ret = -EPERM; + goto out; + } + + /* + * If the MON group is monitoring CPUs, the CPUs must be assigned to the + * current parent CTRL_MON group and therefore cannot be assigned to + * the new parent, making the move illegal. + */ + if (!cpumask_empty(&rdtgrp->cpu_mask) && + rdtgrp->mon.parent != new_prdtgrp) { + rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n"); + ret = -EPERM; + goto out; + } + + /* + * Unlike RDT, the rmid and closid in MPAM have a hierarchical + * relationship. Therefore, first check whether there are still + * free rmids available under the target closid. + */ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) { + entry = resctrl_find_free_rmid(new_prdtgrp->closid); + if (IS_ERR(entry)) { + rdt_last_cmd_puts("Destination is out of RMIDs\n"); + ret = PTR_ERR(entry); + goto out; + } + } + + /* + * Allocate the cpumask for use in mongrp_reparent() to avoid the + * possibility of failing to allocate it after kernfs_rename() has + * succeeded. + */ + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) { + ret = -ENOMEM; + goto out; + } + + /* + * Perform all input validation and allocations needed to ensure + * mongrp_reparent() will succeed before calling kernfs_rename(), + * otherwise it would be necessary to revert this call if + * mongrp_reparent() failed.
+ */ + ret = kernfs_rename(kn, new_parent, new_name); + if (!ret) + mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask); + + free_cpumask_var(tmpmask); + +out: + mutex_unlock(&rdtgroup_mutex); + rdtgroup_kn_put(rdtgrp, kn); + rdtgroup_kn_put(new_prdtgrp, new_parent); + return ret; +} + +static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) +{ + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3)) + seq_puts(seq, ",cdp"); + + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) + seq_puts(seq, ",cdpl2"); + + if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA))) + seq_puts(seq, ",mba_MBps"); + + if (resctrl_debug) + seq_puts(seq, ",debug"); + + return 0; +} + +static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = { + .mkdir = rdtgroup_mkdir, + .rmdir = rdtgroup_rmdir, + .rename = rdtgroup_rename, + .show_options = rdtgroup_show_options, +}; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx) +{ + rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops, + KERNFS_ROOT_CREATE_DEACTIVATED | + KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK, + &rdtgroup_default); + if (IS_ERR(rdt_root)) + return PTR_ERR(rdt_root); + + ctx->kfc.root = rdt_root; + rdtgroup_default.kn = kernfs_root_to_node(rdt_root); + + return 0; +} + +static void rdtgroup_destroy_root(void) +{ + kernfs_destroy_root(rdt_root); + rdtgroup_default.kn = NULL; +} + +static void rdtgroup_setup_default(void) +{ + mutex_lock(&rdtgroup_mutex); + + rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID; + rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID; + rdtgroup_default.type = RDTCTRL_GROUP; + INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list); + + list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups); + + mutex_unlock(&rdtgroup_mutex); +} + +static void domain_destroy_mon_state(struct rdt_domain *d) +{ + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + kfree(d->mbm_local); +} + +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) +{ + 
mutex_lock(&rdtgroup_mutex); + + if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) + mba_sc_domain_destroy(r, d); + + if (!r->mon_capable) + goto out_unlock; + + /* + * If resctrl is mounted, remove all the + * per domain monitor data directories. + */ + if (resctrl_mounted && resctrl_arch_mon_capable()) + rmdir_mondata_subdir_allrdtgrp(r, d->id); + + if (resctrl_is_mbm_enabled() && resctrl_arch_would_mbm_overflow()) + cancel_delayed_work(&d->mbm_over); + if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) { + /* + * When a package is going down, forcefully + * decrement rmid->ebusy. There is no way to know + * that the L3 was flushed and hence may lead to + * incorrect counts in rare scenarios, but leaving + * the RMID as busy creates RMID leaks if the + * package never comes back. + */ + __check_limbo(d, true); + cancel_delayed_work(&d->cqm_limbo); + } + + domain_destroy_mon_state(d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); +} + +static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + size_t tsize; + + if (resctrl_arch_is_llc_occupancy_enabled()) { + d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); + if (!d->rmid_busy_llc) + return -ENOMEM; + } + if (resctrl_arch_is_mbm_total_enabled()) { + tsize = sizeof(*d->mbm_total); + d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_total) { + bitmap_free(d->rmid_busy_llc); + return -ENOMEM; + } + } + if (resctrl_arch_is_mbm_local_enabled()) { + tsize = sizeof(*d->mbm_local); + d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_local) { + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + return -ENOMEM; + } + } + if (resctrl_arch_is_mbm_core_enabled()) { + tsize = sizeof(*d->mbm_core); + d->mbm_core = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_core) { + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + kfree(d->mbm_local); + return -ENOMEM; + } + } + + 
return 0; +} + +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) +{ + int err = 0; + + mutex_lock(&rdtgroup_mutex); + + if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) { + /* RDT_RESOURCE_MBA is never mon_capable */ + err = mba_sc_domain_allocate(r, d); + goto out_unlock; + } + + if (!r->mon_capable) + goto out_unlock; + + err = domain_setup_mon_state(r, d); + if (err) + goto out_unlock; + + if (resctrl_is_mbm_enabled() && resctrl_arch_would_mbm_overflow()) { + INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); + mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + if (resctrl_arch_is_llc_occupancy_enabled()) + INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); + + /* + * If the filesystem is not mounted then only the default resource group + * exists. Creation of its directories is deferred until mount time + * by rdt_get_tree() calling mkdir_mondata_all(). + * If resctrl is mounted, add per domain monitor data directories. + */ + if (resctrl_mounted && resctrl_arch_mon_capable()) + mkdir_mondata_subdir_allrdtgrp(r, d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +void resctrl_online_cpu(unsigned int cpu) +{ + mutex_lock(&rdtgroup_mutex); + /* The CPU is set in default rdtgroup after online. 
*/ + cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); + mutex_unlock(&rdtgroup_mutex); +} + +static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) +{ + struct rdtgroup *cr; + + list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { + if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) + break; + } +} + +void resctrl_offline_cpu(unsigned int cpu) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdtgroup *rdtgrp; + struct rdt_domain *d; + + mutex_lock(&rdtgroup_mutex); + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { + clear_childcpus(rdtgrp, cpu); + break; + } + } + + if (!l3->mon_capable) + goto out_unlock; + + d = resctrl_get_domain_from_cpu(cpu, l3); + if (d) { + if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu && + resctrl_arch_would_mbm_overflow()) { + cancel_delayed_work(&d->mbm_over); + mbm_setup_overflow_handler(d, 0, cpu); + } + if (resctrl_arch_is_llc_occupancy_enabled() && + cpu == d->cqm_work_cpu && has_busy_rmid(d)) { + cancel_delayed_work(&d->cqm_limbo); + cqm_setup_limbo_handler(d, 0, cpu); + } + } + +out_unlock: + mutex_unlock(&rdtgroup_mutex); +} + +/* + * resctrl_init - resctrl filesystem initialization + * + * Setup resctrl file system including set up root, create mount point, + * register resctrl filesystem, and initialize files under root directory. 
+ * + * Return: 0 on success or -errno + */ +int resctrl_init(void) +{ + int ret = 0; + + seq_buf_init(&last_cmd_status, last_cmd_status_buf, + sizeof(last_cmd_status_buf)); + + rdtgroup_setup_default(); + + thread_throttle_mode_init(); + + ret = resctrl_mon_resource_init(); + if (ret) + return ret; + + ret = sysfs_create_mount_point(fs_kobj, "resctrl"); + if (ret) + return ret; + + ret = register_filesystem(&rdt_fs_type); + if (ret) + goto cleanup_mountpoint; + + /* + * Adding the resctrl debugfs directory here may not be ideal since + * it would let the resctrl debugfs directory appear on the debugfs + * filesystem before the resctrl filesystem is mounted. + * It may also be ok since that would enable debugging of RDT before + * resctrl is mounted. + * The reason why the debugfs directory is created here and not in + * rdt_get_tree() is because rdt_get_tree() takes rdtgroup_mutex and + * during the debugfs directory creation also &sb->s_type->i_mutex_key + * (the lockdep class of inode->i_rwsem). Other filesystem + * interactions (eg. SyS_getdents) have the lock ordering: + * &sb->s_type->i_mutex_key --> &mm->mmap_lock + * During mmap(), called with &mm->mmap_lock, the rdtgroup_mutex + * is taken, thus creating dependency: + * &mm->mmap_lock --> rdtgroup_mutex for the latter that can cause + * issues considering the other two lock dependencies. + * By creating the debugfs directory here we avoid a dependency + * that may cause deadlock (even though file operations cannot + * occur until the filesystem is mounted, but I do not know how to + * tell lockdep that). 
+ */ + debugfs_resctrl = debugfs_create_dir("resctrl", NULL); + + return 0; + +cleanup_mountpoint: + sysfs_remove_mount_point(fs_kobj, "resctrl"); + + return ret; +} + +void resctrl_exit(void) +{ + debugfs_remove_recursive(debugfs_resctrl); + unregister_filesystem(&rdt_fs_type); + sysfs_remove_mount_point(fs_kobj, "resctrl"); + + resctrl_mon_resource_exit(); +} diff --git a/include/linux/acpi.h b/include/linux/acpi.h index faf84970c6395edfbc42b6ed588f5f03cfae9fe3..ba22f31ff4a38801e3ad8857c871ecd0fc0635d8 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1505,6 +1505,9 @@ int find_acpi_cpu_topology(unsigned int cpu, int level); int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); +int find_acpi_cache_level_from_id(u32 cache_id); +int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus); +int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1526,6 +1529,20 @@ static inline int find_acpi_cpu_topology_hetero_id(unsigned int cpu) { return -EINVAL; } +static inline int find_acpi_cache_level_from_id(u32 cache_id) +{ + return -EINVAL; +} +static inline int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, + cpumask_t *cpus) +{ + return -EINVAL; +} +static inline int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, + cpumask_t *cpus) +{ + return -EINVAL; +} #endif #ifdef CONFIG_ARM64 diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h new file mode 100644 index 0000000000000000000000000000000000000000..0c2a2515ff871f02a8517b5051b2e256184babb6 --- /dev/null +++ b/include/linux/arm_mpam.h @@ -0,0 +1,109 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. 
*/ + +#ifndef __LINUX_ARM_MPAM_H +#define __LINUX_ARM_MPAM_H + +#include +#include +#include + +/* + * The value of the MPAM1_EL1 sysreg when a task is in the default group. + * This is used by the context switch code to use the resctrl CPU property + * instead. The value is modified when CDP is enabled/disabled by mounting + * the resctrl filesystem. + */ +extern u64 mpam_resctrl_default_group; + +#include + +struct mpam_msc; + +enum mpam_msc_iface { + MPAM_IFACE_MMIO, /* a real MPAM MSC */ + MPAM_IFACE_PCC, /* a fake MPAM MSC */ +}; + +enum mpam_class_types { + MPAM_CLASS_CACHE, /* Well known caches, e.g. L2 */ + MPAM_CLASS_MEMORY, /* Main memory */ + MPAM_CLASS_UNKNOWN, /* Everything else, e.g. SMMU */ +}; + +#ifdef CONFIG_ACPI_MPAM +/* Parse the ACPI description of resources entries for this MSC. */ +int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc); +int acpi_mpam_count_msc(void); +#else +static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc) +{ + return -EINVAL; +} +static inline int acpi_mpam_count_msc(void) { return -EINVAL; } +#endif + +int mpam_register_requestor(u16 partid_max, u8 pmg_max); + +int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id); + +static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) +{ + return val; +} + +/* MPAM counters require a monitor to be allocated */ +static inline bool resctrl_arch_event_is_free_running(enum resctrl_event_id evt) +{ + return false; +} + +bool resctrl_arch_alloc_capable(void); +bool resctrl_arch_mon_capable(void); +bool resctrl_arch_is_llc_occupancy_enabled(void); +bool resctrl_arch_is_l2c_occupancy_enabled(void); +bool resctrl_arch_is_mbm_core_enabled(void); +bool resctrl_arch_is_mbm_local_enabled(void); +bool resctrl_arch_is_mbm_total_enabled(void); +bool resctrl_arch_would_mbm_overflow(void); + +/* reset cached configurations, then all
devices */ +void resctrl_arch_reset_resources(void); + +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored); +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable); +bool resctrl_arch_hide_cdp(enum resctrl_res_level rid); +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg); +void resctrl_sched_in(struct task_struct *tsk); +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); +u32 resctrl_arch_system_num_rmid_idx(void); + +struct rdt_resource; +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid); +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx); + +/* Pseudo lock is not supported by MPAM */ +static inline int resctrl_arch_pseudo_lock_fn(void *_plr) { return 0; } +static inline int resctrl_arch_measure_l2_residency(void *_plr) { return 0; } +static inline int resctrl_arch_measure_l3_residency(void *_plr) { return 0; } +static inline int resctrl_arch_measure_cycles_lat_fn(void *_plr) { return 0; } +static inline u64 resctrl_arch_get_prefetch_disable_bits(void) { return 0; } + +/* + * The CPU configuration for MPAM is cheap to write, and is only written if it + * has changed. No need for fine grained enables. 
+ */ +static inline void resctrl_arch_enable_mon(void) { } +static inline void resctrl_arch_disable_mon(void) { } +static inline void resctrl_arch_enable_alloc(void) { } +static inline void resctrl_arch_disable_alloc(void) { } + +#endif /* __LINUX_ARM_MPAM_H */ diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index ef701ba043e9407b82d54152e49bf36b9f516ec9..a5716e29eb2079fcd67a3846691ba5432bc801c9 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -48,7 +48,7 @@ extern unsigned int coherency_max_size; * keeping, the remaining members form the core properties of the cache */ struct cacheinfo { - unsigned int id; + unsigned long id; enum cache_type type; unsigned int level; unsigned int coherency_line_size; @@ -112,6 +112,7 @@ int acpi_get_cache_info(unsigned int cpu, #endif const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf); +unsigned long cache_of_get_id(struct device_node *np); /* * Get the cacheinfo structure for the cache associated with @cpu at @@ -140,11 +141,31 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level) * Get the id of the cache associated with @cpu at level @level. * cpuhp lock must be held. */ -static inline int get_cpu_cacheinfo_id(int cpu, int level) +static inline unsigned long get_cpu_cacheinfo_id(int cpu, int level) { struct cacheinfo *ci = get_cpu_cacheinfo_level(cpu, level); - return ci ? ci->id : -1; + return ci ? ci->id : ~0UL; +} + +/* + * Get the size of the cache associated with @cpu at level @level. + * cpuhp lock must be held. 
+ */ +static inline unsigned int get_cpu_cacheinfo_size(int cpu, int level) +{ + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu); + int i; + + if (!ci->info_list) + return 0; + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == level) + return ci->info_list[i].size; + } + + return 0; } #ifdef CONFIG_ARM64 diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index d94abba1c716e93b709d22a81d9e30242a0c0675..d86f0b1d3f605c4cc401adf22080beba29205620 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -2,6 +2,12 @@ #ifndef _RESCTRL_H #define _RESCTRL_H +#ifdef CONFIG_ARM_CPU_RESCTRL + +#include + +#else + #include #include #include @@ -343,4 +349,5 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain * extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit; +#endif /* ARM_CPU_RESCTRL */ #endif /* _RESCTRL_H */ diff --git a/include/linux/resctrl_mpam.h b/include/linux/resctrl_mpam.h new file mode 100644 index 0000000000000000000000000000000000000000..9bd28c63068cff5c855a35942d2dad461b10bdb5 --- /dev/null +++ b/include/linux/resctrl_mpam.h @@ -0,0 +1,452 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _RESCTRL_MPAM_H +#define _RESCTRL_MPAM_H + +#include +#include +#include +#include +#include + +#ifdef CONFIG_ARCH_HAS_CPU_RESCTRL +#include +#endif + +/* CLOSID, RMID value used by the default control group */ +#define RESCTRL_RESERVED_CLOSID 0 +#define RESCTRL_RESERVED_RMID 0 + +#define RESCTRL_PICK_ANY_CPU -1 + +#ifdef CONFIG_PROC_CPU_RESCTRL + +int proc_resctrl_show(struct seq_file *m, + struct pid_namespace *ns, + struct pid *pid, + struct task_struct *tsk); + +#endif + +/* max value for struct rdt_domain's mbps_val */ +#define MBA_MAX_MBPS U32_MAX + +/* + * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is + * 32. 
+ */ +#define RESCTRL_MAX_CBM 32 + +extern unsigned int resctrl_rmid_realloc_limit; +extern unsigned int resctrl_rmid_realloc_threshold; + +/** + * struct pseudo_lock_region - pseudo-lock region information + * @s: Resctrl schema for the resource to which this + * pseudo-locked region belongs + * @closid: The closid that this pseudo-locked region uses + * @d: RDT domain to which this pseudo-locked region + * belongs + * @cbm: bitmask of the pseudo-locked region + * @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread + * completion + * @thread_done: variable used by waitqueue to test if pseudo-locking + * thread completed + * @cpu: core associated with the cache on which the setup code + * will be run + * @line_size: size of the cache lines + * @size: size of pseudo-locked region in bytes + * @kmem: the kernel memory associated with pseudo-locked region + * @minor: minor number of character device associated with this + * region + * @debugfs_dir: pointer to this region's directory in the debugfs + * filesystem + * @pm_reqs: Power management QoS requests related to this region + */ +struct pseudo_lock_region { + struct resctrl_schema *s; + u32 closid; + struct rdt_domain *d; + u32 cbm; + wait_queue_head_t lock_thread_wq; + int thread_done; + int cpu; + unsigned int line_size; + unsigned int size; + void *kmem; + unsigned int minor; + struct dentry *debugfs_dir; + struct list_head pm_reqs; +}; + +/** + * struct resctrl_staged_config - parsed configuration to be applied + * @new_ctrl: new ctrl value to be loaded + * @have_new_ctrl: whether the user provided new_ctrl is valid + */ +struct resctrl_staged_config { + u32 new_ctrl; + bool have_new_ctrl; +}; + +/** + * struct rdt_domain - group of CPUs sharing a resctrl resource + * @list: all instances of this resource + * @id: unique id for this instance + * @cpu_mask: which CPUs share this resource + * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold + * @mbm_total: saved state for MBM 
total bandwidth + * @mbm_local: saved state for MBM local bandwidth + * @mbm_over: worker to periodically read MBM h/w counters + * @cqm_limbo: worker to periodically read CQM h/w counters + * @mbm_work_cpu: worker CPU for MBM h/w counters + * @cqm_work_cpu: worker CPU for CQM h/w counters + * @plr: pseudo-locked region (if any) associated with domain + * @staged_config: parsed configuration to be applied + * @mbps_val: When mba_sc is enabled, this holds the array of user + * specified control values for mba_sc in MBps, indexed + * by closid + */ +struct rdt_domain { + struct list_head list; + int id; + struct cpumask cpu_mask; + unsigned long *rmid_busy_llc; + struct mbm_state *mbm_total; + struct mbm_state *mbm_local; + struct mbm_state *mbm_core; + struct delayed_work mbm_over; + struct delayed_work cqm_limbo; + int mbm_work_cpu; + int cqm_work_cpu; + struct pseudo_lock_region *plr; + struct resctrl_staged_config staged_config[CDP_NUM_TYPES]; + u32 *mbps_val; +}; + +/** + * struct resctrl_cache - Cache allocation related data + * @cbm_len: Length of the cache bit mask + * @min_cbm_bits: Minimum number of consecutive bits to be set. + * The value 0 means the architecture can support + * zero CBM. + * @shareable_bits: Bitmask of shareable resource with other + * executing entities + * @arch_has_sparse_bitmasks: True if a bitmask like f00f is valid. + * @arch_has_per_cpu_cfg: True if QOS_CFG register for this cache + * level has CPU scope. + * @intpri_wd: Number of implemented bits in the priority + * partition. 
+ */ +struct resctrl_cache { + unsigned int cbm_len; + unsigned int min_cbm_bits; + unsigned int shareable_bits; + bool arch_has_sparse_bitmasks; + bool arch_has_per_cpu_cfg; + unsigned int intpri_wd; +}; + +/** + * enum membw_throttle_mode - System's memory bandwidth throttling mode + * @THREAD_THROTTLE_UNDEFINED: Not relevant to the system + * @THREAD_THROTTLE_MAX: Memory bandwidth is throttled at the core + * always using smallest bandwidth percentage + * assigned to threads, aka "max throttling" + * @THREAD_THROTTLE_PER_THREAD: Memory bandwidth is throttled at the thread + */ +enum membw_throttle_mode { + THREAD_THROTTLE_UNDEFINED = 0, + THREAD_THROTTLE_MAX, + THREAD_THROTTLE_PER_THREAD, +}; + +/** + * struct resctrl_membw - Memory bandwidth allocation related data + * @min_bw: Minimum memory bandwidth percentage user can request + * @max_bw: Maximum memory bandwidth value, used as the reset value + * @bw_gran: Granularity at which the memory bandwidth is allocated + * @delay_linear: True if memory B/W delay is in linear scale + * @arch_needs_linear: True if we can't configure non-linear resources + * @throttle_mode: Bandwidth throttling mode when threads request + * different memory bandwidths + * @mba_sc: True if MBA software controller(mba_sc) is enabled + * @mb_map: Mapping of memory B/W percentage to memory B/W delay + * @intpri_wd: Number of implemented bits in the priority + * partition. + */ +struct resctrl_membw { + u32 min_bw; + u32 max_bw; + u32 bw_gran; + u32 delay_linear; + bool arch_needs_linear; + enum membw_throttle_mode throttle_mode; + bool mba_sc; + u32 *mb_map; + u32 intpri_wd; +}; + +/** + * enum resctrl_schema_fmt - The format user-space provides for a schema. + * @RESCTRL_SCHEMA_BITMAP: The schema is a bitmap in hex. + * @RESCTRL_SCHEMA_RANGE: The schema is a decimal number. 
+ */ +enum resctrl_schema_fmt { + RESCTRL_SCHEMA_BITMAP, + RESCTRL_SCHEMA_RANGE, +}; + +/** + * struct rdt_resource - attributes of a resctrl resource + * @rid: The index of the resource + * @alloc_capable: Is allocation available on this machine + * @mon_capable: Is monitor feature available on this machine + * @num_rmid: Number of RMIDs available + * @cache_level: Which cache level defines scope of this resource + * @cache: Cache allocation related data + * @membw: If the component has bandwidth controls, their properties. + * @domains: RCU list of all domains for this resource + * @name: Name to use in "schemata" file. + * @data_width: Character width of data when displaying + * @default_ctrl: Specifies default cache cbm or memory B/W percent. + * @format_str: Per resource format string to show domain value + * @evt_list: List of monitoring events + * @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth + * monitoring events can be configured. + * @fflags: flags to choose base and info files + * @cdp_capable: Is the CDP feature available on this resource + */ +struct rdt_resource { + int rid; + bool alloc_capable; + bool mon_capable; + bool invisible; + bool is_volatile; + int num_rmid; + int cache_level; + struct resctrl_cache cache; + struct resctrl_membw membw; + struct list_head domains; + char *name; + int data_width; + u32 default_ctrl; + const char *format_str; + struct list_head evt_list; + unsigned int mbm_cfg_mask; + unsigned long fflags; + bool cdp_capable; + enum resctrl_schema_fmt schema_fmt; +}; + +/* + * Get the resource that exists at this level. If the level is not supported + * a dummy/not-capable resource can be returned. Levels >= RDT_NUM_RESOURCES + * will return NULL. + */ +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l); + +/** + * struct resctrl_schema - configuration abilities of a resource presented to + * user-space + * @list: Member of resctrl_schema_all. 
+ * @name: The name to use in the "schemata" file. + * @conf_type: Whether this schema is specific to code/data. + * @res: The resource structure exported by the architecture to describe + * the hardware that is configured by this schema. + * @num_closid: The number of closid that can be used with this schema. When + * features like CDP are enabled, this will be lower than the + * hardware supports for the resource. + */ +struct resctrl_schema { + struct list_head list; + char name[16]; + enum resctrl_conf_type conf_type; + struct rdt_resource *res; + u32 num_closid; +}; + +struct resctrl_cpu_sync { + u32 closid; + u32 rmid; +}; + +struct resctrl_mon_config_info { + struct rdt_resource *r; + struct rdt_domain *d; + u32 evtid; + u32 mon_config; + + int err; +}; + +/** + * struct mon_evt - Entry in the event list of a resource + * @evtid: event id + * @name: name of the event + * @configurable: true if the event is configurable + * @list: entry in &rdt_resource->evt_list + */ +struct mon_evt { + enum resctrl_event_id evtid; + char *name; + bool configurable; + struct list_head list; +}; + +/* + * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to + * struct resctrl_cpu_sync, or NULL. 
+ */ +void resctrl_arch_sync_cpu_defaults(void *info); + +/* The number of closid supported by this resource regardless of CDP */ +u32 resctrl_arch_get_num_closid(struct rdt_resource *r); + +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); +int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid); + +bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); +void resctrl_arch_mon_event_config_write(void *info); +void resctrl_arch_mon_event_config_read(void *info); + +/* For use by arch code that needs to remap resctrl's smaller CDP closid */ +static inline u32 resctrl_get_config_index(u32 closid, + enum resctrl_conf_type type) +{ + switch (type) { + default: + case CDP_NONE: + return closid; + case CDP_CODE: + return (closid * 2) + 1; + case CDP_DATA: + return (closid * 2); + } +} + +/* + * Caller must be in a RCU read-side critical section, or hold the + * cpuhp read lock to prevent the struct rdt_domain being freed. + */ +static inline struct rdt_domain * +resctrl_get_domain_from_cpu(int cpu, struct rdt_resource *r) +{ + struct rdt_domain *d; + + /* + * Walking r->domains, ensure it can't race with cpuhp. + * Because this is called via IPI by rdt_ctrl_update(), assertions + * about locks this thread holds will lead to false positives. Check + * someone is holding the CPUs lock. + */ + if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP)) + lockdep_is_cpus_held(); + + list_for_each_entry_rcu(d, &r->domains, list) { + /* Find the domain that contains this CPU */ + if (cpumask_test_cpu(cpu, &d->cpu_mask)) + return d; + } + + return NULL; +} + +/* + * Update the ctrl_val and apply this config right now. + * Must be called on one of the domain's CPUs. 
+ */ +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type t, u32 cfg_val); + +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type type); +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d); +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); +void resctrl_online_cpu(unsigned int cpu); +void resctrl_offline_cpu(unsigned int cpu); + +/** + * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid + * for this resource and domain. + * @r: resource that the counter should be read from. + * @d: domain that the counter should be read from. + * @closid: closid that matches the rmid. Depending on the architecture, the + * counter may match traffic of both @closid and @rmid, or @rmid + * only. + * @rmid: rmid of the counter to read. + * @eventid: eventid to read, e.g. L3 occupancy. + * @val: result of the counter read in bytes. + * @arch_mon_ctx: An architecture specific value from + * resctrl_arch_mon_ctx_alloc(), for MPAM this identifies + * the hardware monitor allocated for this read request. + * + * Some architectures need to sleep when first programming some of the counters. + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready' + * for a short period of time). Call from a non-migrateable process context on + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or + * schedule_work_on(). This function can be called with interrupts masked, + * e.g. using smp_call_function_any(), but may consistently return an error. + * + * Return: + * 0 on success, or -EIO, -EINVAL etc on error. 
+ */ +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid, + u64 *val, void *arch_mon_ctx); + +/** + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts + * + * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when + * resctrl_arch_rmid_read() is called with preemption disabled. + * + * The contract with resctrl_arch_rmid_read() is that if interrupts + * are unmasked, it can sleep. This allows NOHZ_FULL systems to use an + * IPI, (and fail if the call needed to sleep), while most of the time + * the work is scheduled, allowing the call to sleep. + */ +static inline void resctrl_arch_rmid_read_context_check(void) +{ + if (!irqs_disabled()) + might_sleep(); +} + +/** + * resctrl_arch_reset_rmid() - Reset any private state associated with rmid + * and eventid. + * @r: The domain's resource. + * @d: The rmid's domain. + * @closid: closid that matches the rmid. Depending on the architecture, the + * counter may match traffic of both @closid and @rmid, or @rmid only. + * @rmid: The rmid whose counter values should be reset. + * @eventid: The eventid whose counter values should be reset. + * + * This can be called from any CPU. + */ +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, + enum resctrl_event_id eventid); + +/** + * resctrl_arch_reset_rmid_all() - Reset all private state associated with + * all rmids and eventids. + * @r: The resctrl resource. + * @d: The domain for which all architectural counter state will + * be cleared. + * + * This can be called from any CPU. 
+ */ +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d); + +extern unsigned int resctrl_rmid_realloc_threshold; +extern unsigned int resctrl_rmid_realloc_limit; + +extern bool resctrl_mounted; + +int resctrl_init(void); +void resctrl_exit(void); + +int resctrl_arch_mon_resource_init(void); +void mbm_config_rftype_init(const char *config); + +#endif /* _RESCTRL_MPAM_H */ diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h new file mode 100644 index 0000000000000000000000000000000000000000..3d56c35877ea232a404207a38402c34075213af8 --- /dev/null +++ b/include/linux/resctrl_types.h @@ -0,0 +1,119 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2023 Arm Ltd. + * Based on arch/x86/kernel/cpu/resctrl/internal.h + */ + +#ifndef __LINUX_RESCTRL_TYPES_H +#define __LINUX_RESCTRL_TYPES_H + +#define CQM_LIMBOCHECK_INTERVAL 1000 + +#define MBM_CNTR_WIDTH_BASE 24 +#define MBM_OVERFLOW_INTERVAL 1000 +#define MAX_MBA_BW 100u +#define MBA_IS_LINEAR 0x4 + +/* rdtgroup.flags */ +#define RDT_DELETED 1 + +/* rftype.flags */ +#define RFTYPE_FLAGS_CPUS_LIST 1 + +/* + * Define the file type flags for base and info directories. 
+ */ +#define RFTYPE_INFO BIT(0) +#define RFTYPE_BASE BIT(1) +#define RFTYPE_CTRL BIT(4) +#define RFTYPE_MON BIT(5) +#define RFTYPE_TOP BIT(6) +#define RFTYPE_RES_CACHE BIT(8) +#define RFTYPE_RES_MB BIT(9) +#define RFTYPE_DEBUG BIT(10) +#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) +#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) +#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) +#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) +#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON) + +/* Reads to Local DRAM Memory */ +#define READS_TO_LOCAL_MEM BIT(0) + +/* Reads to Remote DRAM Memory */ +#define READS_TO_REMOTE_MEM BIT(1) + +/* Non-Temporal Writes to Local Memory */ +#define NON_TEMP_WRITE_TO_LOCAL_MEM BIT(2) + +/* Non-Temporal Writes to Remote Memory */ +#define NON_TEMP_WRITE_TO_REMOTE_MEM BIT(3) + +/* Reads to Local Memory the system identifies as "Slow Memory" */ +#define READS_TO_LOCAL_S_MEM BIT(4) + +/* Reads to Remote Memory the system identifies as "Slow Memory" */ +#define READS_TO_REMOTE_S_MEM BIT(5) + +/* Dirty Victims to All Types of Memory */ +#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6) + +/* Max event bits supported */ +#define MAX_EVT_CONFIG_BITS GENMASK(6, 0) + +/** + * enum resctrl_conf_type - The type of configuration. + * @CDP_NONE: No prioritisation, both code and data are controlled or monitored. + * @CDP_CODE: Configuration applies to instruction fetches. + * @CDP_DATA: Configuration applies to reads and writes. 
+ */ +enum resctrl_conf_type { + CDP_NONE, + CDP_CODE, + CDP_DATA, +}; + +enum resctrl_res_level { + RDT_RESOURCE_L3, + RDT_RESOURCE_L2, + RDT_RESOURCE_MBA, + RDT_RESOURCE_SMBA, +#ifdef CONFIG_ARM64_MPAM + RDT_RESOURCE_L3_MAX, + RDT_RESOURCE_L2_MAX, + RDT_RESOURCE_L3_MIN, + RDT_RESOURCE_L2_MIN, + RDT_RESOURCE_MB_MIN, + RDT_RESOURCE_L3_PRI, + RDT_RESOURCE_L2_PRI, + RDT_RESOURCE_MB_PRI, + RDT_RESOURCE_MB_HDL, +#endif + + /* Must be the last */ + RDT_NUM_RESOURCES, +}; + +#define CDP_NUM_TYPES (CDP_DATA + 1) + +/* + * Event IDs, the values match those used to program IA32_QM_EVTSEL before + * reading IA32_QM_CTR on RDT systems. + * + * Monitor Event IDs, representing a variety of monitoring events: + * QOS_L3_OCCUP_EVENT_ID: L3 Cache Occupancy statistics event + * QOS_L3_MBM_TOTAL_EVENT_ID: Global Memory Bandwidth statistics event + * QOS_L3_MBM_LOCAL_EVENT_ID: L3 Cache Bandwidth statistics event + * QOS_L2_OCCUP_EVENT_ID: L2 Cache Occupancy statistics event + * QOS_L2_MBM_CORE_EVENT_ID: L2 Cache Bandwidth statistics event + */ +enum resctrl_event_id { + QOS_L3_OCCUP_EVENT_ID = 0x01, + QOS_L3_MBM_TOTAL_EVENT_ID = 0x02, + QOS_L3_MBM_LOCAL_EVENT_ID = 0x03, + + QOS_L2_OCCUP_EVENT_ID, + QOS_L2_MBM_CORE_EVENT_ID, +}; + +#endif /* __LINUX_RESCTRL_TYPES_H */