# agentos_research

**Repository Path**: stkid/agentos_research

## Basic Information

- **Project Name**: agentos_research
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-29
- **Last Updated**: 2026-04-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Performance Optimization for Colocated Mixed Workloads on CPU

> Systematic Performance Optimization for Colocated Traditional and AI Agent Workloads on CPU

---

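## Project Overview

This project studies the performance interference that arises when traditional workloads (CPU-bound, memory-bound, I/O-bound) are colocated with AI agent inference workloads on CPUs. It develops a systematic method for quantifying that interference, together with strategies for mitigating it.

### Research Questions

- **RQ1**: How can interference be quantified when traditional and AI agent workloads are colocated?
- **RQ2**: Does interference sensitivity differ significantly across workload combinations?
- **RQ3**: How effectively do resource isolation strategies mitigate interference?
- **RQ4**: Can predictions based on workload characteristics guide colocation decisions?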

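### Key Findings (Expected)

| Hypothesis | Expected result | Statistical evidence |
|------|---------|---------|
| Memory+AI interference > CPU+AI interference | Cohen's d > 0.8 | Paired t-test |
| Cache partitioning mitigates interference by > 30% | CIS reduction > 40% | p < 0.001 |
| Burstiness correlates positively with interference sensitivity | r > 0.6 | Pearson correlation |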
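The statistics named in the table can be sketched with the standard library alone. The snippet below is a minimal illustration, not the project's analysis code: the function names are hypothetical, and any sample values passed in are the caller's own. It computes Cohen's d for paired samples, the paired t statistic, and Pearson's r; turning the t statistic into a p-value would additionally require a t-distribution CDF (for example from SciPy).

```python
import math
import statistics


def _paired_diffs(a, b):
    """Element-wise differences of two equally long paired samples."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    return [x - y for x, y in zip(a, b)]


def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference / std dev of differences."""
    diffs = _paired_diffs(a, b)
    return statistics.mean(diffs) / statistics.stdev(diffs)


def paired_t_statistic(a, b):
    """t statistic of a paired t-test (degrees of freedom = n - 1)."""
    diffs = _paired_diffs(a, b)
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

With 26 repetitions per configuration, the paired test would compare, for instance, the 26 slowdown values of a Memory+AI configuration against the 26 values of the matching CPU+AI configuration.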

---

## Project Structure

```
hezuo/
│
├── 📄 Theory documents
│   ├── mixed-workload-interference-quantification.md    # Interference quantification methodology
│   ├── ai-agent-workload-characterization-model.md      # AI agent workload characterization model
│   └── experimental-validation-design.md                # Experimental validation plan
│
├── 📁 Experiment framework (experiment_framework/)
│   ├── core/                   # Configuration and metric definitions
│   ├── workloads/              # Workload implementations
│   │   ├── traditional.py      # CPU/memory/IO workloads
│   │   └── ai_agent.py         # AI agent inference simulation
│   ├── monitors/               # Performance monitoring
│   │   └── perf_monitor.py     # perf/PCM integration
│   ├── isolations/             # Resource isolation
│   │   └── isolation_manager.py # cgroup/RDT/CAT
│   ├── analysis/               # Statistical analysis
│   │   └── ological_analysis.py # t-test/ANOVA/regression
│   ├── scripts/                # Experiment scripts
│   │   └── experiment_runner.py # Scheduler
│   ├── main.py                 # Command-line entry point
│   └── README.md               # Framework usage guide
│
├── 📁 Paper (paper/)
│   └── systematic_performance_optimization_for_colocated_workloads.md   # First draft (ready for submission)
│
├── 📄 Project management
│   ├── RESEARCH_PROGRESS.md    # Research progress tracking
│   └── README.md               # This document
│
└── 📁 Output directories (generated at run time)
    ├── results/                # Experiment result data
    ├── logs/                   # Experiment logs
    └── analysis_output/        # Analysis reports and charts
```

---

## Quick Start

### 1. Check the system environment

```bash
cd experiment_framework
python main.py check
```

Expected output:
```
cgroup_v2: ✓ available
rdt: ✓ available
perf: ✓ available
pcm: ✓ available
numa: ✓ available
```

### 2. Enable permissions

```bash
sudo sysctl -w kernel.perf_event_paranoid=0
sudo sysctl -w kernel.kptr_restrict=0
sudo modprobe msr  # required for RDT
```

### 3. Quick test

```bash
python main.py test --duration 10
```

### 4. Run experiments

```bash
# Single experiment
python main.py run --exp-id 1 --trad cpu_bound --agent single_agent --isolation none

# Full experiment set (108 groups × 26 repetitions)
python main.py full --output ./full_results

# Run in batches (recommended)
python main.py full --output ./batch1 --limit 50
```

### 5. Analyze results

```bash
python main.py analyze --results-dir ./full_results --output ./analysis_output
```

---

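## Core Methodology

### Interference Quantification Metrics

| Metric | Formula | Meaning |
|------|------|------|
| **Slowdown Factor (SF)** | T_co / T_solo | > 1.5 indicates severe interference |
| **Interference Factor (IF)** | (IPC_solo - IPC_co) / IPC_solo | Fractional loss of efficiency (IPC) |
| **Cache Interference (CIS)** | LLC_miss_co / LLC_miss_solo - 1 | Relative increase in LLC misses caused by cache contention |
| **Tail Latency Sensitivity (TLS)** | (P99_co - P99_solo) / P99_solo | Relative increase in P99 latency (user-experience degradation) |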
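The four formulas in the table translate directly into code. The sketch below is illustrative only: `RunStats` and `interference_metrics` are hypothetical names, not part of the framework's API, and the inputs are assumed to be measurements from one solo run and one colocated run of the same workload.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Measurements from a single run (hypothetical container type)."""
    runtime_s: float        # wall-clock execution time T
    ipc: float              # instructions per cycle
    llc_misses: float       # last-level cache miss count
    p99_latency_ms: float   # 99th-percentile latency


def interference_metrics(solo: RunStats, co: RunStats) -> dict:
    """Compute SF, IF, CIS and TLS from a solo run and a colocated run."""
    return {
        "SF": co.runtime_s / solo.runtime_s,
        "IF": (solo.ipc - co.ipc) / solo.ipc,
        "CIS": co.llc_misses / solo.llc_misses - 1,
        "TLS": (co.p99_latency_ms - solo.p99_latency_ms) / solo.p99_latency_ms,
    }


def is_severe(metrics: dict, sf_threshold: float = 1.5) -> bool:
    """Apply the SF > 1.5 severity threshold from the table."""
    return metrics["SF"] > sf_threshold
```

A value of 0 for IF, CIS, or TLS means colocation had no measurable effect on that dimension; SF equals 1 in the same case.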

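### AI Agent Workload Characteristics

```
AI Agent = LLM inference + tool calls + multi-turn dialogue + coordination overhead

Key characteristics:
├── Two phases: Prefill / Decode
│   ├── Prefill: compute-bound, high bandwidth peaks
│   └── Decode: memory-bound, low but steady bandwidth
├── Token Burstiness (BI)
│   ├── Simple tasks: BI < 0.5
│   ├── CoT reasoning: BI > 1.0
│   └── Complex planning: BI > 1.5
├── KV cache grows linearly
│   ├── 2048 tokens ≈ 2.1 GB
│   └── 32K tokens ≈ 33.6 GB
└── Multi-agent coordination overhead: 15-25%
```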
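The linear KV cache growth can be sketched as a simple proportional model anchored at the 2048-token reference point quoted above. This is a rough estimate under stated assumptions: the function names are hypothetical, "32K" is taken to mean 32 × 1024 tokens, and the per-token cost is inferred from the reference point rather than from any particular model architecture.

```python
REF_TOKENS = 2048   # reference context length from the characterization above
REF_GB = 2.1        # KV cache size at the reference length


def kv_cache_gb(tokens: int, ref_tokens: int = REF_TOKENS, ref_gb: float = REF_GB) -> float:
    """Estimate KV cache size in GB, assuming strictly linear growth in tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return ref_gb * tokens / ref_tokens


def max_tokens_for_budget(budget_gb: float, ref_tokens: int = REF_TOKENS,
                          ref_gb: float = REF_GB) -> int:
    """Largest context length whose estimated KV cache fits in budget_gb."""
    return int(budget_gb * ref_tokens / ref_gb)


if __name__ == "__main__":
    for n in (2048, 8192, 32 * 1024):
        print(f"{n:>6} tokens -> {kv_cache_gb(n):.1f} GB")
```

Such an estimate is useful when deciding how much memory headroom a colocated traditional workload can be given before the agent's context growth starts to compete for it.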

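### Interference Sensitivity Matrix

| Interference type | Reasoning tasks | Multi-Agent |
|---------|--------------|-------------|
| Bandwidth saturation | **Very high (5/5)** | High (4/5) |
| Cache contention | High (4/5) | **Very high (5/5)** |
| Latency jitter | **Very high (5/5)** | High (4/5) |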

---

## Experimental Design

### Factorial Design

```
108 groups = 3 × 3 × 4 × 3

├── Traditional workload type: CPU-bound / Memory-bound / IO-bound
├── Agent type: Single / Multi-2 / Multi-4
├── Isolation strategy: None / Pinning / CAT / Combined
└── Concurrency level: 1 / 4 / 16

26 repetitions per group → 2808 runs
```
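The full-factorial grid can be enumerated with `itertools.product`. The sketch below is illustrative: only `cpu_bound`, `single_agent`, and `none` appear in the CLI example in this README; the remaining level identifiers and the dictionary layout are assumptions made for the example.

```python
from itertools import product

# Factor levels; identifiers other than cpu_bound, single_agent and none
# are hypothetical placeholders.
TRADITIONAL = ["cpu_bound", "memory_bound", "io_bound"]
AGENTS = ["single_agent", "multi_2", "multi_4"]
ISOLATION = ["none", "pinning", "cat", "combined"]
CONCURRENCY = [1, 4, 16]
REPETITIONS = 26


def build_design():
    """Return one dict per configuration of the 3 x 3 x 4 x 3 factorial design."""
    grid = product(TRADITIONAL, AGENTS, ISOLATION, CONCURRENCY)
    return [
        {"exp_id": i, "trad": t, "agent": a, "isolation": s, "concurrency": c}
        for i, (t, a, s, c) in enumerate(grid, start=1)
    ]


if __name__ == "__main__":
    design = build_design()
    print(len(design), "configurations,", len(design) * REPETITIONS, "runs in total")
```

Enumerating the grid up front makes it straightforward to split the run list into batches or to resume after an interrupted batch.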

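### Estimated Time

| Stage | Configurations | Time |
|------|--------|------|
| Solo baselines | 6 groups | 6 hours |
| Colocation, no isolation | 27 groups | 27 hours |
| CPU Pinning | 27 groups | 27 hours |
| Cache Partition | 27 groups | 27 hours |
| Combined | 27 groups | 27 hours |
| **Total** | 108 colocation groups + 6 baselines | ~114 hours |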

---

## Isolation Strategy Configuration

### CPU Pinning

```bash
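# Traditional workload: cores 0-7, NUMA node 0
# AI workload: cores 8-15, NUMA node 1 (create a second cgroup the same way)

# cgroup v2: enable the cpuset controller for child groups (requires root)
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/traditional
echo 0-7 | sudo tee /sys/fs/cgroup/traditional/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/traditional/cpuset.mems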
```
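For quick experiments, CPU affinity can also be set from Python without creating a cgroup. The sketch below is a Linux-only illustration with hypothetical helper names; it parses a cpuset-style list such as `0-7` and applies it with `os.sched_setaffinity`. Unlike the cpuset cgroup, it does not bind memory to a NUMA node.

```python
import os


def parse_cpu_list(spec: str) -> set:
    """Parse a cpuset-style list such as '0-7' or '0,2,8-9' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_process(pid: int, spec: str) -> set:
    """Pin pid to the requested CPUs (Linux only); pid 0 means the calling process.

    The request is intersected with the CPUs the calling process may currently
    use, so the call does not fail on machines with fewer cores.
    """
    usable = parse_cpu_list(spec) & os.sched_getaffinity(0)
    if not usable:
        raise ValueError(f"none of the CPUs in {spec!r} are available")
    os.sched_setaffinity(pid, usable)
    return usable
```

For example, `pin_process(0, "8-15")` would restrict the calling process to cores 8-15 where those cores exist.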

### Cache Partition (RDT/CAT)

```bash
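# Traditional workload: LLC ways 0-9 (50%)
# AI workload: LLC ways 10-19 (50%)

# Define the cache masks for classes of service 0 and 1 (-e), then
# associate each PID with its class through the OS interface (-I -a)
sudo pqos -I -e 'llc:0=0x003ff;llc:1=0xffc00'
sudo pqos -I -a 'pid:0=<pid_trad>'
sudo pqos -I -a 'pid:1=<pid_ai>'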
```
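CAT capacity bitmasks assign one bit per LLC way, and the set bits must be contiguous. The helper below is a small sketch (hypothetical function names) for deriving the mask of a way range and for checking which ways a given mask covers. On a 20-way cache, ways 0-9 correspond to `0x3ff` and ways 10-19 to `0xffc00`.

```python
def way_mask(first_way: int, last_way: int) -> int:
    """Contiguous capacity bitmask covering LLC ways first_way..last_way inclusive."""
    if first_way < 0 or last_way < first_way:
        raise ValueError("expected 0 <= first_way <= last_way")
    width = last_way - first_way + 1
    return ((1 << width) - 1) << first_way


def ways_in_mask(mask: int) -> list:
    """List the way indices whose bits are set in a capacity bitmask."""
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]


if __name__ == "__main__":
    print(hex(way_mask(0, 9)))     # mask for the traditional workload
    print(hex(way_mask(10, 19)))   # mask for the AI workload
```

Checking a mask with `ways_in_mask` before applying it helps catch partitions that are narrower than intended; `0x3c00`, for instance, covers only ways 10-13.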

---

## Python API Usage Example

```python
# AgentWorkloadType and IsolationStrategy are assumed to be defined in
# core.config alongside the other configuration types.
from scripts import ExperimentScheduler
from analysis import StatisticalAnalyzer
from core.config import (
    AgentWorkloadType,
    ExperimentConfig,
    IsolationStrategy,
    TraditionalWorkloadType,
)

# Run a single experiment
config = ExperimentConfig(
    experiment_id=1,
    traditional_type=TraditionalWorkloadType.CPU_BOUND,
    agent_type=AgentWorkloadType.SINGLE_AGENT,
    isolation_strategy=IsolationStrategy.CACHE_PARTITION,
)

scheduler = ExperimentScheduler()
result = scheduler.run_single_experiment(config)

# Analyze the collected data
analyzer = StatisticalAnalyzer()
analyzer.load_from_results("./results")
analyzer.generate_visualizations("./plots")
analyzer.generate_report("report.txt")
```

---

## Current Status

```
Theoretical framework: ████████████████████ 100% ✅
Code framework:        ████████████████████ 100% ✅
Experiment execution:  ░░░░░░░░░░░░░░░░░░░░   0% ⏳
Data analysis:         ░░░░░░░░░░░░░░░░░░░░   0% ⏳
Paper writing:         ████████████████░░░░  80% 🔄
```

See [RESEARCH_PROGRESS.md](RESEARCH_PROGRESS.md) for details.

---

## Related Paper

Paper produced by this project:

> **Systematic Performance Optimization for Colocated Traditional and AI Agent Workloads on CPU: A Comprehensive Methodology**
>
> Target venues: USENIX ATC / ASPLOS / SC
>
> File: [paper/systematic_performance_optimization_for_colocated_workloads.md](paper/systematic_performance_optimization_for_colocated_workloads.md)

---

## References

### Tool Documentation

- [Intel RDT](https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html)
- [Linux perf](https://perf.wiki.kernel.org/)
- [PCM](https://github.com/opcm/pcm)
- [vLLM](https://github.com/vllm-project/vllm)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)

### Academic Papers

- Google Borg/Omega scheduling systems
- Quasar interference-aware scheduling
- vLLM PagedAttention
- Speculative decoding techniques
- AutoGen multi-agent framework

---

## Contact

- **Researcher**: Wei Li
- **Project path**: `/home/liwei/hezuo`

---

> Created: 2026-04-29  
> Project status: theory and framework complete; experimental validation pending