# llm-guard

**Repository Path**: ethan519013/llm-guard

## Basic Information

- **Project Name**: llm-guard
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 6
- **Created**: 2025-12-23
- **Last Updated**: 2025-12-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

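# 🔒 Sensitive Information Detection Engine

A Python-based system for detecting personal sensitive information. It uses a three-tier detection pipeline (regex / BERT / LLM) and supports online training and hot model updates.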

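## ✨ Features

- **Three-tier detection**
  - Tier 1: regular-expression detection (high performance, exact matching)
  - Tier 2: small BERT model detection (semantic understanding)
  - Tier 3: LLM fallback adjudication (complex cases)

- **MLOps capabilities**
  - Online collection of incremental samples
  - Retraining of the small model
  - Model version management and hot updates

- **Supported sensitive data types**
  - National ID number
  - Mobile phone number
  - Email address
  - Bank card number
  - Address
  - Name
  - Passport number
  - IP address
  - Credit card number
  - Extensible with further types

- **Enterprise features**
  - GPU support (automatic fallback to CPU)
  - Full audit logging
  - Prometheus metrics
  - One-command Docker deployment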
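
The three-tier flow in the feature list can be pictured as a cascade in which each tier either returns a decision or defers to the next, more expensive tier. The following is a minimal sketch under stated assumptions, not the project's orchestrator: `TierResult`, the tier functions, the simplified phone pattern, and the stubbed BERT and LLM behaviour are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TierResult:
    detector: str       # which tier produced the decision
    is_sensitive: bool
    confidence: float


# A tier returns a result when it can decide, or None to defer to the next tier.
Tier = Callable[[str], Optional[TierResult]]

# Simplified pattern for mainland China mobile numbers (illustrative only).
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def regex_tier(text: str) -> Optional[TierResult]:
    if PHONE_RE.search(text):
        return TierResult("regex", True, 0.95)
    return None  # no exact match: let the model tiers decide


def bert_tier(text: str) -> Optional[TierResult]:
    # Stub standing in for a BERT classifier. It never decides here,
    # so every undecided text falls through to the LLM tier.
    return None


def llm_tier(text: str) -> Optional[TierResult]:
    # Stub standing in for an LLM call. As the last tier it always decides.
    return TierResult("llm", "住在" in text, 0.6)


def run_cascade(text: str, tiers: Sequence[Tier]) -> TierResult:
    for tier in tiers:
        result = tier(text)
        if result is not None:
            return result  # short-circuit: cheaper tiers run first
    return TierResult("none", False, 0.0)


TIERS = (regex_tier, bert_tier, llm_tier)
print(run_cascade("手机号13812345678", TIERS))
print(run_cascade("我住在北京市朝阳区", TIERS))
```

Ordering the tiers from cheapest to most expensive means the LLM is only consulted for text that the earlier tiers could not settle.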

## 📁 Project Structure

```
project_root/
├── app/
│   ├── api/
│   │   ├── router.py          # API routes
│   │   └── schemas.py         # Data models
│   ├── core/
│   │   ├── config.py          # Configuration management
│   │   ├── logger.py          # Logging
│   │   └── thresholds.py      # Threshold configuration
│   ├── detectors/
│   │   ├── base.py            # Detector base class
│   │   ├── regex_detector.py  # Regex detector
│   │   ├── bert_detector.py   # BERT detector
│   │   └── llm_detector.py    # LLM detector
│   ├── services/
│   │   ├── orchestrator.py    # Detection orchestrator
│   │   ├── model_manager.py   # Model management
│   │   └── sample_manager.py  # Sample management
│   ├── training/
│   │   ├── trainer.py         # Model trainer
│   │   ├── dataset.py         # Dataset handling
│   │   └── augmentation.py    # Data augmentation
│   └── tests/
│       ├── test_api.py        # API tests
│       └── test_detectors.py  # Detector tests
├── config/                     # Configuration files
├── data/                       # Data
├── logs/                       # Logs
├── models/                     # Models
├── main.py                     # Application entry point
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```

## 🚀 Quick Start

### Requirements

- Python 3.10+
- CUDA 11.8+ (optional, for GPU acceleration)

### Install dependencies

```bash
# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt
```

### Configure environment variables

```bash
# Copy the example configuration
cp .env.example .env

# Edit the configuration file
vim .env
```

### Start the service

```bash
# Development mode
python main.py

# Or use uvicorn
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

### Docker deployment

```bash
# Build and start (CPU version)
docker-compose up -d sensitive-detector

# GPU version
docker-compose --profile gpu up -d sensitive-detector-gpu

# Start the full monitoring stack
docker-compose --profile monitoring up -d
```

## 📚 API Documentation

Once the service is running, open:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

### Core endpoints

#### Detection

```bash
# POST /api/v1/detect
curl -X POST "http://localhost:8000/api/v1/detect" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "我的手机号是13812345678，邮箱是test@example.com"
  }'
```

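Example response (in this example, `start_pos` and `end_pos` are zero-based character offsets into the request text, with `end_pos` exclusive):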

```json
{
  "request_id": "uuid",
  "is_sensitive": true,
  "risk_level": "high",
  "overall_confidence": 0.95,
  "matches": [
    {
      "type": "phone",
      "value": "138****5678",
      "start_pos": 6,
      "end_pos": 17,
      "confidence": 0.95,
      "detector": "regex"
    },
    {
      "type": "email",
      "value": "t***@example.com",
      "start_pos": 21,
      "end_pos": 37,
      "confidence": 0.99,
      "detector": "regex"
    }
  ],
  "detector_results": [...],
  "total_latency_ms": 15.5,
  "model_version": "v1.0.0"
}
```
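
The `value` fields in the response are masked rather than returned in full. The README does not document the masking rule, so the sketch below is only one rule that happens to reproduce the two masked values shown. The function names `mask_phone` and `mask_email` are hypothetical.

```python
def mask_phone(phone: str) -> str:
    """Keep the first three and last four characters, e.g. 13812345678 -> 138****5678."""
    if len(phone) < 8:
        return "*" * len(phone)  # too short to keep both ends safely
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the whole domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "*" * len(email)  # not a recognisable address: mask everything
    return local[0] + "*" * (len(local) - 1) + "@" + domain


print(mask_phone("13812345678"))       # 138****5678
print(mask_email("test@example.com"))  # t***@example.com
```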

#### Batch detection

```bash
# POST /api/v1/detect/batch
curl -X POST "http://localhost:8000/api/v1/detect/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "手机号13812345678",
      "普通文本"
    ]
  }'
```
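
The two detection endpoints can also be called from Python. The sketch below uses only the standard library and the URLs and payload fields from the `curl` examples. The helper names `build_request` and `post_json` are hypothetical, and the response is assumed to be JSON as in the example response.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


single = build_request("/detect", {"text": "手机号13812345678"})
batch = build_request("/detect/batch", {"texts": ["手机号13812345678", "普通文本"]})
# With the service running:
#   result = post_json("/detect", {"text": "手机号13812345678"})
```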

#### Add a sample

```bash
# POST /api/v1/samples
curl -X POST "http://localhost:8000/api/v1/samples" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "测试手机号13800138000",
    "sensitive_type": "phone",
    "is_sensitive": true,
    "source": "manual"
  }'
```
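
To make the sample payload concrete, here is a minimal sketch of how such samples could be stored for later retraining. The README only mentions that a sample database exists, so the SQLite backend, the table layout, and the function names are assumptions made for illustration.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    text TEXT NOT NULL,
    sensitive_type TEXT,
    is_sensitive INTEGER NOT NULL,
    source TEXT NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_sample(conn: sqlite3.Connection, text: str, is_sensitive: bool,
               sensitive_type: Optional[str] = None, source: str = "manual") -> int:
    """Insert one labelled sample and return its row id."""
    if not text.strip():
        raise ValueError("text must not be empty")
    cur = conn.execute(
        "INSERT INTO samples (text, sensitive_type, is_sensitive, source) "
        "VALUES (?, ?, ?, ?)",
        (text, sensitive_type, int(is_sensitive), source),
    )
    conn.commit()
    return cur.lastrowid


def count_samples(conn: sqlite3.Connection, sensitive_type: Optional[str] = None) -> int:
    """Count all samples, or only those of one sensitive type."""
    if sensitive_type is None:
        row = conn.execute("SELECT COUNT(*) FROM samples").fetchone()
    else:
        row = conn.execute(
            "SELECT COUNT(*) FROM samples WHERE sensitive_type = ?", (sensitive_type,)
        ).fetchone()
    return row[0]


store = open_store()
add_sample(store, "测试手机号13800138000", True, sensitive_type="phone")
add_sample(store, "普通文本", False)
```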

#### Start training

```bash
# POST /api/v1/training/start
curl -X POST "http://localhost:8000/api/v1/training/start" \
  -H "Content-Type: application/json" \
  -d '{
    "epochs": 10,
    "batch_size": 32,
    "incremental": false
  }'
```

#### Get training status

```bash
# GET /api/v1/training/{training_id}
curl "http://localhost:8000/api/v1/training/{training_id}"
```

#### Update thresholds

```bash
# PUT /api/v1/thresholds
curl -X PUT "http://localhost:8000/api/v1/thresholds" \
  -H "Content-Type: application/json" \
  -d '{
    "type_name": "phone",
    "bert_high_threshold": 0.9
  }'
```

#### Reload the model

```bash
# POST /api/v1/models/reload
curl -X POST "http://localhost:8000/api/v1/models/reload"
```
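
A reload endpoint like this one implies swapping the active model while requests are still being served. One common way to do that is to load the new model first and then replace a single reference under a lock. The sketch below illustrates that idea only. `HotSwappableModel` and the callable-based `Model` type are hypothetical and are not taken from `model_manager.py`.

```python
import threading
from typing import Callable, Tuple

# For illustration a "model" is any callable that scores a text.
Model = Callable[[str], float]


class HotSwappableModel:
    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._current: Tuple[Model, str] = (model, version)

    def reload(self, loader: Callable[[], Tuple[Model, str]]) -> str:
        # Load outside the lock so in-flight predictions are not blocked.
        # If loading raises, the previous model stays active.
        new_model, new_version = loader()
        with self._lock:
            self._current = (new_model, new_version)
        return new_version

    def predict(self, text: str) -> Tuple[float, str]:
        # Take a consistent (model, version) pair, then run without the lock.
        with self._lock:
            model, version = self._current
        return model(text), version


manager = HotSwappableModel(lambda text: 0.1, "v1.0.0")
before = manager.predict("example")
manager.reload(lambda: (lambda text: 0.9, "v1.0.1"))
after = manager.predict("example")
```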

#### Health check

```bash
# GET /api/v1/health
curl "http://localhost:8000/api/v1/health"
```

#### Metrics

```bash
# GET /api/v1/metrics
curl "http://localhost:8000/api/v1/metrics"
```

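## 🔧 Configuration

### Detection strategy

| Setting | Description | Default |
|--------|------|--------|
| `ENABLE_LLM_FALLBACK` | Whether to enable the LLM fallback tier | `true` |
| `REGEX_CONFIDENCE_THRESHOLD` | Confidence threshold for regex matches | `0.95` |
| `BERT_HIGH_THRESHOLD` | BERT high-confidence threshold | `0.85` |
| `BERT_LOW_THRESHOLD` | BERT low-confidence threshold | `0.3` |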
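
The README does not spell out how the two BERT thresholds are applied. A plausible reading, given the LLM fallback setting, is that scores at or above the high threshold are accepted, scores at or below the low threshold are rejected, and the band in between is escalated. The sketch below encodes that assumed rule. The function name and the returned labels are hypothetical.

```python
def route_bert_score(score: float, high: float = 0.85, low: float = 0.3,
                     llm_fallback: bool = True) -> str:
    """Map a BERT confidence score to a decision under the assumed two-threshold rule."""
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")
    if score >= high:
        return "sensitive"
    if score <= low:
        return "not_sensitive"
    # Uncertain band: defer to the LLM tier only when the fallback is enabled.
    return "escalate_to_llm" if llm_fallback else "uncertain"


for s in (0.92, 0.5, 0.1):
    print(s, route_bert_score(s))
```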

### Model

| Setting | Description | Default |
|--------|------|--------|
| `USE_GPU` | Whether to use the GPU | `true` |
| `INFERENCE_BACKEND` | Inference backend (`pytorch` or `onnx`) | `pytorch` |
| `BERT_MODEL_PATH` | Path to the BERT model | `models/bert-sensitive` |

### LLM

| Setting | Description | Default |
|--------|------|--------|
| `LLM_PROVIDER` | LLM provider | `openai` |
| `OPENAI_API_KEY` | OpenAI API key | - |
| `OPENAI_MODEL` | Model to use | `gpt-3.5-turbo` |
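
As a rough illustration of how the settings in these tables could be read from the environment, the sketch below uses the documented names and defaults. The `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions and may differ from the project's `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Accepted spellings are an assumption of this sketch.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    enable_llm_fallback: bool
    regex_confidence_threshold: float
    bert_high_threshold: float
    bert_low_threshold: float
    use_gpu: bool
    inference_backend: str
    bert_model_path: str
    llm_provider: str
    openai_api_key: Optional[str]
    openai_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from a mapping, falling back to the documented defaults."""
    backend = env.get("INFERENCE_BACKEND", "pytorch")
    if backend not in {"pytorch", "onnx"}:
        raise ValueError(f"unsupported INFERENCE_BACKEND: {backend!r}")
    return Settings(
        enable_llm_fallback=_as_bool(env.get("ENABLE_LLM_FALLBACK", "true")),
        regex_confidence_threshold=float(env.get("REGEX_CONFIDENCE_THRESHOLD", "0.95")),
        bert_high_threshold=float(env.get("BERT_HIGH_THRESHOLD", "0.85")),
        bert_low_threshold=float(env.get("BERT_LOW_THRESHOLD", "0.3")),
        use_gpu=_as_bool(env.get("USE_GPU", "true")),
        inference_backend=backend,
        bert_model_path=env.get("BERT_MODEL_PATH", "models/bert-sensitive"),
        llm_provider=env.get("LLM_PROVIDER", "openai"),
        openai_api_key=env.get("OPENAI_API_KEY") or None,  # no default: must be supplied
        openai_model=env.get("OPENAI_MODEL", "gpt-3.5-turbo"),
    )


defaults = load_settings({})
overridden = load_settings({"USE_GPU": "false", "BERT_LOW_THRESHOLD": "0.2"})
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test.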

## 🧪 Testing

```bash
# Run all tests
pytest app/tests/ -v

# Run the API tests
pytest app/tests/test_api.py -v

# Run the detector tests
pytest app/tests/test_detectors.py -v

# Generate a coverage report
pytest --cov=app --cov-report=html
```

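## 🏗️ Extension Guide

### Adding a new sensitive type

1. Add a regex rule in `regex_detector.py`:

```python
import re  # needed for re.compile if the module does not import it already

# `self` is the regex detector instance
self.add_rule(RegexRule(
    name="custom_type",
    sensitive_type="custom",
    pattern=re.compile(r'YOUR_PATTERN'),
    confidence=0.95,
))
```

2. Add a label to `LABEL_MAP` in `dataset.py`:

```python
LABEL_MAP = {
    # ... existing labels ...
    "custom": 11,
}
```

3. Update `enabled_sensitive_types` in the configuration file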
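
To show what a rule of this shape does when it runs, here is a self-contained sketch. `RegexRule` and `MiniRegexDetector` are simplified stand-ins written for this example, not the classes in `regex_detector.py`, and both patterns are illustrative rather than production grade.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RegexRule:  # stand-in with the same field names as the snippet above
    name: str
    sensitive_type: str
    pattern: re.Pattern
    confidence: float


class MiniRegexDetector:
    def __init__(self) -> None:
        self.rules: List[RegexRule] = []

    def add_rule(self, rule: RegexRule) -> None:
        self.rules.append(rule)

    def detect(self, text: str) -> List[Tuple[str, int, int, float]]:
        """Return (type, start, end, confidence) for every match, ordered by position."""
        hits = []
        for rule in self.rules:
            for m in rule.pattern.finditer(text):
                hits.append((rule.sensitive_type, m.start(), m.end(), rule.confidence))
        return sorted(hits, key=lambda hit: hit[1])


detector = MiniRegexDetector()
detector.add_rule(RegexRule(
    name="cn_mobile",
    sensitive_type="phone",
    pattern=re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    confidence=0.95,
))
detector.add_rule(RegexRule(
    name="email",
    sensitive_type="email",
    pattern=re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    confidence=0.99,
))

hits = detector.detect("我的手机号是13812345678，邮箱是test@example.com")
print(hits)  # [('phone', 6, 17, 0.95), ('email', 21, 37, 0.99)]
```

The offsets match the `start_pos` and `end_pos` values in the example response of the detection endpoint.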

### Adding a new detector

1. Subclass `BaseDetector`:

```python
from app.detectors.base import BaseDetector, DetectionResult

class CustomDetector(BaseDetector):
    async def detect(self, text: str, detect_types=None) -> DetectionResult:
        # Implement the detection logic and return a DetectionResult
        raise NotImplementedError
```

2. Register the new detector in `orchestrator.py`
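
A filled-in detector might look like the sketch below. Because the real `BaseDetector` and `DetectionResult` are not shown in this README, the versions here are minimal stand-ins with assumed fields, and `KeywordDetector` is a hypothetical example.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


# Stand-ins for app.detectors.base. The real classes may have different fields.
@dataclass
class DetectionResult:
    detector: str
    is_sensitive: bool
    matched_types: List[str] = field(default_factory=list)


class BaseDetector(ABC):
    @abstractmethod
    async def detect(self, text: str,
                     detect_types: Optional[List[str]] = None) -> DetectionResult:
        ...


class KeywordDetector(BaseDetector):
    """Flags text that contains any configured keyword for a sensitive type."""

    def __init__(self, keywords: Dict[str, List[str]]) -> None:
        self.keywords = keywords

    async def detect(self, text: str,
                     detect_types: Optional[List[str]] = None) -> DetectionResult:
        lowered = text.lower()
        hits = [
            stype
            for stype, words in self.keywords.items()
            # Honour the optional type filter, then look for any keyword.
            if (detect_types is None or stype in detect_types)
            and any(word.lower() in lowered for word in words)
        ]
        return DetectionResult("keyword", bool(hits), hits)


detector = KeywordDetector({"passport": ["护照号", "passport no"]})
result = asyncio.run(detector.detect("Passport No: E12345678"))
print(result)
```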

### Adding a new LLM provider

1. Subclass `BaseLLMProvider`:

```python
from app.detectors.llm_detector import BaseLLMProvider

class CustomLLMProvider(BaseLLMProvider):
    async def generate(self, prompt: str, **kwargs) -> str:
        # Implement the generation logic and return the model's text output
        raise NotImplementedError

    def is_available(self) -> bool:
        return True
```

2. Add the provider in `LLMDetector._init_provider()`
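
The two steps can be sketched end to end with an offline provider that returns a fixed reply, which is convenient for tests that should not call a real LLM. Everything here is hypothetical: the `BaseLLMProvider` stand-in, `CannedLLMProvider`, the `PROVIDERS` registry, and the JSON reply format are assumptions, and the real `_init_provider()` may select providers differently.

```python
import asyncio
import json
from abc import ABC, abstractmethod
from typing import Callable, Dict


class BaseLLMProvider(ABC):  # stand-in for app.detectors.llm_detector.BaseLLMProvider
    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        ...

    @abstractmethod
    def is_available(self) -> bool:
        ...


class CannedLLMProvider(BaseLLMProvider):
    """Offline provider that always returns the same JSON verdict."""

    def __init__(self, verdict: dict) -> None:
        self._reply = json.dumps(verdict, ensure_ascii=False)

    async def generate(self, prompt: str, **kwargs) -> str:
        return self._reply

    def is_available(self) -> bool:
        return True


# Name-to-factory registry, so providers are only constructed when selected.
PROVIDERS: Dict[str, Callable[[], BaseLLMProvider]] = {
    "canned": lambda: CannedLLMProvider({"is_sensitive": False, "types": []}),
}


def init_provider(name: str) -> BaseLLMProvider:
    try:
        factory = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown LLM provider: {name}") from None
    provider = factory()
    if not provider.is_available():
        raise RuntimeError(f"LLM provider {name} is not available")
    return provider


provider = init_provider("canned")
reply = json.loads(asyncio.run(provider.generate("Is this text sensitive?")))
```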

## 📊 Monitoring

### Prometheus metrics

- `http_requests_total`: total number of HTTP requests
- `detection_latency_seconds`: detection latency
- `detection_results`: detection result statistics
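
For readers unfamiliar with what a scrape of such metrics returns, the sketch below renders a labelled counter in the Prometheus text exposition format using only the standard library. It is an illustration, not the project's implementation. A real service would normally use a Prometheus client library, and the label names `method` and `path` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def _escape(value: str) -> str:
    # Label values escape backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """Minimal monotonically increasing counter with labels."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Dict[Tuple[Tuple[str, str], ...], float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{_escape(v)}"' for k, v in key)
            suffix = "{" + label_str + "}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("http_requests_total", "Total number of HTTP requests")
requests_total.inc(method="POST", path="/api/v1/detect")
requests_total.inc(method="POST", path="/api/v1/detect")
requests_total.inc(method="GET", path="/api/v1/health")
print(requests_total.render())
```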

### Logs

- Application log: `logs/app.log`
- Audit log: `logs/audit.log`
- Error log: `logs/error.log`
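
One way to produce these three files with the standard `logging` module is sketched below, with size-based rotation so the files do not grow without bound. The logger names `app` and `audit`, the format string, and the rotation limits are assumptions and are not taken from `logger.py`.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import Tuple


def setup_logging(log_dir: str = "logs", max_bytes: int = 10_000_000,
                  backups: int = 5) -> Tuple[logging.Logger, logging.Logger]:
    """Create app/error/audit log files. Call once at startup to avoid duplicate handlers."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    def handler(filename: str, level: int) -> RotatingFileHandler:
        h = RotatingFileHandler(directory / filename, maxBytes=max_bytes,
                                backupCount=backups, encoding="utf-8")
        h.setLevel(level)
        h.setFormatter(fmt)
        return h

    app = logging.getLogger("app")
    app.setLevel(logging.INFO)
    app.addHandler(handler("app.log", logging.INFO))     # everything from INFO up
    app.addHandler(handler("error.log", logging.ERROR))  # errors only

    audit = logging.getLogger("audit")
    audit.setLevel(logging.INFO)
    audit.propagate = False  # keep audit records out of root-level handlers
    audit.addHandler(handler("audit.log", logging.INFO))
    return app, audit


# Usage at startup:
#   app_log, audit_log = setup_logging()
#   audit_log.info("thresholds updated by admin")
```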

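## 🔐 Security Recommendations

1. Disable debug mode in production (`DEBUG=false`)
2. Configure a CORS allowlist
3. Enable HTTPS
4. Rotate logs regularly
5. Supply sensitive configuration through environment variables
6. Back up the sample database regularly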

## 📝 License

MIT License

## 🤝 Contributing

Issues and pull requests are welcome!

## 📮 Contact

For questions, open an issue or contact the maintainers.