# boom-rag

**Repository Path**: yangzijing/boom-rag

## Basic Information

- **Project Name**: boom-rag
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-02-05
- **Last Updated**: 2026-02-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Boom RAG

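A RAG (retrieval-augmented generation) system for financial research reports. Rebuilt on the LlamaIndex framework, it supports PDF document parsing, vector retrieval, hybrid search (vector + BM25), and S3/MinIO object storage. It is built for local, private deployment and serves small teams of investment research analysts.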

## CLI Usage

The system provides a command-line tool, `main.py`, which supports document ingestion and search.

### Install Dependencies

```bash
uv pip install -e ".[dev]"
```

### Configure Environment Variables

```bash
cp .env.example .env
# Edit .env to configure the database, Qdrant, and model parameters
```

### Common Commands

#### 1. Ingest Documents

```bash
# Ingest local PDF documents (default path: ./data/documents)
python main.py ingest --path ./data/documents --recursive

# Ingest a single file
python main.py ingest --path ./data/documents/report.pdf

# Use semantic chunking
python main.py ingest --path ./data/documents --parser semantic

# Recreate the collection
python main.py ingest --path ./data/documents --recreate

# Ingest from S3/MinIO
python main.py ingest --source s3 --path stock/research_reports
```

#### 2. Search Documents

```bash
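# Hybrid search (default; vector + BM25)
python main.py search "impact of Fed rate cuts"

# Vector search (semantic search)
python main.py search "new energy vehicle industry trends" --mode vector

# Text search (BM25)
python main.py search "AI investment opportunities" --mode text

# Set the number of results to return
python main.py search "macroeconomy" --top-k 20

# Specify a collection
python main.py search "earnings report" --collection my-collection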
```

#### 3. View Index Information

```bash
# Show collection information
python main.py info

# Reset the collection
python main.py reset --recreate
```

#### 4. Test the Embedding Model

```bash
# Test the embedding model
python main.py embed-test --provider ollama --model-name qwen3-embedding:0.6b
```

## Python API Usage

`main.py` also exposes a Python API that can be integrated directly into your own code.

### Quick Start

```python
from main import create_rag_pipeline, run_ingest, run_search
```

### Method 1: Use the Factory Function

```python
from main import create_rag_pipeline

# Create the pipeline
pipeline = create_rag_pipeline()

# Ingest documents
result = pipeline.ingest_local("./data/pdf")
print(f"Documents: {result['documents']}, chunks: {result['chunks']}")

# Search documents (hybrid retrieval by default)
results = pipeline.search("industry trend analysis")

# Specify the retrieval mode
results = pipeline.search("new energy vehicles", mode="vector")  # vector retrieval
results = pipeline.search("new energy vehicles", mode="text")    # BM25 text retrieval
results = pipeline.search("new energy vehicles", mode="hybrid")  # hybrid retrieval (default)

# Use the convenience methods
results = pipeline.search_hybrid("industry trend analysis")
results = pipeline.search_vector("new energy")
results = pipeline.search_text("earnings report")
```

### Method 2: Use the Convenience Functions

```python
from main import run_ingest, run_search

# Quick ingestion
result = run_ingest("./data/pdf")

# Quick search
results = run_search("industry trend analysis")
results = run_search("industry trend analysis", mode="vector", top_k=10)
```

### Method 3: Use the RAGPipeline Class Directly

```python
from main import RAGPipeline

# Create the pipeline (the configuration can be customized)
pipeline = RAGPipeline(
    embed_model="qwen3-embedding:0.6b",
    chunk_size=512,
    chunk_overlap=50,
    collection_name="boom-rag",
)

# Ingest documents
result = pipeline.ingest(
    source="./data/pdf",
    source_type="local",
    recursive=True,
    parser_type="sentence",
)

# Search and print the results
results = pipeline.search(
    query="impact of Fed rate cuts",
    mode="hybrid",
    top_k=5,
    print_results=True,
)

# Get collection information
info = pipeline.get_collection_info()
print(f"Vector count: {info.get('points_count')}")

# Reset the collection
pipeline.reset_collection()
```

### API Reference

#### RAGPipeline Class

| Method                                                                     | Description                   |
| -------------------------------------------------------------------------- | ----------------------------- |
| `ingest(source, source_type, recursive, parser_type, recreate_collection)` | Run document ingestion        |
| `ingest_local(directory, recursive, parser_type, recreate_collection)`     | Ingest from a local directory |
| `ingest_s3(bucket, parser_type, recreate_collection)`                      | Ingest from S3/MinIO          |
| `search(query, mode, top_k, print_results)`                                | Search documents              |
| `search_vector(query, top_k, print_results)`                               | Vector retrieval              |
| `search_text(query, top_k, print_results)`                                 | BM25 text retrieval           |
| `search_hybrid(query, top_k, print_results)`                               | Hybrid retrieval              |
| `get_collection_info()`                                                    | Get collection information    |
| `reset_collection()`                                                       | Reset the collection          |

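#### Retrieval Modes

| Mode     | Description                                   |
| -------- | --------------------------------------------- |
| `vector` | Vector retrieval based on semantic similarity |
| `text`   | BM25 text retrieval with keyword matching     |
| `hybrid` | Hybrid retrieval with RRF fusion (default)    |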
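
To make the RRF fusion behind `hybrid` mode more concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists of document IDs. The function name `rrf_fuse`, the constant `k=60` (a commonly used value), and the sample IDs are assumptions for illustration only; this README does not state how boom-rag configures its fusion.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to doc ID order.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical result lists from the vector and BM25 retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

for doc_id, score in rrf_fuse([vector_hits, bm25_hits]):
    print(f"{doc_id}: {score:.4f}")
```

Because RRF uses only rank positions, it does not need the vector and BM25 scores to be on a comparable scale, which is why it is a popular choice for hybrid retrieval.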

## Starting with Docker

```bash
docker-compose up -d
```

## API Service

```bash
# Start the FastAPI service
uvicorn app.main:app --host 0.0.0.0 --port 8000

# Start the Streamlit frontend
streamlit run frontend/app.py
```

## API Endpoints

| Endpoint         | Method | Description           |
| ---------------- | ------ | --------------------- |
| `/api/documents` | GET    | List all documents    |
| `/api/documents` | POST   | Upload a new document |
| `/api/search`    | POST   | Run a search          |
| `/api/ingest`    | POST   | Trigger ingestion     |
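
As a hedged example of calling the `/api/search` endpoint from Python, the sketch below builds a POST request with the standard library. The JSON field names (`query`, `mode`, `top_k`) and the `localhost` base URL are assumptions that mirror the CLI options and the `uvicorn` port shown above; the actual request schema is not documented here, so check the FastAPI app before relying on it.

```python
import json
import urllib.request


def build_search_request(query, mode="hybrid", top_k=10,
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /api/search endpoint.

    The JSON field names are assumptions, not a documented schema.
    """
    payload = {"query": query, "mode": mode, "top_k": top_k}
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("impact of Fed rate cuts", mode="vector", top_k=5)
print(request.get_method(), request.full_url)

# To send it once the service is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```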

## CLI Parameter Reference

### ingest Command

| Parameter         | Description                       | Default          |
| ----------------- | --------------------------------- | ---------------- |
| `--path`          | File or directory path            | ./data/documents |
| `--source`        | Data source (local/s3)            | local            |
| `--recursive`     | Scan subdirectories recursively   | True             |
| `--parser`        | Chunking type (sentence/semantic) | sentence         |
| `--chunk-size`    | Chunk size                        | 512              |
| `--chunk-overlap` | Chunk overlap                     | 50               |
| `--recreate`      | Recreate the collection           | False            |
| `--collection`    | Collection name                   | boom_rag         |
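
To show how `--chunk-size` and `--chunk-overlap` interact, here is a simplified sliding-window sketch over a list of tokens. It is an illustration only: the helper `chunk_tokens` is hypothetical, and the real `sentence` and `semantic` parsers are likely to respect sentence or semantic boundaries, which this sketch ignores. Whether boom-rag counts tokens or characters is also not stated here.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)  # defaults match the table: 512 and 50
print(len(chunks), [len(chunk) for chunk in chunks])  # 3 [512, 512, 276]
```

With the defaults, each window advances by 462 tokens, so the last 50 tokens of one chunk reappear at the start of the next. That overlap keeps context that straddles a chunk boundary retrievable from either side.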

### search Command

| Parameter      | Description                      | Default  |
| -------------- | -------------------------------- | -------- |
| `query`        | Search query text                | -        |
| `--mode`       | Search mode (vector/text/hybrid) | hybrid   |
| `--top-k`      | Number of results to return      | 10       |
| `--collection` | Collection name                  | boom_rag |
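
For intuition about what `--mode vector` combined with `--top-k` computes, the sketch below ranks a few toy vectors by cosine similarity to a query vector. Cosine similarity is only one common choice; this README does not say which distance metric the Qdrant collection uses, and the helper names and 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, top_k=10):
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; real embedding vectors are much longer.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
hits = top_k_by_cosine([1.0, 0.1, 0.0], doc_vecs, top_k=2)
print(hits)
```

A vector database replaces the exhaustive loop here with an approximate nearest-neighbor index, but the ranking criterion is the same idea.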

### embed-test Command

| Parameter      | Description              | Default              |
| -------------- | ------------------------ | -------------------- |
| `--provider`   | Embedding model provider | ollama               |
| `--model-name` | Model name               | qwen3-embedding:0.6b |

### info Command

Shows the vector count and configuration of the current collection.

### reset Command

| Parameter    | Description             |
| ------------ | ----------------------- |
| `--recreate` | Recreate the collection |

### Search Modes

- **vector**: Vector search based on semantic similarity
- **text**: Text search using BM25
- **hybrid**: Hybrid search that fuses vector and BM25 results with RRF
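
For readers unfamiliar with the BM25 scoring behind `text` mode, here is a small self-contained sketch of the Okapi BM25 formula over pre-tokenized documents. The parameters `k1=1.5` and `b=0.75` are typical textbook values, not settings taken from boom-rag, and the sample documents are invented. How the system tokenizes text (for example, Chinese word segmentation) is not described here, so the sketch takes token lists as input.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    ["fed", "rate", "cut", "impact"],
    ["ev", "industry", "trend"],
    ["rate", "outlook"],
]
print(bm25_scores(["rate", "cut"], docs))
```

In this toy corpus the first document matches both query terms and scores highest, the third matches only `rate`, and the second matches nothing and scores zero.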