# pyvastbase2

**Repository Path**: dbalyy_dbalyy/pyvastbase2

## Basic Information

- **Project Name**: pyvastbase2
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-14
- **Last Updated**: 2026-05-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Pyvastbase2

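Pyvastbase2 is a vector store library built on SQLAlchemy and SQL for storing and retrieving vector data in databases that support vector types. It has no dependency on LangChain and provides standalone vector storage and retrieval, along with advanced features such as custom table fields, full-text search, sparse vectors, and index management.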

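## Key Features

- **Vector storage and retrieval**: vector search with multiple distance strategies
- **Full-text search**: text-based search, enabled with the `enable_fulltext` parameter
- **Sparse vectors**: storage and retrieval of sparse vectors, enabled with the `enable_sparse` parameter
- **Hybrid search**: combines vector, full-text, and sparse vector search
- **Custom table fields**: lets users define additional table columns
- **Index management**: vector, metadata, full-text, and sparse vector indexes
- **Dictionary management**: create and use custom text search dictionaries
- **Asynchronous operations**: asynchronous database operations
- **FulltextManager**: dedicated manager for full-text indexes and dictionaries
- **SparseManager**: dedicated manager for sparse vector indexes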

## Installation

### Install from Source

```bash
# Clone the repository
git clone https://gitee.com/dbalyy_dbalyy/pyvastbase2.git
cd pyvastbase2

# Install dependencies
pip install -r requirements.txt

# Install from source in editable mode
pip install -e .
```

### Build a Wheel and Install It

```bash
# Build the wheel (requires the `build` package: pip install build)
python -m build

# Install the generated wheel file
pip install dist/pyvastbase2-*.whl
```

## Quick Start

```python
from pyvastbase2 import VBVector, Document, Embeddings, DistanceStrategy

# Custom embedding model
class MyEmbeddings(Embeddings):
    def embed_documents(self, texts):
        # Placeholder: replace with real document embeddings
        return [[0.1] * 1536 for _ in texts]

    def embed_query(self, text):
        # Placeholder: replace with a real query embedding
        return [0.1] * 1536

# Create the vector store
store = VBVector(
    embeddings=MyEmbeddings(),
    connection="postgresql+psycopg://test:Abc.1234@1.95.206.159:5432/test",
    collection_name="my_collection",
    embedding_length=1536,
    distance_strategy=DistanceStrategy.COSINE
)

# Add documents
store.add_texts(
    texts=["Hello world", "Test document"],
    metadatas=[{"source": "test1"}, {"source": "test2"}]
)

# Search
results = store.similarity_search("query", k=4)
print(results)
```
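
The placeholder embedder above returns the same constant vector for every text, so every similarity score ties. For local experiments it can help to have an embedder that produces distinct vectors. The following is a minimal sketch, assuming only the two-method interface shown above (`embed_documents` / `embed_query`). The `HashingEmbeddings` name and the token-hashing scheme are illustrative and not part of Pyvastbase2; the class deliberately does not subclass the library's `Embeddings` so that it runs with the standard library alone.

```python
import hashlib
import math


class HashingEmbeddings:
    """Toy embedder (hypothetical): hashes each token into one of `dim` buckets."""

    def __init__(self, dim=1536):
        self.dim = dim

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # A stable hash keeps results identical across processes.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        # L2-normalize so cosine similarity reduces to a dot product.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)


embedder = HashingEmbeddings(dim=1536)
doc_vecs = embedder.embed_documents(["Hello world", "Test document"])
query_vec = embedder.embed_query("hello")

# Dot product of unit vectors equals cosine similarity.
scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
print(scores)
```

To use something like this with `VBVector`, it would presumably need to inherit from the library's `Embeddings` base class, as `MyEmbeddings` does above. A hashing embedder carries no semantic meaning; it only makes scores differ between documents.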

## Detailed Usage

### 1. Initializing VBVector

```python
from pyvastbase2 import VBVector, Embeddings, DistanceStrategy

# Custom embedding model
class MyEmbeddings(Embeddings):
    def embed_documents(self, texts):
        return [[0.1] * 1536 for _ in texts]

    def embed_query(self, text):
        return [0.1] * 1536

# Initialize the vector store (with sparse vectors enabled)
store = VBVector(
    embeddings=MyEmbeddings(),
    connection="postgresql+psycopg://test:Abc.1234@1.95.206.159:5432/test",
    collection_name="my_collection",
    embedding_length=1536,
    distance_strategy=DistanceStrategy.COSINE,
    pre_delete_collection=False,
    async_mode=False,
    index_type='graph_index',
    index_params={"m": 16, "ef_construction": 128},
    index_search_params={"hnsw_ef_search": 100},
    enable_fulltext=True,
    fulltext_fields=["document"],
    fulltext_dict="cn_tokenizer",
    fulltext_algorithm="bm25",
    fulltext_coefficients={"b": 0.75, "k": 1.2},
    enable_sparse=True,           # enable sparse vectors
    sparse_max_dim=10000,         # maximum sparse vector dimension
    sparse_field="svec",          # sparse vector column name
    custom_fields={"category": "VARCHAR(50)", "timestamp": "TIMESTAMP"}
)
```
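
The connection string above embeds the database password directly in source code. One common alternative is to assemble the URL from environment variables and URL-encode the credentials. The sketch below uses only the standard library; the `VB_*` variable names and the `build_connection_url` helper are hypothetical and not part of Pyvastbase2.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(env=os.environ):
    """Assemble a SQLAlchemy-style URL from environment variables.

    The VB_* variable names are hypothetical; pick whatever fits
    your deployment.
    """
    user = env.get("VB_USER", "test")
    password = env.get("VB_PASSWORD", "")
    host = env.get("VB_HOST", "localhost")
    port = env.get("VB_PORT", "5432")
    database = env.get("VB_DATABASE", "test")
    # quote_plus escapes characters such as '@', ':' and '/' that
    # would otherwise break URL parsing.
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


url = build_connection_url({"VB_USER": "test", "VB_PASSWORD": "p@ss/word"})
print(url)  # postgresql+psycopg://test:p%40ss%2Fword@localhost:5432/test
```

The resulting string can then be passed as the `connection` argument shown above.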

### 2. Adding Documents (with Sparse Vectors)

```python
# Add texts (with sparse vectors)
texts = ["Hello world", "Test document"]
sparse_vectors = [
    {0: 1.0, 2: 2.5, 5: 3.7},     # sparse vector 1
    {1: 0.5, 3: 4.2, 7: 1.8},     # sparse vector 2
]

ids = store.add_texts(
    texts=texts,
    metadatas=[{"source": "test1"}, {"source": "test2"}],
    sparse_vectors=sparse_vectors
)
print(f"Added documents with IDs: {ids}")

# Add texts asynchronously
import asyncio

async def add_texts_async():
    ids = await store.aadd_texts(
        texts=["Async test", "Another async test"],
        metadatas=[{"source": "async1"}, {"source": "async2"}],
        sparse_vectors=[{0: 1.0}, {1: 2.0}]
    )
    print(f"Added documents with IDs: {ids}")

asyncio.run(add_texts_async())
```
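
The examples above pass sparse vectors as `{index: weight}` dictionaries but do not show where they come from. As a rough illustration, the sketch below derives such a dictionary from raw text with the hashing trick, keeping every index below a `max_dim` bound like the `sparse_max_dim=10000` setting used earlier. The `text_to_sparse_vector` helper and the use of term frequency as the weight are assumptions for illustration; real pipelines typically use TF-IDF or a learned sparse encoder.

```python
import hashlib
from collections import Counter


def text_to_sparse_vector(text, max_dim=10000):
    """Map whitespace tokens to {dimension_index: weight} (hypothetical helper).

    Assumption: raw term frequency is used as the weight.
    """
    counts = Counter(text.lower().split())
    sparse = {}
    for token, count in counts.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % max_dim
        # Colliding tokens share a dimension, so their weights add up.
        sparse[index] = sparse.get(index, 0.0) + float(count)
    return sparse


texts = ["Hello world", "Test document test"]
sparse_vectors = [text_to_sparse_vector(t, max_dim=10000) for t in texts]
print(sparse_vectors)
```

The resulting list has the same shape as the `sparse_vectors` argument passed to `add_texts` above.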

### 3. Vector Search

```python
# Similarity search
results = store.similarity_search("query", k=4)
for doc in results:
    print(f"Document: {doc.page_content}, Metadata: {doc.metadata}")

# Similarity search that also returns scores
results_with_score = store.similarity_search_with_score("query", k=4)
for doc, score in results_with_score:
    print(f"Document: {doc.page_content}, Score: {score}")

# Similarity search by vector
query_embedding = [0.1] * 1536
results = store.similarity_search_by_vector(query_embedding, k=4)

# Asynchronous similarity search
import asyncio

async def search_async():
    results = await store.asimilarity_search("query", k=4)
    for doc in results:
        print(f"Document: {doc.page_content}, Metadata: {doc.metadata}")

asyncio.run(search_async())
```
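
To make the notion of a distance strategy concrete, here is a brute-force sketch of what a top-k similarity search computes conceptually. Only `DistanceStrategy.COSINE` appears in this README; the Euclidean function is included as a typical alternative and is an assumption, as are all function names below. The database performs this work with an index rather than a linear scan.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k, distance=cosine_distance):
    """Brute-force nearest neighbours: (index, distance) pairs, closest first."""
    scored = [(i, distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([1.0, 0.1], vectors, k=2)
print([index for index, _ in nearest])  # [0, 2]
```

With a distance, smaller values mean closer matches; whether the scores returned by `similarity_search_with_score` are distances or similarities is not stated here, so check before sorting or thresholding on them.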

### 4. Sparse Vector Search

```python
# Prepare the query sparse vector
query_sparse_vector = {0: 1.0, 2: 2.0, 5: 3.0}

# Run the sparse vector search
results = store.sparse_search(
    sparse_vector=query_sparse_vector,
    k=10,
    filter={"category": "technology"},
    with_score=True
)

# Print the results
for doc, score in results:
    print(f"ID: {doc.id}, Score: {score}, Content: {doc.page_content[:50]}")
```
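
This README does not say how sparse vectors are scored. A common choice is the inner product over shared dimensions, sketched below on the sample vectors used in this document. Treat the scoring function as an assumption rather than a description of what `sparse_search` does internally; `sparse_dot` is a hypothetical helper.

```python
def sparse_dot(a, b):
    """Inner product of two {index: weight} sparse vectors."""
    # Iterate over the smaller dict; only shared indices contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[index] for index, weight in a.items() if index in b)


query = {0: 1.0, 2: 2.0, 5: 3.0}
docs = {
    "doc1": {0: 1.0, 2: 2.5, 5: 3.7},
    "doc2": {1: 0.5, 3: 4.2, 7: 1.8},
}

# Rank documents by descending inner product with the query.
ranked = sorted(docs, key=lambda name: sparse_dot(query, docs[name]), reverse=True)
print(ranked)  # ['doc1', 'doc2']
```

Here `doc2` shares no dimension with the query, so its score is zero and it ranks last.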

### 5. Full-Text Search

```python
# Create a text search dictionary
store.create_text_search_dictionary(
    dict_name="my_dict",
    stopwords=["的", "了", "是"],
    userdict=["Pyvastbase2", "向量存储"],
    synonyms={"天津支行": ["天津市支行", "天津第一银行"], "清华大学": ["五道口男子职业技术学院", "THU"]}
)

# Create the full-text index
store.create_fulltext_index(
    fields=["document"],
    dict_names=["my_dict"],
    algorithms=["bm25"],
    coefficients=["k=1.2:b=0.75"],
    parallel_workers=4
)

# Full-text search
results = store.fulltext_search("query", k=10, with_score=True)
for doc, score in results:
    print(f"Document: {doc.page_content}, Score: {score}")
```
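
The `k` and `b` coefficients above are the usual BM25 tuning parameters: `k` controls how quickly repeated terms saturate, and `b` controls how strongly document length is normalized. The sketch below implements the textbook Okapi BM25 formula in plain Python to show their effect. Whether the database uses exactly this variant is an assumption, and whitespace tokenization stands in for the configured dictionary.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k=1.2, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # b = 0 disables length normalization; b = 1 applies it fully.
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k + 1.0) / (tf[term] + k * length_norm)
        scores.append(score)
    return scores


docs = [
    "vector store for vector search".split(),
    "full text search with a dictionary".split(),
    "sparse index management".split(),
]
scores = bm25_scores("vector search".split(), docs)
print(scores)
```

The first document matches both query terms and scores highest, the second matches only `search`, and the third matches nothing and scores zero.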

### 6. Hybrid Search

```python
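# Create the hybrid index (vector + full-text + sparse)
store.create_hybrid_index()

# Hybrid search (vector + full-text, with min-max normalization)
results = store.hybrid_search(
    query="query",
    k=4,
    vector_weight=0.5,
    fulltext_weight=0.5,
    normalize_method="min-max"
)

# Dense-sparse hybrid search (with RRF normalization)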
query_sparse_vector = {0: 1.0, 2: 2.5, 5: 3.7}
results = store.dense_sparse_hybrid_search(
    query="query",
    sparse_vector=query_sparse_vector,
    k=4,
    dense_weight=0.5,
    sparse_weight=0.5,
    normalize_method="rrf"
)

for doc, score in results:
    print(f"Document: {doc.page_content}, Normalized Score: {score}")
```
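
The two `normalize_method` values above correspond to two well-known ways of combining result lists. The sketch below shows a generic version of each: a weighted sum of min-max normalized scores, and reciprocal rank fusion (RRF), which uses only rank positions. The function names, the constant `c=60`, the treatment of missing documents as zero, and the assumption that higher scores are better are all illustrative; the library's internal implementation may differ.

```python
def min_max_normalize(scores):
    """Rescale {doc_id: score} to [0, 1]; all-equal inputs map to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, fulltext_scores,
                    vector_weight=0.5, fulltext_weight=0.5):
    """Weighted sum of min-max normalized scores; missing docs count as 0."""
    v = min_max_normalize(vector_scores)
    f = min_max_normalize(fulltext_scores)
    return {
        doc_id: vector_weight * v.get(doc_id, 0.0)
        + fulltext_weight * f.get(doc_id, 0.0)
        for doc_id in set(v) | set(f)
    }


def rrf_fusion(rankings, c=60):
    """Reciprocal rank fusion: sum 1 / (c + rank) over each ranked list."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (c + rank)
    return fused


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}    # assumed: higher = better
fulltext_scores = {"b": 12.0, "c": 3.0, "d": 6.0}

combined = weighted_fusion(vector_scores, fulltext_scores)
weighted_order = sorted(combined, key=combined.get, reverse=True)
print(weighted_order)  # ['b', 'a', 'd', 'c']

rrf = rrf_fusion([["a", "b", "c"], ["b", "d", "c"]])
rrf_order = sorted(rrf, key=rrf.get, reverse=True)
print(rrf_order)  # ['b', 'c', 'a', 'd']
```

Min-max fusion is sensitive to the score scale and to outliers in each list, while RRF ignores score magnitudes entirely, which is why the two methods can order the same candidates differently, as the outputs above show.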

### 7. Index Management

```python
# List all indexes
indexes = store.index_manager.list_indexes()
print(f"Indexes: {indexes}")

# Create the metadata index
meta_index = store.index_manager.create_meta_data_index()
print(f"Created metadata index: {meta_index}")

# Create the vector index
vector_index = store.index_manager.create_vector_index(
    index_type="graph_index",
    distance_strategy=DistanceStrategy.COSINE,
    m=16,
    ef_construction=128
)
print(f"Created vector index: {vector_index}")

# Create the sparse vector index
sparse_index = store.create_sparse_index()
print(f"Created sparse index: {sparse_index}")

# Drop the vector index
store.index_manager.drop_vector_index()

# Drop the sparse vector index
store.sparse_manager.drop_sparse_index()
```

### 8. Document Management

```python
# Fetch documents by ID
docs = store.get_by_ids(["id1", "id2"])
for doc in docs:
    print(f"Document: {doc.page_content}, Metadata: {doc.metadata}")

# Delete documents by ID
store.delete(ids=["id1", "id2"])

# Delete documents that match a filter
store.delete(filter={"source": "test"})

# Delete the collection
store.delete_collection()
```

### 9. Dictionary Management

```python
# Create a dictionary
store.create_text_search_dictionary(
    dict_name="my_dict",
    stopwords=["的", "了", "是"],
    userdict=["Pyvastbase2", "向量存储"],
    synonyms={"天津支行": ["天津市支行", "天津第一银行"]}
)

# Drop the dictionary
store.drop_text_search_dictionary("my_dict")
```

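## Core Methods

### VBVector Methods

#### Initialization

- **`__init__`**: initializes a VBVector instance
- **`__post_init__`**: initializes the store synchronously
- **`__apost_init__`**: initializes the store asynchronously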

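#### Document Operations

- **add_texts**: adds texts to the vector store
- **aadd_texts**: asynchronously adds texts to the vector store
- **add_embeddings**: adds texts together with their embedding vectors to the store
- **aadd_embeddings**: asynchronously adds texts together with their embedding vectors to the store
- **delete**: deletes documents
- **adelete**: asynchronously deletes documents
- **get_by_ids**: fetches documents by ID
- **aget_by_ids**: asynchronously fetches documents by ID
- **delete_collection**: deletes the collection and all of its data
- **adelete_collection**: asynchronously deletes the collection and all of its data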

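#### Search Methods

- **similarity_search**: similarity search
- **asimilarity_search**: asynchronous similarity search
- **similarity_search_with_score**: similarity search that also returns scores
- **asimilarity_search_with_score**: asynchronous similarity search that also returns scores
- **similarity_search_with_score_by_vector**: similarity search by vector that also returns scores
- **asimilarity_search_with_score_by_vector**: asynchronous similarity search by vector that also returns scores
- **similarity_search_by_vector**: similarity search by vector
- **asimilarity_search_by_vector**: asynchronous similarity search by vector
- **fulltext_search**: full-text search
- **sparse_search**: sparse vector search
- **hybrid_search**: hybrid search (vector + full-text)
- **dense_sparse_hybrid_search**: dense-sparse hybrid search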

#### Index Management

- **create_fulltext_index**: creates a full-text search index
- **create_sparse_index**: creates a sparse vector index
- **create_hybrid_index**: creates a hybrid search index

#### Dictionary Management

- **create_text_search_dictionary**: creates a text search dictionary
- **drop_text_search_dictionary**: drops a text search dictionary

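#### Utility Methods

- **calculate_distance**: computes the distance between two vectors
- **_select_relevance_score_fn**: selects the relevance scoring function
- **_normalize_scores_min_max**: min-max normalization
- **_normalize_scores_rrf**: RRF normalization
- **_combine_results**: fuses result lists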

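### IndexManager Methods

- **list_indexes**: lists all indexes on the embedding column
- **alist_indexes**: asynchronously lists all indexes on the embedding column
- **create_meta_data_index**: creates a GIN index on the meta_data column
- **acreate_meta_data_index**: asynchronously creates a GIN index on the meta_data column
- **create_vector_index**: creates an index on the embedding column
- **acreate_vector_index**: asynchronously creates an index on the embedding column
- **drop_vector_index**: drops the vector index
- **adrop_vector_index**: asynchronously drops the vector index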

### FulltextManager Methods

- **create_text_search_dictionary**: creates a text search dictionary
- **drop_text_search_dictionary**: drops a text search dictionary
- **create_fulltext_index**: creates a full-text search index

### SparseManager Methods

- **create_sparse_index**: creates a sparse vector index
- **acreate_sparse_index**: asynchronously creates a sparse vector index
- **drop_sparse_index**: drops the sparse vector index

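## Class Methods

### VBVector Class Methods

- **from_texts**: creates a vector store from a list of texts
- **from_documents**: creates a vector store from a list of documents
- **from_embeddings**: creates a vector store from texts and embeddings
- **afrom_embeddings**: asynchronously creates a vector store from texts and embeddings
- **afrom_texts**: asynchronously creates a vector store from a list of texts
- **afrom_documents**: asynchronously creates a vector store from a list of documents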

## Dependencies

Pyvastbase2 depends on the following libraries:

- numpy
- sqlalchemy
- psycopg
- psycopg2 (Windows)
- psycopg2-binary (Linux)

## Version

Current version: 1.0.0

## Author

The Pyvastbase2 development team

## Documentation

- [Dense-Sparse Hybrid Search User Guide](稠密稀疏混合检索使用手册.md)
- [VexDB Vector Search Usage Guide](VexDB向量检索使用说明文档.md)