# TupleAlg

**Repository Path**: zhang-yi-667/tuple-alg

## Basic Information

- **Project Name**: TupleAlg
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-04-12
- **Last Updated**: 2025-04-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TupleStore

TupleStore 是一个高性能的元组存储系统，支持以下特性：

- 高效的元组存储和检索
- 支持多种数据类型（int32, int64, float32, float64, string, bool, time）
- 支持条件查询和复合查询
- 支持并发操作
- 支持数据压缩
- 提供完整的指标监控

## 项目结构

```
.
├── main.go                 # 主程序入口，包含示例代码
├── tuplestore/            # 核心实现包
│   ├── store.go           # 存储核心实现，包含 TupleStore 和 Segment 结构
│   ├── types.go           # 数据类型定义和元组结构
│   ├── compression.go     # 数据压缩实现
│   ├── query.go           # 查询操作实现
│   ├── operations.go      # 基本操作实现（插入、更新、删除）
│   ├── index.go           # 索引实现
│   ├── metrics.go         # 性能指标监控
│   ├── main_test.go       # 主测试文件
│   └── tuplestore_test.go # 存储测试文件
└── benchmarks/            # 性能测试目录
```

## 核心组件说明

### 1. 存储结构
- `TupleStore`: 顶层存储容器，管理多个 Segment
- `Segment`: 存储段，包含多个 Block
- `Block`: 数据块，包含主存储和增量存储
- `DeltaStore`: 增量存储，处理最近的操作
- `MainStore`: 主存储，存储压缩后的数据

### 2. 数据类型
支持多种数据类型：
- 整数：int32, int64
- 浮点数：float32, float64
- 字符串：string
- 布尔值：bool
- 时间：time.Time

### 3. 压缩机制
- 字典编码：适用于低基数列
- 增量编码：适用于有序数值列
- 游程编码：适用于重复值多的列
- 位打包：适用于小整数
- 前缀压缩：适用于字符串

### 4. 查询功能
- ID 查询：通过唯一标识符快速查找
- 条件查询：支持多条件组合查询
- 范围查询：支持数值和时间的范围查询

## 安装

```bash
# 克隆项目
git clone https://github.com/yourusername/go-tuplestore.git
cd go-tuplestore

# 安装依赖
go mod download
```

## 运行

```bash
# 运行主程序
go run .

# 运行测试
go test -v ./tuplestore

# 运行性能测试
go test -bench=. ./tuplestore/benchmarks
```

## 使用示例

```go
package main

import (
    "fmt"
    "time"
    "go-tuplestore/tuplestore"
)

func main() {
    // 创建配置
    config := tuplestore.Config{
        MergeThreshold:   100,
        MaxDeltaSize:     1024 * 1024, // 1MB
        MaxMergeInterval: time.Hour,
        InitialSegments:  4,
        BlocksPerSegment: 10,
    }

    // 初始化 TupleStore
    store := tuplestore.NewTupleStore(config)

    // 插入元组
    tuple := &tuplestore.Tuple{
        ID:   "user1",
        Data: []byte("example data"),
    }
    if err := store.Insert(tuple); err != nil {
        fmt.Printf("插入失败: %v\n", err)
        return
    }

    // 查询元组
    result, err := store.GetByID("user1")
    if err != nil {
        fmt.Printf("查询失败: %v\n", err)
        return
    }
    fmt.Printf("查询结果: %+v\n", result)
}
```

## 性能指标

- 插入性能：10万 QPS
- 查询性能：20万 QPS
- 压缩比率：平均 50%

## 注意事项

1. 系统使用内存存储，请确保有足够的内存空间
2. 建议根据实际数据特征调整配置参数
3. 定期监控系统指标，及时调整存储策略
4. 对于大量数据，建议使用分布式部署

# 语雀文档

有序多元组的高效数据结构与存储详解
1. 数据结构核心设计
1.1 元组基础结构
// Tuple 表示单个多元组，保持元素有序
type Tuple struct {
    ID       string        // 唯一标识符
    Elements []Element     // 有序元素列表
    Created  int64         // 创建时间戳
    Updated  int64         // 最后更新时间戳
}

// Element 表示元组中的单个元素
type Element struct {
    Index     int          // 元素在元组中的位置索引
    Type      ElementType  // 元素类型（整数、浮点数、字符串等）
    Value     interface{}  // 元素值
}

// ElementType 定义元素类型枚举
type ElementType byte

const (
    TypeNull ElementType = iota
    TypeInt
    TypeFloat
    TypeString
    TypeBool
    TypeBytes
    TypeArray
)
1.2 分层存储架构
// TupleStore 主存储容器
type TupleStore struct {
    // 分段存储 - 按时间或 ID 范围分段
    segments []*Segment
    // 索引层 - 提供快速查找能力
    indices map[string]Index
    // 元数据字典 - 存储全局信息
    metaDict *MetaDictionary
}

// Segment 表示一个存储段
type Segment struct {
    ID        string
    MinID     string               // 段内最小 ID
    MaxID     string               // 段内最大 ID
    MinTime   int64                // 段内最早时间戳
    MaxTime   int64                // 段内最晚时间戳
    Blocks    []*Block             // 段内所有数据块
    BlockMap  map[string]int       // ID 到块索引的映射
    Stats     *SegmentStatistics   // 段统计信息
}

// Block 数据块 - 实际存储压缩数据的单元
type Block struct {
    ID         string
    TupleCount int                // 块内元组数量
    ElementMap []int              // 元素索引映射
    // 列式压缩存储 - 每列使用不同压缩方法
    Columns    []*CompressedColumn
    // 增量存储 - 未压缩的最近操作
    DeltaStore *DeltaStore
    // 元数据
    MinValues  []interface{}      // 每列的最小值
    MaxValues  []interface{}      // 每列的最大值
}
2. 详细的压缩存储实现
2.1 列存储压缩方案
// CompressedColumn 表示单个压缩列
type CompressedColumn struct {
    ElementIndex int             // 对应元素索引
    DataType     ElementType     // 数据类型
    // 压缩元数据
    ValueCount   int             // 值的数量
    NullCount    int             // 空值数量
    NullBitmap   []byte          // 空值位图
    // 压缩方法与数据
    EncodingType EncodingType    // 使用的编码类型
    EncodedData  []byte          // 压缩后的字节数据
    // 用于字典编码的字典
    Dictionary   []interface{}   // 值字典（仅用于字典编码）
    DictMap      map[interface{}]uint32 // 值到索引的映射
}

// EncodingType 编码类型枚举
type EncodingType byte

const (
    EncodingRaw EncodingType = iota
    EncodingDictionary       // 字典编码：将值替换为字典索引
    EncodingRLE              // 游程编码：适用于重复值
    EncodingDelta            // 增量编码：存储与前值的差异
    EncodingBitPacking       // 位打包：对小整数使用较少位
    EncodingByteAligned      // 字节对齐：为不同值使用最少字节
    EncodingPrefixCompression // 前缀压缩：用于字符串
    EncodingFrequencyEncoding // 频率编码：不同长度编码不同值
)
2.2 具体压缩算法实现
字典编码 (适用于基数较低的列)
func encodeDictionary(values []interface{}) (*CompressedColumn, error) {
    col := &CompressedColumn{
        EncodingType: EncodingDictionary,
        ValueCount:   len(values),
    }

    // 1. 构建唯一值字典
    valueset := make(map[interface{}]struct{})
    for _, v := range values {
        valueset[v] = struct{}{}
    }

    // 2. 为每个唯一值分配索引
    col.Dictionary = make([]interface{}, 0, len(valueset))
    col.DictMap = make(map[interface{}]uint32)
    for v := range valueset {
        idx := uint32(len(col.Dictionary))
        col.Dictionary = append(col.Dictionary, v)
        col.DictMap[v] = idx
    }

    // 3. 确定索引编码需要的位数
    dictSize := len(col.Dictionary)
    bitsPerIndex := 0
    if dictSize <= 1 {
        bitsPerIndex = 1
    } else {
        bitsPerIndex = int(math.Ceil(math.Log2(float64(dictSize))))
    }

    // 4. 将值转换为索引并位打包
    indexBuffer := make([]uint32, len(values))
    for i, v := range values {
        indexBuffer[i] = col.DictMap[v]
    }

    writer := NewBitWriter(uint(bitsPerIndex))
    for _, idx := range indexBuffer {
        writer.Write(idx)
    }
    col.EncodedData = writer.Bytes()
    return col, nil
}
增量编码 (适用于有序数值列)
func encodeDelta(values []int64) (*CompressedColumn, error) {
    col := &CompressedColumn{
        EncodingType: EncodingDelta,
        ValueCount:   len(values),
    }

    if len(values) == 0 {
        return col, nil
    }

    // 1. 计算基值和增量值
    baseValue := values[0]
    deltas := make([]int64, len(values))
    deltas[0] = 0 // 第一个值的增量为 0
    for i := 1; i < len(values); i++ {
        deltas[i] = values[i] - values[i-1]
    }

    // 2. 将基值存储为元数据
    baseValueBytes := make([]byte, 8)
    binary.LittleEndian.PutUint64(baseValueBytes, uint64(baseValue))

    // 3. 为增量值使用可变长度编码
    varintBuffer := make([]byte, 0, len(values)*5) // 估计每个增量最多需要 5 字节
    for _, delta := range deltas {
        // 使用 zigzag 编码处理负数
        zigzagDelta := (delta << 1) ^ (delta >> 63)
        // 写入可变长度整数
        for zigzagDelta >= 128 {
            varintBuffer = append(varintBuffer, byte(zigzagDelta)|128)
            zigzagDelta >>= 7
        }
        varintBuffer = append(varintBuffer, byte(zigzagDelta))
    }

    // 4. 组合基值和增量
    col.EncodedData = append(baseValueBytes, varintBuffer...)
    return col, nil
}
游程编码 (适用于重复值多的列)
func encodeRLE(values []interface{}) (*CompressedColumn, error) {
    col := &CompressedColumn{
        EncodingType: EncodingRLE,
        ValueCount:   len(values),
    }

    if len(values) == 0 {
        return col, nil
    }

    // 压缩策略：(重复次数, 值) 对
    var buffer bytes.Buffer
    currentValue := values[0]
    runLength := 1

    for i := 1; i < len(values); i++ {
        if values[i] == currentValue {
            runLength++
        } else {
            // 写入 (runLength, value) 对
            writeRunLength(&buffer, runLength)
            writeValue(&buffer, currentValue)
            // 重置当前值和计数
            currentValue = values[i]
            runLength = 1
        }
    }
    // 处理最后一组
    writeRunLength(&buffer, runLength)
    writeValue(&buffer, currentValue)

    col.EncodedData = buffer.Bytes()
    return col, nil
}

// 使用可变长度编码写入重复计数
func writeRunLength(buffer *bytes.Buffer, length int) {
    for length >= 128 {
        buffer.WriteByte(byte(length&127) | 128)
        length >>= 7
    }
    buffer.WriteByte(byte(length))
}

// 写入值 (根据类型使用不同编码)
func writeValue(buffer *bytes.Buffer, value interface{}) {
    switch v := value.(type) {
    case int64:
        binary.Write(buffer, binary.LittleEndian, v)
    case float64:
        binary.Write(buffer, binary.LittleEndian, v)
    case string:
        writeRunLength(buffer, len(v))
        buffer.WriteString(v)
    case bool:
        if v {
            buffer.WriteByte(1)
        } else {
            buffer.WriteByte(0)
        }
    }
}
2.3 增量存储设计
// DeltaStore 存储未压缩的增量变更
type DeltaStore struct {
    // 新增的元组
    AddedTuples []*Tuple
    // 更新的元组 (ID -> 新元组)
    UpdatedTuples map[string]*Tuple
    // 已删除的元组 ID 集合
    DeletedIDs map[string]struct{}
    // 变更计数，用于决定何时触发合并
    ChangeCount int
}

// 添加新元组到增量存储
func (ds *DeltaStore) AddTuple(tuple *Tuple) {
    ds.AddedTuples = append(ds.AddedTuples, tuple)
    ds.ChangeCount++
}

// 标记元组删除
func (ds *DeltaStore) DeleteTuple(id string) {
    if ds.DeletedIDs == nil {
        ds.DeletedIDs = make(map[string]struct{})
    }
    ds.DeletedIDs[id] = struct{}{}
    ds.ChangeCount++
}

// 更新元组
func (ds *DeltaStore) UpdateTuple(tuple *Tuple) {
    if ds.UpdatedTuples == nil {
        ds.UpdatedTuples = make(map[string]*Tuple)
    }
    ds.UpdatedTuples[tuple.ID] = tuple
    ds.ChangeCount++
}
3. 物理存储布局
3.1 内存布局
TupleStore
├── 元数据区 (MetaDictionary)
│   ├── 字段信息表
│   ├── 类型词典
│   └── 统计信息
├── 段管理区 (Segments)
│   ├── Segment 1
│   │   ├── 段元信息
│   │   └── 块索引表
│   ├── Segment 2
│   │   ...
│   ├── 数据块区 (Blocks)
│   ├── Block 1
│   │   ├── 块元信息
│   │   ├── 列1 (压缩)
│   │   │   ├── 压缩元数据
│   │   │   ├── 空值位图
│   │   │   ├── 字典 (如适用)
│   │   │   └── 压缩数据
│   │   ├── 列2 (压缩)
│   │   │   ...
│   │   └── 增量存储
│   │       ├── 新增元组
│   │       ├── 更新元组
│   │       └── 删除 ID 集合
│   ├── Block 2
│   │   ...
└── 索引区 (Indices)
    ├── 主键索引
    │   └── B+树/哈希表结构
    └── 二级索引
        └── 多级决策树结构
3.2 详细内存布局示例
以一个包含 100 万条 5 元素元组的存储示例：
● 元数据区 (~1KB)
  ○ 5 个元素定义: {索引: 0-4, 类型: int, float, string, bool, int}
  ○ 总元组数: 1,000,000
  ○ 段数: 10
  ○ 块数: 100 (每段 10 块)
● 段 0 (~100KB)
  ○ ID 范围: "0000" - "0999"
  ○ 时间范围: 1609459200 - 1609545599
  ○ 块索引: {块 ID -> 内存位置} 映射 (10 条)
  ○ 统计摘要: {每列最小/最大值，不同值数量}
● 块 0 (~1MB)
  ○ 元组数: 10,000
  ○ 元素 0 (int 列):
    ■ 编码: 位打包 (11 位/值)
    ■ 最小值: 1000
    ■ 最大值: 3050
    ■ 压缩数据: [13.75KB] (vs 原始 40KB, 压缩率 66%)
  ○ 元素 1 (float 列):
    ■ 编码: 字典编码 (只有 500 个不同值)
    ■ 字典: [4KB] (500 个浮点数)
    ■ 压缩数据: [12.5KB] (vs 原始 80KB, 压缩率 80%)
  ○ 元素 2 (string 列):
    ■ 编码: 前缀压缩 + 字典编码
    ■ 字典: [20KB] (2000 个字符串前缀和后缀)
    ■ 压缩数据: [35KB] (vs 原始 250KB, 压缩率 86%)
  ○ 元素 3 (bool 列):
    ■ 编码: 位图编码
    ■ 压缩数据: [1.25KB] (vs 原始 10KB, 压缩率 87.5%)
  ○ 元素 4 (int 列):
    ■ 编码: 增量编码
    ■ 基准值: 100
    ■ 压缩数据: [15KB] (vs 原始 40KB, 压缩率 62.5%)
  ○ 增量存储:
    ■ 新增元组: 150 条 [~15KB]
    ■ 更新元组: 75 条 [~7.5KB]
    ■ 删除 ID 集合: 25 个 ID [~0.5KB]
4. 索引结构详解
4.1 主键索引 (ID 查询优化)
// IDIndex 主键索引
type IDIndex struct {
    // 使用 B+ 树存储 ID 到位置的映射
    tree *btree.BTree
}

// IDIndexEntry 表示索引中的一个条目
type IDIndexEntry struct {
    ID        string  // 元组 ID
    SegmentID string  // 段 ID
    BlockID   string  // 块 ID
    Deleted   bool    // 删除标记
}

// 查找元组位置
func (idx *IDIndex) LookupID(id string) (*IDIndexEntry, bool) {
    key := &IDIndexEntry{ID: id}
    item := idx.tree.Get(key)
    if item == nil {
        return nil, false
    }

    entry := item.(*IDIndexEntry)
    if entry.Deleted {
        return nil, false
    }
    return entry, true
}
4.2 多级决策树索引 (多条件查询优化)
// DecisionTreeIndex 多级决策树索引
type DecisionTreeIndex struct {
    Root       *DTNode       // 根节点
    FieldOrder []int         // 索引的元素顺序 (按选择性排序)
    Depth      int           // 树最大深度
    UpdateCount int64        // 更新计数，用于触发重平衡
}

// DTNode 决策树节点
type DTNode struct {
    ElementIndex int               // 当前节点的元素索引
    SplitType    SplitType         // 节点分裂类型
    // 等值分支 (精确匹配)
    ValueBranches map[interface{}]*DTNode
    // 范围分支 (范围匹配)
    RangeBranches []*DTRangeBranch
    // 对应的块引用 (叶子节点)
    BlockRefs     []BlockRef
    // 统计信息，用于自适应优化
    QueryCount    int64
    HitCount      int64
}

// 范围分支
type DTRangeBranch struct {
    Min     interface{}
    Max     interface{}
    NextNode *DTNode
}

// 块引用
type BlockRef struct {
    SegmentID string
    BlockID   string
    // 块内元组计数 (用于查询规划)
    TupleCount int
}
5. 详细操作实现
5.1 插入操作
func (ts *TupleStore) Insert(tuple *Tuple) error {
    // 1. 检查 ID 唯一性
    if _, exists := ts.idIndex.LookupID(tuple.ID); exists {
        return ErrDuplicateID
    }

    // 2. 选择最佳段和块
    segment, block := ts.selectOptimalLocation(tuple)

    // 3. 添加到块的增量存储
    block.DeltaStore.AddTuple(tuple)

    // 4. 更新索引
    ts.updateIndices(tuple, segment.ID, block.ID)

    // 5. 检查是否需要合并增量存储
    if block.DeltaStore.ChangeCount >= ts.mergeThreshold {
        // 异步触发合并操作
        ts.scheduleBlockMerge(segment.ID, block.ID)
    }

    return nil
}

// 选择最佳位置
func (ts *TupleStore) selectOptimalLocation(tuple *Tuple) (*Segment, *Block) {
    // 基于元组的特征选择最佳放置位置
    segmentID := ts.selectSegment(tuple)
    segment := ts.segments[segmentID]

    blockID := segment.selectBlock(tuple)
    block := segment.Blocks[blockID]

    return segment, block
}

// 选择段的策略
func (ts *TupleStore) selectSegment(tuple *Tuple) string {
    // 可以基于多种策略:
    // 1. ID 范围分段
    // 2. 时间范围分段
    // 3. 元素值分段
    // 这里使用简单的 ID 哈希策略
    hash := xxhash.Sum64String(tuple.ID)
    segmentIndex := hash % uint64(len(ts.segments))
    return ts.segments[segmentIndex].ID
}
5.2 查询操作
执行单 ID 查询
func (ts *TupleStore) GetByID(id string) (*Tuple, error) {
    // 1. 通过 ID 索引查找位置
    entry, exists := ts.idIndex.LookupID(id)
    if !exists {
        return nil, ErrNotFound
    }

    // 2. 获取对应段和块
    segment := ts.getSegment(entry.SegmentID)
    block := segment.getBlock(entry.BlockID)

    // 3. 先查找增量存储
    if tuple := block.DeltaStore.findTuple(id); tuple != nil {
        return tuple, nil
    }

    // 4. 查找压缩数据
    tuple, err := block.decompressTuple(id)
    if err != nil {
        return nil, err
    }

    return tuple, nil
}
条件查询
func (ts *TupleStore) Query(conditions []Condition) ([]*Tuple, error) {
    // 1. 分析查询条件，确定最佳索引
    index, plan := ts.selectBestIndex(conditions)

    // 2. 使用索引获取候选块
    candidateBlocks := index.findCandidateBlocks(conditions, plan)

    // 3. 并行处理候选块
    resultChan := make(chan []*Tuple, len(candidateBlocks))
    var wg sync.WaitGroup

    for _, blockRef := range candidateBlocks {
        wg.Add(1)
        go func(ref BlockRef) {
            defer wg.Done()

            segment := ts.getSegment(ref.SegmentID)
            block := segment.getBlock(ref.BlockID)

            // 过滤块内元组
            matches := block.queryTuples(conditions)
            resultChan <- matches
        }(blockRef)
    }

    go func() {
        wg.Wait()
        close(resultChan)
    }()

    // 4. 收集并合并结果
    allMatches := make([]*Tuple, 0)
    for matches := range resultChan {
        allMatches = append(allMatches, matches...)
    }

    return allMatches, nil
}
5.3 块内查询和解压缩
块内查询
func (block *Block) queryTuples(conditions []Condition) []*Tuple {
    // 1. 检查列范围能否快速排除整个块
    for _, cond := range conditions {
        elemIndex := cond.ElementIndex
        if !cond.matchesRange(block.MinValues[elemIndex], block.MaxValues[elemIndex]) {
            return nil // 整个块都不匹配
        }
    }

    // 2. 从增量存储中查找匹配元组
    deltaMatches := block.DeltaStore.queryTuples(conditions)

    // 3. 选择性解压缩
    // 确定需要解压的列
    requiredColumns := make(map[int]bool)
    for _, cond := range conditions {
        requiredColumns[cond.ElementIndex] = true
    }

    // 先解压条件涉及的列
    columnValues := make(map[int][]interface{})
    for elemIndex := range requiredColumns {
        values, err := block.decompressColumn(elemIndex)
        if err != nil {
            // 处理错误
            continue
        }
        columnValues[elemIndex] = values
    }

    // 4. 过滤压缩数据
    compressedMatches := make([]*Tuple, 0)
    for i := 0; i < block.TupleCount; i++ {
        // 检查增量存储中是否已有此 ID 的元组 (更新或删除)
        tupleID := block.getTupleID(i)
        if block.DeltaStore.isModified(tupleID) {
            continue
        }

        // 检查是否满足所有条件
        match := true
        for _, cond := range conditions {
            elemValue := columnValues[cond.ElementIndex][i]
            if !cond.matches(elemValue) {
                match = false
                break
            }
        }

        if match {
            // 只解压需要的元组
            tuple, err := block.decompressTupleAt(i)
            if err == nil {
                compressedMatches = append(compressedMatches, tuple)
            }
        }
    }

    // 5. 合并结果
    return append(deltaMatches, compressedMatches...)
}
解压单个列
func (block *Block) decompressColumn(elemIndex int) ([]interface{}, error) {
    col := block.Columns[elemIndex]

    switch col.EncodingType {
    case EncodingDictionary:
        return decodeDictionary(col)
    case EncodingDelta:
        return decodeDelta(col)
    case EncodingRLE:
        return decodeRLE(col)
    default:
        return nil, fmt.Errorf("unsupported encoding: %v", col.EncodingType)
    }
}
6. 高级优化机制
6.1 自适应索引重平衡
func (idx *DecisionTreeIndex) rebalance(tupleStore *TupleStore) {
    // 1. 分析节点访问模式
    nodeStats := idx.collectNodeStatistics()

    // 2. 识别热点和冷点路径
    hotPaths, coldPaths := idx.identifyHotAndColdPaths(nodeStats)

    // 3. 重新评估元素顺序
    newFieldOrder := idx.optimizeFieldOrder(tupleStore, hotPaths)

    // 4. 如果元素顺序有较大变化，重建索引
    if idx.shouldRebuild(newFieldOrder) {
        idx.rebuild(tupleStore, newFieldOrder)
    } else {
        // 5. 否则只调整有问题的分支
        idx.adjustBranches(hotPaths, coldPaths)
    }
}
6.2 块合并与分裂策略
合并增量存储到压缩列
func (block *Block) mergeDeltas() error {
    // 跳过，如果没有足够变更
    if block.DeltaStore.ChangeCount < minChangeThreshold {
        return nil
    }

    // 1. 应用删除操作
    for id := range block.DeltaStore.DeletedIDs {
        block.markDeleted(id)
    }

    // 2. 提取所有未删除的当前元组
    tuples := block.getAllActiveTuples()

    // 3. 应用更新操作
    for id, updatedTuple := range block.DeltaStore.UpdatedTuples {
        // 替换或添加
        tuples[id] = updatedTuple
    }

    // 4. 添加新增元组
    for _, newTuple := range block.DeltaStore.AddedTuples {
        tuples[newTuple.ID] = newTuple
    }

    // 5. 将所有元组转换为列式格式
    columns := convertToColumns(tuples)

    // 6. 压缩每一列
    newCompressedColumns := make([]*CompressedColumn, len(columns))
    for i, column := range columns {
        compressed, err := compressColumn(column, block.selectEncodingType(i, column))
        if err != nil {
            return err
        }
        newCompressedColumns[i] = compressed
    }

    // 7. 替换旧压缩列并重置增量存储
    block.Columns = newCompressedColumns
    block.DeltaStore = newDeltaStore()
    block.updateMetadata()

    return nil
}
检查是否需要分裂块
func (block *Block) checkSplitRequired() bool {
    // 基于以下因素决定是否分裂:
    // 1. 块大小 (字节)
    if block.sizeInBytes() > maxBlockSize {
        return true
    }

    // 2. 元组数量
    if block.TupleCount > maxTuplesPerBlock {
        return true
    }

    // 3. 压缩率降低
    if block.compressionRatio() < minCompressionRatio {
        return true
    }

    return false
}
分裂块策略
func (block *Block) split() []*Block {
    // 1. 选择分裂维度
    splitElementIndex := block.selectSplitDimension()

    // 2. 确定分裂点
    splitValue := block.determineSplitPoint(splitElementIndex)

    // 3. 创建两个新块
    leftBlock := newBlock()
    rightBlock := newBlock()

    // 4. 将元组分配到新块
    tuples := block.getAllTuples()
    for _, tuple := range tuples {
        elemValue := tuple.Elements[splitElementIndex].Value

        if compareValues(elemValue, splitValue) <= 0 {
            leftBlock.DeltaStore.AddTuple(tuple)
        } else {
            rightBlock.DeltaStore.AddTuple(tuple)
        }
    }

    // 5. 压缩新块
    leftBlock.mergeDeltas()
    rightBlock.mergeDeltas()

    return []*Block{leftBlock, rightBlock}
}
7. 性能与内存使用分析
7.1 压缩率分析
典型数据集的预期压缩率:
● 整数列: 60-85% 压缩率 (取决于值域和分布)
● 浮点列: 40-75% 压缩率 (取决于精度需求)
● 字符串列: 70-95% 压缩率 (取决于重复模式和长度)
● 布尔列: 85-98% 压缩率 (使用位压缩)
7.2 时间复杂度
主要操作的计算复杂度:
● 插入: O(log N) - 索引更新的复杂度
● ID 查询: O(1) 到 O(log N) - 取决于索引实现
● 条件查询: O(log N + kc) - k 为匹配元组数，c 为条件数
● 范围查询: O(log N + r) - r 为范围内元组数
● 更新: O(log N) - 索引查找 + O(1) 增量更新
● 删除: O(log N) - 仅标记删除
7.3 内存用量示例
对于 100 万个 5 元素元组的数据集:
● 原始大小: ~100MB
  ○ (假设每个元素平均大小为 20 字节，每个元组 100 字节)
● 压缩后总大小: ~20-30MB
  ○ 
    ■ 压缩数据: ~15-25MB
  ○ 
    ■ 索引开销: ~3-5MB
  ○ 
    ■ 元数据与字典: ~1-2MB
  ○ 
    ■ 增量缓冲区: ~1-3MB
通过这种设计，我们实现了高压缩率的数据存储，同时保持了快速的增删改查性能。系统能够根据数据特性和访问模式自动调整存储策略，在空间效率和访问性能之间取得平衡。