# SLiCE

**Repository Path**: mirrors_microsoft/SLiCE

## Basic Information

- **Project Name**: SLiCE
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-13
- **Last Updated**: 2025-09-13

## README

# SLiCE: Schema Lineage Composite Evaluation

[![PyPI version](https://img.shields.io/pypi/v/slice-score.svg)](https://pypi.org/project/slice-score/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper: ArXiv](https://img.shields.io/badge/Paper-ArXiv-red.svg)](https://arxiv.org/abs/2508.07179)

SLiCE is a Python package for evaluating the accuracy of schema lineage extraction by comparing model predictions against gold-standard annotations. It scores each lineage component separately (source schema, source tables, transformations, and aggregations) and combines the results into a single composite score.

## Features

- **Component-wise Evaluation**: Separate scoring for source schema, source tables, transformations, and aggregations
- **Multiple Similarity Metrics**: BLEU scores, fuzzy matching, F1 scores, and AST-based similarity
- **Flexible Weighting**: Customizable weights for different components and metrics
- **Multi-language Support**: Handles Python, SQL, and C# code in transformations
- **Sample Data Module**: Built-in access to curated datasets for testing and demonstration
- **Batch Processing**: Parallel evaluation of multiple lineage pairs
- **Command Line Interface**: Easy-to-use CLI for quick evaluations

## Installation

### From PyPI (recommended)

```bash
pip install slice-score
```

### From Source

```bash
git clone https://github.com/microsoft/SLiCE.git
cd SLiCE
pip install -e .
```

### Development Installation

For development with all testing and linting tools:

```bash
git clone https://github.com/microsoft/SLiCE.git
cd SLiCE

# Using pip
pip install -e ".[dev]"

# Using uv (recommended - faster)
uv sync --extra dev
```

## Quick Start

### Python API

```python
from slice import SchemaLineageEvaluator

# Initialize evaluator
evaluator = SchemaLineageEvaluator()

# Example lineage data
predicted = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType", 
    "aggregation": "COUNT() GROUP BY restaurant_id"
}

ground_truth = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss", 
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": ""
}

# Evaluate
results = evaluator.evaluate(predicted, ground_truth)
print(f"Overall Score: {results['overall']:.4f}")
```

### Command Line Interface

```bash
# Basic evaluation
slice-eval predicted.json ground_truth.json

# With custom weights
slice-eval --weights source_table=0.5,transformation=0.3,aggregation=0.2 predicted.json ground_truth.json

# Include metadata evaluation
slice-eval --metadata predicted.json ground_truth.json

# Save results to file
slice-eval predicted.json ground_truth.json --output results.txt
```

## Data Format

SLiCE expects lineage data as dictionaries with the following structure:

```json
{
    "source_schema": "column_name",
    "source_table": "table_references",
    "transformation": "transformation_logic",
    "aggregation": "aggregation_operations",
    "metadata": "additional_metadata (optional)"
}
```
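
Before scoring, it can help to confirm that a record has the fields listed above. The following is a minimal sketch using only the standard library. The helper name `check_lineage_record`, and the rule that the four non-optional fields must be present as strings, are assumptions made for illustration and are not part of the SLiCE API.

```python
import json

# Field names taken from the structure above; "metadata" is optional.
REQUIRED_FIELDS = ("source_schema", "source_table", "transformation", "aggregation")
OPTIONAL_FIELDS = ("metadata",)


def check_lineage_record(record):
    """Return a list of problems found in a lineage record (empty if none).

    Hypothetical helper for illustration; not part of the SLiCE API.
    """
    if not isinstance(record, dict):
        return ["record is not a dictionary"]
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    for field in OPTIONAL_FIELDS:
        if field in record and not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    return problems


record = json.loads("""
{
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": ""
}
""")
print(check_lineage_record(record))  # prints [] for a well-formed record
```

An empty string, as in `"aggregation": ""` above, still counts as present; only missing or non-string fields are reported.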

## Evaluation Metrics

### Component Scores

- **Source Schema**: Exact match of schema/column names
- **Source Table**: F1 score + fuzzy matching of table references  
- **Transformation**: BLEU + weighted BLEU + AST similarity
- **Aggregation**: BLEU + weighted BLEU + AST similarity
- **Metadata**: BLEU + weighted BLEU + AST similarity (optional)
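
To make the source table component concrete, here is a rough sketch of how a set-based F1 score and a fuzzy string ratio could be mixed. It is not SLiCE's implementation: the comma-separated table format, the use of `difflib.SequenceMatcher` for fuzzy matching, and the 0.5/0.5 mixing weights are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def split_tables(text):
    """Split a comma-separated table string into a set of normalized names."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}


def table_f1(predicted, gold):
    """Set-based F1 between predicted and gold table references."""
    pred, ref = split_tables(predicted), split_tables(gold)
    if not pred and not ref:
        return 1.0  # both empty: treated as a perfect match (assumption)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def fuzzy_ratio(predicted, gold):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, predicted.lower(), gold.lower()).ratio()


def source_table_score(predicted, gold, f1_weight=0.5, fuzzy_weight=0.5):
    """Weighted mix of F1 and fuzzy similarity; the 0.5/0.5 split is assumed."""
    return f1_weight * table_f1(predicted, gold) + fuzzy_weight * fuzzy_ratio(predicted, gold)


# One extra predicted table: precision 1/2, recall 1/1.
print(table_f1("restaurants.ss, reviews.ss", "restaurants.ss"))
# Identical references score perfectly on both metrics.
print(source_table_score("restaurants.ss", "restaurants.ss"))
```

The F1 term rewards getting the set of tables right, while the fuzzy term gives partial credit for near-miss names such as a differing file extension.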

### Overall Score

The final score combines component scores using configurable weights:

```text
Overall = format_correctness × source_schema × (
    w₁ × source_table_score + 
    w₂ × transformation_score + 
    w₃ × aggregation_score +
    w₄ × metadata_score  # if applicable
)
```

Default weights: `source_table=0.4, transformation=0.4, aggregation=0.2`
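
The formula can be worked through by hand with a few lines of Python. This is a sketch of the arithmetic only, using the default weights stated above; the function name `overall_score` and the input values are made up for illustration and do not come from the package.

```python
DEFAULT_WEIGHTS = {"source_table": 0.4, "transformation": 0.4, "aggregation": 0.2}


def overall_score(format_correctness, source_schema, component_scores, weights=None):
    """Combine component scores as in the formula above.

    format_correctness and source_schema act as multiplicative gates; the
    remaining components are combined as a weighted sum.
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    weighted_sum = sum(weights[name] * component_scores[name] for name in weights)
    return format_correctness * source_schema * weighted_sum


# Hypothetical component scores: tables and transformation perfect, aggregation wrong.
scores = {"source_table": 1.0, "transformation": 1.0, "aggregation": 0.0}
print(round(overall_score(1.0, 1.0, scores), 4))  # 0.8
print(round(overall_score(1.0, 0.0, scores), 4))  # 0.0
```

Because `source_schema` multiplies the whole sum, a source schema score of zero drives the overall score to zero regardless of the other components.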

## Configuration

### Custom Weights

```python
# Component weights
weights = {
    'source_table': 0.5,
    'transformation': 0.3, 
    'aggregation': 0.2
}

# Metric weights for transformations
transformation_weights = {
    'bleu': 0.6,
    'weighted_bleu': 0.3,
    'ast': 0.1
}

evaluator = SchemaLineageEvaluator(
    weights=weights,
    transformation_weights=transformation_weights
)
```
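
As a rough illustration of what metric weights mean, the sketch below combines per-metric scores into one transformation score as a weighted sum. Both the weighted-sum form and the check that weights sum to 1 are assumptions made here, not documented SLiCE behavior, and `combine_metrics` is a hypothetical helper.

```python
def combine_metrics(metric_scores, metric_weights):
    """Weighted sum of per-metric scores, after checking the weights sum to 1."""
    total = sum(metric_weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(metric_weights[name] * metric_scores[name] for name in metric_weights)


transformation_weights = {"bleu": 0.6, "weighted_bleu": 0.3, "ast": 0.1}
# Made-up metric values for a single transformation comparison.
metric_scores = {"bleu": 0.5, "weighted_bleu": 0.5, "ast": 1.0}
print(round(combine_metrics(metric_scores, transformation_weights), 4))  # 0.55
```

With these weights, BLEU dominates the result, so a perfect AST match lifts the score only slightly.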

### Language Support

```python
# Custom syntax keyword sets for each supported language
evaluator = SchemaLineageEvaluator(
    sql_syntax={'SELECT', 'FROM', 'WHERE'},
    python_syntax={'def', 'class', 'import'},
    csharp_syntax={'using', 'namespace', 'class'}
)
```

## Examples

See the `examples/` directory for complete usage examples:

- `basic_usage.py`: Basic evaluation with default settings
- `custom_weights.py`: Using custom weights and configurations
- `batch_evaluation.py`: Processing multiple lineage pairs
- `sample_data_usage.py`: Using package sample data for evaluation
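
For a feel of what scoring multiple lineage pairs involves, here is a self-contained sketch that runs a scoring function over several pairs with a thread pool. It does not reproduce the contents of `batch_evaluation.py` or SLiCE's own batch API: `evaluate_batch` and the stand-in scorer `exact_match_score` are hypothetical names used so the example runs without the package installed.

```python
from concurrent.futures import ThreadPoolExecutor


def exact_match_score(predicted, ground_truth):
    """Stand-in scorer: fraction of the four fields with identical values."""
    fields = ("source_schema", "source_table", "transformation", "aggregation")
    matches = sum(predicted.get(f) == ground_truth.get(f) for f in fields)
    return matches / len(fields)


def evaluate_batch(pairs, score_fn, max_workers=4):
    """Score (predicted, ground_truth) pairs in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: score_fn(*pair), pairs))


gold = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "",
}
# Same record, but with a spurious aggregation.
pred = dict(gold, aggregation="COUNT() GROUP BY restaurant_id")

print(evaluate_batch([(gold, gold), (pred, gold)], exact_match_score))  # [1.0, 0.75]
```

With SLiCE installed, `score_fn` could instead wrap the evaluator from the Quick Start, for example `lambda p, g: evaluator.evaluate(p, g)["overall"]`.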

## Testing

Run the test suite:

```bash
# Using pip
pip install -e ".[dev]"
pytest

# Using uv (recommended)
uv sync --extra dev
uv run pytest

# Run with coverage
uv run pytest --cov=slice

# Run specific test file
uv run pytest tests/test_schema_lineage_evaluator.py -v

# Code quality checks
uv run black slice/ tests/     # Format code
uv run flake8 slice/           # Lint code  
uv run mypy slice/             # Type checking
```

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. A typical workflow:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for your changes
5. Run the test suite (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## Citation

If you use SLiCE in your research, please cite:

```bibtex
@software{slice2025,
  title={SLiCE: Schema Lineage Composite Evaluation},
  author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
  year={2025},
  url={https://github.com/microsoft/SLiCE}
}

@misc{yin2025schemalineageextractionscale,
      title={Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks}, 
      author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
      year={2025},
      eprint={2508.07179},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.07179}, 
}
```

## Support

- **Documentation**: [Link to documentation]
- **Issues**: [GitHub Issues](https://github.com/microsoft/SLiCE/issues)
- **Discussions**: [GitHub Discussions](https://github.com/microsoft/SLiCE/discussions)