# Aella Science Dataset Explorer

Interactive visualization and exploration of scientific papers from the Aella open science dataset.

This project is a collaboration between [Inference.net](https://inference.net) and [LAION](https://laion.ai). LAION curated the original dataset of roughly 100 million scraped scientific and research articles, and Inference.net fine-tuned a custom model to extract structured summaries from the articles. This repo contains a visual explorer for a small subset of the extracted dataset.

View the live explorer at [https://aella.inference.net](https://aella.inference.net).

## Overview

A web application for exploring scientific papers with semantic embeddings, dimensionality reduction, and clustering visualizations.

## Architecture

- **Frontend**: React + TypeScript + Vite with D3.js for interactive visualizations
- **Backend**: Python FastAPI serving data from SQLite (D1 in production)
- **Storage**: SQLite locally, Cloudflare D1 + R2 in production

## Prerequisites

You'll need the following tools installed:

- **Python 3.11+** - [Download](https://www.python.org/downloads/)
- **uv** - Python dependency management - [Install](https://docs.astral.sh/uv/getting-started/installation/)
- **bun** - JavaScript runtime - [Install](https://bun.sh/)
- **Task** - Task runner - [Install](https://taskfile.dev/installation/)

## Setup

Install all dependencies:

```bash
task setup
```

This will install both backend and frontend dependencies.

## Quick Start

### 1. Get the Database

Download the database from R2:

```bash
task db:setup
```

This will download the SQLite database to `backend/data/db.sqlite`.

### 2. Run the Application

Run the backend and frontend in separate terminals:

**Backend (Terminal 1):**

```bash
task backend:dev
```

**Frontend (Terminal 2):**

```bash
task frontend:dev
```

The application will be available at:

- Frontend: `http://localhost:5173`
- API: `http://localhost:8787`
- API Docs: `http://localhost:8787/docs`

## Data Pipeline

> The code for the data pipeline we used to construct this dataset is not yet open source, mostly because it was set up as a one-time process and is not production-ready.

The general process was as follows (a rough sketch of steps 2 and 3 appears after the list):

1. **Initial data extraction and filtering**
   - Ran a pipeline to generate the summaries
   - Excluded specific non-scientific content and failed summaries
   - Compiled results for further processing
2. **Semantic embedding**
   - Generated 768-dimensional embeddings using SPECTER2 (allenai/specter2_base)
   - Processed papers in batches with GPU acceleration support
   - Stored embeddings as binary blobs for similarity search
3. **Visualization & clustering**
   - Reduced embeddings to 2D coordinates using UMAP with cosine distance
   - Applied K-Means clustering with automatic optimization (20-60 clusters via silhouette scores)
   - Generated initial cluster labels using TF-IDF analysis of titles and fields
4. **LLM-curated labels**
   - Applied manually reviewed, domain-specific cluster labels
   - Improved interpretability over automated TF-IDF labels
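Since the pipeline code itself is not published, the following is only a minimal sketch of what steps 2 and 3 could look like. The model id `allenai/specter2_base` and the 20-60 cluster sweep come from the description above; the use of `transformers`, `umap-learn`, and `scikit-learn`, the batch size, and the helper names `embed_papers` and `project_and_cluster` are assumptions, not the actual implementation.

```python
# Hypothetical sketch of pipeline steps 2-3, NOT the closed-source pipeline.
# Assumes: transformers (SPECTER2), umap-learn (UMAP), scikit-learn (K-Means).
import numpy as np
import torch
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from transformers import AutoModel, AutoTokenizer


def embed_papers(texts: list[str], batch_size: int = 32) -> np.ndarray:
    """Encode paper texts (e.g. title + abstract) into 768-d SPECTER2 embeddings."""
    tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
    model = AutoModel.from_pretrained("allenai/specter2_base")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()

    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True,
                               max_length=512, return_tensors="pt").to(device)
            out = model(**inputs)
            # Use the [CLS] token embedding as the paper-level representation.
            chunks.append(out.last_hidden_state[:, 0, :].cpu().numpy())
    return np.concatenate(chunks)


def project_and_cluster(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Reduce to 2D with UMAP (cosine distance), then pick k via silhouette score."""
    coords = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    xy = coords.fit_transform(embeddings)

    best_score, best_labels = -1.0, None
    for k in range(20, 61, 5):  # coarse sweep over the 20-60 cluster range
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
        # Subsample so silhouette scoring stays tractable on large datasets.
        score = silhouette_score(embeddings, labels, sample_size=10_000, random_state=42)
        if score > best_score:
            best_score, best_labels = score, labels
    return xy, best_labels
```

In a real pipeline the embeddings would also be serialized, for example with `embeddings.astype(np.float32).tobytes()`, so they can be stored as the binary blobs mentioned in step 2 and reused for similarity search.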
## Deployment

Deploy to Cloudflare:

```bash
task deploy
```

This will prompt you to deploy the backend API and/or frontend.

## Contributing

We welcome contributions to this project! Here's what you should know:

**Bug Fixes & Minor Improvements**

- Bug fixes are always welcome! Please submit a PR with a clear description of the issue and fix.
- Minor improvements to documentation, code quality, or performance are appreciated.

**New Features**

- This project is intentionally scoped as a one-time preview of this dataset.
- We are generally not planning to expand the functionality much beyond its current scope.
- If you want to add significant new features, we encourage you to fork the project and build on it!

**Before Submitting a PR**

- Ensure your code passes linting and formatting checks:

  ```bash
  task check
  ```

- Keep changes focused and well-documented.
- Test your changes with sample data when applicable.

## License

MIT License - see [LICENSE](LICENSE) file for details.