# lilac

> Curate better data for LLMs.
> See our [3min walkthrough video](https://www.youtube.com/watch?v=RrcvVC3VYzQ)
## 🔥 Getting started
### 💻 Install
```sh
pip install "lilac[all]"
```
If you prefer not to install locally, you can duplicate our
[Spaces demo](https://lilacai-lilac.hf.space/) by following the documentation
[here](https://docs.lilacml.com/deployment/huggingface_spaces.html).
For more detailed instructions, see our
[installation guide](https://docs.lilacml.com/getting_started/installation.html).
### 🌐 Start a webserver
Start a Lilac webserver with our `lilac` CLI:
```sh
lilac start ~/my_project
```
Or start the Lilac webserver from Python:
```py
import lilac as ll
ll.start_server(project_dir='~/my_project')
```
This will start a webserver at http://localhost:5432/, where you can load datasets and
explore them.
### Lilac Garden
Lilac Garden is our hosted platform for running dataset-level computations. We use powerful GPUs
to accelerate expensive operations like clustering, embedding, and PII detection.
[Sign up](https://forms.gle/Gz9cpeKJccNar5Lq8) to join the pilot.
- Cluster and title **a million** data points in **20 mins**
- Embed your dataset at **half a billion** tokens per min
- Run your own signal
### 📊 Load data
Datasets can be loaded directly from HuggingFace, Parquet, CSV, JSON,
[LangSmith from LangChain](https://www.langchain.com/langsmith), SQLite,
[LlamaHub](https://llamahub.ai/), Pandas, and more. More documentation
[here](https://docs.lilacml.com/datasets/dataset_load.html).
```python
import lilac as ll
ll.set_project_dir('~/my_project')
dataset = ll.from_huggingface('imdb')
```
If you prefer, you can also load datasets directly from the UI without writing any Python.
### ✨ Clustering
Cluster any text column to get automated dataset insights:
```python
dataset = ll.get_dataset('local', 'imdb')
dataset.cluster('text') # add `use_garden=True` to offload to Lilac Garden
```
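Conceptually, clustering embeds each document as a vector and groups nearby vectors together. The toy sketch below illustrates that idea with plain Lloyd's k-means on 2-D points; it is an illustration only, not Lilac's implementation (Lilac uses learned embeddings and GPU-accelerated clustering):

```python
# Toy k-means sketch of the idea behind text clustering:
# embed items as vectors, then group nearby vectors.
import math

def kmeans(points, k, iters=10):
    """Cluster 2-D points into k groups with plain Lloyd's algorithm."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Two obvious groups of points stand in for two topics.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
clusters = kmeans(points, k=2)  # two clusters of three points each
```

In practice the vectors are high-dimensional embeddings and the cluster count is chosen automatically, but the grouping step follows the same principle.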
> [!TIP]
> Clustering on device can be slow or impractical, especially on machines without a powerful GPU or
> large memory. Offloading the compute to [Lilac Garden](https://www.lilacml.com/#garden), our
> hosted data processing platform, can speed up clustering by more than 100x.
### ⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Near-Duplicates, etc.)
Annotating data with signals will produce another column in your data.
```python
dataset = ll.get_dataset('local', 'imdb')
dataset.compute_signal(ll.LangDetectionSignal(), 'text') # Detect language of each doc.
# [PII] Find emails, phone numbers, ip addresses, and secrets.
dataset.compute_signal(ll.PIISignal(), 'text')
# [Text Statistics] Compute readability scores, number of chars, TTR, non-ascii chars, etc.
dataset.compute_signal(ll.TextStatisticsSignal(), 'text')
# [Near Duplicates] Computes clusters based on minhash LSH.
dataset.compute_signal(ll.NearDuplicateSignal(), 'text')
# Print the resulting manifest, with the new field added.
print(dataset.manifest())
```
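The near-duplicate signal is based on MinHash: documents whose shingle sets overlap heavily produce similar signatures, so near-duplicates can be found without pairwise string comparison. A toy sketch of that idea (illustrative only, not Lilac's `NearDuplicateSignal`, which adds LSH bucketing on top):

```python
# Toy MinHash sketch of near-duplicate detection: the fraction of
# matching signature positions approximates Jaccard similarity of
# the documents' shingle sets.
import hashlib

def shingles(text, n=3):
    """Character n-grams of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(sh, num_hashes=64):
    """Signature: for each seed, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f'{seed}:{s}'.encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_hashes)
    ]

def similarity(a, b):
    """Fraction of matching positions ~ Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = minhash(shingles('the quick brown fox jumps over the lazy dog'))
doc2 = minhash(shingles('the quick brown fox jumped over the lazy dog'))
doc3 = minhash(shingles('completely unrelated text about databases'))

near_dup = similarity(doc1, doc2)   # high: docs differ by one word
unrelated = similarity(doc1, doc3)  # low: almost no shared shingles
```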
Signals can also be computed from the UI.
### 🔎 Search
Semantic and conceptual search require computing an embedding first:
```python
dataset.compute_embedding('gte-small', path='text')
```
#### Semantic search
In the UI, we can search by semantic similarity or by classic keyword search to find chunks of
documents similar to a query.
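Under the hood, semantic search ranks documents by vector similarity between the query embedding and the precomputed document embeddings. A toy sketch of that ranking with made-up 3-dimensional vectors (in Lilac the vectors would come from `gte-small`):

```python
# Toy sketch of semantic search: rank documents by cosine similarity
# between a query vector and precomputed document vectors.
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up embeddings standing in for real model output.
docs = {
    'great movie, loved it': [0.9, 0.1, 0.0],
    'terrible plot, fell asleep': [0.1, 0.9, 0.0],
    'the popcorn was salty': [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a "positive review" query

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
```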
We can also label all rows matching a filter. In this case, we add the label "short" to every
document with a small number of characters, using the `num_characters` field produced by the
automatic `text_statistics` signal. We can do the same in Python:
```python
dataset.add_labels(
  'short',
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ],
)
```
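Each filter tuple above reads as `(path, operator, value)`: walk the nested field path, apply the comparison, keep matching rows. A pure-Python sketch of that semantics, where `apply_filter` is a hypothetical helper for illustration and not part of the Lilac API:

```python
# Sketch of the (path, operator, value) filter semantics used above.
# `apply_filter` is a hypothetical helper, NOT part of the Lilac API.
OPS = {
    'less': lambda a, b: a < b,
    'greater': lambda a, b: a > b,
    'equals': lambda a, b: a == b,
}

def get_path(row, path):
    """Walk a nested dict along a tuple path like ('text', 'text_statistics', ...)."""
    for key in path:
        row = row[key]
    return row

def apply_filter(rows, flt):
    """Keep rows whose value at `path` satisfies `op` against `value`."""
    path, op, value = flt
    return [r for r in rows if OPS[op](get_path(r, path), value)]

rows = [
    {'id': 'a', 'text': {'text_statistics': {'num_characters': 500}}},
    {'id': 'b', 'text': {'text_statistics': {'num_characters': 2000}}},
]
short = apply_filter(
    rows, (('text', 'text_statistics', 'num_characters'), 'less', 1000))
# Only row 'a' has fewer than 1000 characters.
```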
Labels can be exported for downstream tasks. Detailed documentation
[here](https://docs.lilacml.com/datasets/dataset_labels.html).
## 💬 Contact
For bugs and feature requests, please
[file an issue on GitHub](https://github.com/lilacai/lilac/issues).
For general questions, please [visit our Discord](https://discord.com/invite/jNzw9mC8pp).