# MOSAIC
**Repository Path**: ahlih_admin/MOSAIC
## Basic Information
- **Project Name**: MOSAIC
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-02-05
- **Last Updated**: 2026-02-05
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
This repository contains the official implementation of "[Collective Intelligence for AI-Assisted Chemical Synthesis](https://www.nature.com/articles/s41586-026-10131-4)".
## Overview
MOSAIC is a computational framework that fine-tunes the open-source Llama 3.1-8B-Instruct model into 2,498 specialized chemistry experts. It implements a scaling search paradigm that partitions chemical space into Voronoi regions and navigates between them.
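To make the Voronoi idea concrete, here is a minimal sketch, not the repository's implementation. Given a set of region centroids, a point belongs to the Voronoi region of its nearest centroid, and that region's index identifies an expert. The function name `assign_region`, the toy 2-D vectors, and the choice of Euclidean distance are assumptions for illustration only; the repository's notebooks use FAISS for the actual partitioning.

```python
import math

def assign_region(vector, centroids):
    """Return the index of the centroid nearest to `vector` (its Voronoi region)."""
    distances = [math.dist(vector, c) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)

# Toy 2-D "embeddings" (hypothetical); real molecular fingerprints are high-dimensional.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
queries = [(1.0, 2.0), (9.0, 1.0), (2.0, 8.0)]

regions = [assign_region(q, centroids) for q in queries]
print(regions)  # [0, 1, 2]
```

With thousands of centroids, a brute-force scan like this becomes slow, which is one reason a dedicated index library is a natural fit for this step.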
## Requirements
```text
python>=3.11.9
numpy==1.23.5
torch==2.4.1
pandas==1.5.3
rdkit==2023.9.6
tqdm==2.4.41
matplotlib==3.9.1
faiss==1.7.4
transformers==4.45.1
peft==0.12.0
trl==0.11.1
datasets==3.0.0
python-Levenshtein==0.26.0
```
For a complete list of dependencies, see [requirement.txt](requirement.txt).
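To confirm that an environment matches these pins before running the notebooks, a small check like the following may help. This is a hedged sketch using only the standard library; the helper `check_pins` and the subset of packages in `PINNED` are illustrative choices, not part of the repository.

```python
from importlib import metadata

# A subset of the pins listed in this README; extend as needed.
PINNED = {
    "numpy": "1.23.5",
    "torch": "2.4.1",
    "pandas": "1.5.3",
    "transformers": "4.45.1",
    "peft": "0.12.0",
    "trl": "0.11.1",
    "datasets": "3.0.0",
}

def check_pins(pins):
    """Map each package name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report

if __name__ == "__main__":
    for name, status in check_pins(PINNED).items():
        print(f"{name:<14}{status}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-tagged `torch` wheel) is reported as a mismatch even when the base version agrees.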
## Installation
1. Clone the repository:
```bash
git clone https://github.com/haoteli/MOSAIC.git
cd MOSAIC
```
2. Install dependencies:
```bash
pip install -r requirement.txt
```
Typical installation time on a normal desktop computer is 10-20 minutes.
3. Set up the Pistachio database, or use a custom database.
## Project Structure
```
MOSAIC/
├── DataProcessing/
│   ├── Get_Molecular_FP_MultiProcessing_LargeRAM.ipynb  # Generate molecular fingerprints
│   └── Create_Voronoi_Domains.ipynb  # Train FAISS and obtain Voronoi breakdown
├── KernelMetricNetwork/
│   └── OneNotebookForAll.ipynb  # Train kernel metric network
├── PredictionUtils/  # Core utilities for predictions
│   ├── ChemUtils.py  # Chemistry/RDKit related utilities
│   ├── NLPUtils.py  # NLP/referencing utilities
│   ├── PredictionUtils.py  # Main prediction functions
│   ├── Transformation_Model.py  # Kernel metric network
│   └── run_prediction.sh  # Bash execution script
├── Training/
│   ├── DownloadingModel.ipynb  # Download Llama model
│   ├── First_Finetuning/  # First fine-tuning for general knowledge exposure
│   │   ├── Multi_GPU_Submit_Optimizing.py
│   │   └── Submit_Training.sub
│   └── Expert_Finetuning/  # Continued training (Expert) to develop domain knowledge
│       ├── RSFP_Expert_Index_Finetuning.py
│       └── Submit_All_Expert_Trainings.ipynb
├── Main_Control.ipynb  # Main execution notebook with examples
└── requirement.txt  # Project dependencies
```
## Usage Guide
1. **Data Processing**
   - Start with the `DataProcessing` folder.
   - Generate molecular fingerprints using multi-processing.
   - Follow `Get_Molecular_FP_MultiProcessing_LargeRAM.ipynb`.
2. **Kernel Metric Network Training**
   - Navigate to the `KernelMetricNetwork` folder.
   - Train the network using `OneNotebookForAll.ipynb`.
3. **Create Voronoi Domains**
   - Return to the `DataProcessing` folder.
   - Generate Voronoi expert indices using FAISS.
   - Assign indices to Pistachio database entries.
   - Follow `Create_Voronoi_Domains.ipynb`.
4. **Model Training**
   - Download the Llama 3.1-8B-Instruct model using `DownloadingModel.ipynb`.
   - Fine-tune the base model using `Multi_GPU_Submit_Optimizing.py`.
   - Train the expert models using `RSFP_Expert_Index_Finetuning.py`.
   - Use the provided submission scripts (`Submit_Training.sub`, `Submit_All_Expert_Trainings.ipynb`) for batch processing.
5. **Running the Framework**
   - Execute `Main_Control.ipynb` for testing and examples.
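The data flow behind these steps can be summarized in a toy sketch: assign each database entry to an expert index, group entries into per-expert training sets, and route a new query to the expert(s) nearest to it. Everything below is hypothetical and for illustration only: the record fields, the helpers `build_expert_datasets` and `route_query`, the 2-D vectors standing in for fingerprint embeddings, and the top-`k` routing rule. The notebooks remain the authoritative procedure.

```python
import math
from collections import defaultdict

def nearest(vector, centroids):
    """Index of the centroid closest to `vector` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))

def build_expert_datasets(records, centroids):
    """Group records by the Voronoi region (expert index) of their embedding."""
    datasets = defaultdict(list)
    for record in records:
        expert = nearest(record["embedding"], centroids)
        datasets[expert].append(record["reaction"])
    return dict(datasets)

def route_query(embedding, centroids, k=1):
    """Return the indices of the k nearest experts, closest first."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(embedding, centroids[i]))
    return order[:k]

# Hypothetical centroids and records; real entries would come from the reaction database.
centroids = [(0.0, 0.0), (10.0, 10.0)]
records = [
    {"reaction": "rxn-A", "embedding": (1.0, 1.0)},
    {"reaction": "rxn-B", "embedding": (9.0, 8.0)},
    {"reaction": "rxn-C", "embedding": (0.5, 2.0)},
]

datasets = build_expert_datasets(records, centroids)
print(datasets)                                # {0: ['rxn-A', 'rxn-C'], 1: ['rxn-B']}
print(route_query((8.0, 9.0), centroids, k=2)) # [1, 0]
```

Each per-expert list would then serve as the training set for one expert fine-tuning run, and the routed indices would select which expert(s) to query at prediction time.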
## Citation
```bibtex
@article{li2026collective,
title={Collective intelligence for AI-assisted chemical synthesis},
author={Li, Haote and Sarkar, Sumon and Lu, Wenxin and Loftus, Patrick O and Qiu, Tianyin and Shee, Yu and Cuomo, Abbigayle E and Webster, John-Paul and Kelly, H and Manee, Vidhyadhar and others},
journal={Nature},
pages={1--3},
year={2026},
publisher={Nature Publishing Group}
}
```
## Contact
For questions about this work, please open an issue or contact the corresponding authors.
- Prof. Victor S. Batista (victor.batista@yale.edu) - Computational
- Prof. Timothy R. Newhouse (timothy.newhouse@yale.edu) - Experimental