
# MOSAIC

This repository contains the official implementation of "[Collective Intelligence for AI-Assisted Chemical Synthesis](https://www.nature.com/articles/s41586-026-10131-4)".

## Overview

MOSAIC is a computational framework that fine-tunes the open-source Llama 3.1-8B-Instruct model into 2,498 specialized chemistry experts. MOSAIC implements a scaling search paradigm that partitions and navigates chemical space through Voronoi regions.

## Requirements

```bash
python>=3.11.9
numpy==1.23.5
torch==2.4.1
pandas==1.5.3
rdkit==2023.9.6
tqdm==2.4.41
matplotlib==3.9.1
faiss==1.7.4
transformers==4.45.1
peft==0.12.0
trl==0.11.1
datasets==3.0.0
python-Levenshtein==0.26.0
```

For a complete list of dependencies, see `requirement.txt`.

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/haoteli/MOSAIC.git
   cd MOSAIC
   ```

2. Install dependencies:

   ```bash
   pip install -r requirement.txt
   ```

   Typical installation time on a normal desktop computer is 10-20 minutes.

3. Set up the Pistachio database, or use a custom database.

## Project Structure

```
MOSAIC/
├── DataProcessing/
│   ├── Get_Molecular_FP_MultiProcessing_LargeRAM.ipynb  # Generate molecular fingerprints
│   └── Create_Voronoi_Domains.ipynb                     # Train FAISS and obtain Voronoi breakdown
├── KernelMetricNetwork/
│   └── OneNotebookForAll.ipynb                          # Train kernel metric network
├── PredictionUtils/                                     # Core utilities for predictions
│   ├── ChemUtils.py                                     # Chemistry/RDKit related utilities
│   ├── NLPUtils.py                                      # NLP/referencing utilities
│   ├── PredictionUtils.py                               # Main prediction functions
│   ├── Transformation_Model.py                          # Kernel metric network
│   └── run_prediction.sh                                # Bash execution script
├── Training/
│   ├── DownloadingModel.ipynb                           # Download Llama model
│   ├── First_Finetuning/                                # First fine-tuning for general knowledge exposure
│   │   ├── Multi_GPU_Submit_Optimizing.py
│   │   └── Submit_Training.sub
│   └── Expert_Finetuning/                               # Continued training (Expert) to develop domain knowledge
│       ├── RSFP_Expert_Index_Finetuning.py
│       └── Submit_All_Expert_Trainings.ipynb
├── Main_Control.ipynb                                   # Main execution notebook with examples
└── requirement.txt                                      # Project dependencies
```

## Usage Guide

1. **Data Processing**
   - Start with the `DataProcessing` folder
   - Generate molecular fingerprints using multi-processing
   - Follow `Get_Molecular_FP_MultiProcessing_LargeRAM.ipynb`

2. **Kernel Metric Network Training**
   - Navigate to the `KernelMetricNetwork` folder
   - Train the network using `OneNotebookForAll.ipynb`

3. **Create Voronoi Domains**
   - Return to the `DataProcessing` folder
   - Generate Voronoi expert indices using FAISS
   - Assign indices to Pistachio database entries
   - Follow `Create_Voronoi_Domains.ipynb`

4. **Model Training**
   - Download the Llama 3.1-8B-Instruct model using `DownloadingModel.ipynb`
   - Fine-tune the base model using `Multi_GPU_Submit_Optimizing.py`
   - Train expert models using `RSFP_Expert_Index_Finetuning.py`
   - Use the provided submission scripts for batch processing

5. **Running the Framework**
   - Execute `Main_Control.ipynb` for testing and examples

## Citation

```bibtex
@article{li2026collective,
  title={Collective intelligence for AI-assisted chemical synthesis},
  author={Li, Haote and Sarkar, Sumon and Lu, Wenxin and Loftus, Patrick O and Qiu, Tianyin and Shee, Yu and Cuomo, Abbigayle E and Webster, John-Paul and Kelly, H and Manee, Vidhyadhar and others},
  journal={Nature},
  pages={1--3},
  year={2026},
  publisher={Nature Publishing Group}
}
```

## Contact

For questions about this work, please open an issue or contact the corresponding authors.

- Prof. Victor S. Batista (victor.batista@yale.edu) - Computational
- Prof. Timothy R. Newhouse (timothy.newhouse@yale.edu) - Experimental
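
## Appendix: Voronoi Expert Assignment Sketch

The Overview describes partitioning chemical space into Voronoi regions, and the Usage Guide assigns a Voronoi expert index to each database entry. As a minimal sketch of that idea, a query vector can be routed to the expert whose centroid is nearest. This sketch assumes squared Euclidean distance and plain Python lists; it stands in for, and does not reproduce, the FAISS index and kernel metric network used in the notebooks. The names `assign_expert` and `top_k_experts`, the toy centroids, and the option of consulting several neighboring regions are illustrative assumptions, not the repository's API.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_experts(query: Vector, centroids: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k centroids closest to the query, nearest first.

    Hypothetical helper: ranking several nearby regions is an assumption
    made for illustration only.
    """
    if not centroids:
        raise ValueError("at least one centroid is required")
    order = sorted(range(len(centroids)),
                   key=lambda i: squared_l2(query, centroids[i]))
    return order[:k]


def assign_expert(query: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the nearest centroid, i.e. the Voronoi cell containing the query."""
    return top_k_experts(query, centroids, 1)[0]


if __name__ == "__main__":
    # Toy 2-D centroids; real embeddings would be high-dimensional.
    toy_centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    toy_query = (8.0, 1.0)
    print(assign_expert(toy_query, toy_centroids))
    print(top_k_experts(toy_query, toy_centroids, 2))
```

A brute-force scan like this is only practical for small examples. In the repository, `Create_Voronoi_Domains.ipynb` is where FAISS is trained and the Voronoi breakdown is obtained.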