# Data-Efficient Graph Grammar Learning for Molecular Generation

This repository contains the implementation code for the paper [Data-Efficient Graph Grammar Learning for Molecular Generation](https://openreview.net/forum?id=l4IHywGq6a) (__ICLR 2022 oral__). In this work, we propose a data-efficient generative model (__DEG__) that can be learned from datasets orders of magnitude smaller than common benchmarks. At the heart of this method is a learnable graph grammar that generates molecules from a sequence of production rules. Our learned graph grammar yields state-of-the-art results on generating high-quality molecules for three monomer datasets that contain only ∼20 samples each.

![overview](assets/pipeline.png)

## Installation

### Prerequisites

- __Retro*:__ Training DEG relies on [Retro*](https://github.com/binghong-ml/retro_star) to calculate the metric. Follow the instructions [here](#conda) to install it.
- __Pretrained GNN:__ We use [this codebase](https://github.com/snap-stanford/pretrain-gnns) for the pretrained GNN used in our paper. The necessary code and pretrained models are already included in this repo.

### Conda

You can use ``conda`` to install the dependencies for DEG from the provided ``environment.yml`` file, which reproduces the exact Python environment we used for the paper:

```bash
git clone git@github.com:gmh14/data_efficient_grammar.git
cd data_efficient_grammar
conda env create -f environment.yml
conda activate DEG
pip install -e retro_star/packages/mlp_retrosyn
pip install -e retro_star/packages/rdchiral
```

>Note: it may take a while for conda to build the necessary wheels.

### Install ``Retro*``

- Download and unzip the files from this [link](https://www.dropbox.com/s/ar9cupb18hv96gj/retro_data.zip?dl=0), and put all the folders (``dataset/``, ``one_step_model/``, and ``saved_models/``) under the ``retro_star`` directory.
- Install dependencies:

```bash
conda deactivate
conda env create -f retro_star/environment.yml
conda activate retro_star_env
pip install -e retro_star/packages/mlp_retrosyn
pip install -e retro_star/packages/rdchiral
pip install setproctitle
```

## Train

For Acrylates, Chain Extenders, and Isocyanates:

```bash
conda activate DEG
python main.py --training_data=./datasets/**dataset_path**
```

where ``**dataset_path**`` can be ``acrylates.txt``, ``chain_extenders.txt``, or ``isocyanates.txt``.

For the Polymer dataset:

```bash
conda activate DEG
python main.py --training_data=./datasets/polymers_117.txt --motif
```

Since ``Retro*`` is the major bottleneck for training speed, we separate it from the main process: we run multiple ``Retro*`` processes and use file-based communication to evaluate the generated grammar during training (see the sketch at the end of this section). This is a workaround for the inefficiency of Python's built-in multiprocessing package. Run the following command in a separate terminal window:

```bash
conda activate retro_star_env
bash retro_star_listener.sh **num_processes**
```

>Note: running multiple ``Retro*`` processes is extremely memory-consuming (~5 GB each). We suggest starting with a single process (``bash retro_star_listener.sh 1``) and monitoring the memory usage, then increasing the number of processes accordingly to maximize efficiency. We used ``35`` in the paper.

After training finishes, kill all spawned ``Retro*`` processes by running

```bash
killall retro_star_listener
```
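For readers curious about the file-based hand-off mentioned above, the sketch below illustrates the general producer/consumer pattern: the trainer writes candidate molecules to a request file, and each listener polls for it, scores the molecules via ``Retro*``, and writes the results back. This is a minimal illustration only; the file names and the ``score`` callable are assumptions of this sketch, not the repository's actual protocol (see ``retro_star_listener.sh`` and the training code for that).

```python
import os
import time

# Hypothetical file names for illustration; the repo's actual protocol differs.
REQUEST_FILE = "retro_request_0.txt"
RESPONSE_FILE = "retro_response_0.txt"


def listener_loop(score):
    """Poll for a request file, score each SMILES, and write results back.

    `score` stands in for a Retro*-based synthesizability scorer,
    e.g. listener_loop(lambda smi: 1.0) with a dummy scorer.
    """
    while True:
        if not os.path.exists(REQUEST_FILE):
            time.sleep(1.0)  # avoid busy-waiting while no request is pending
            continue
        with open(REQUEST_FILE) as f:
            smiles_list = [line.strip() for line in f if line.strip()]
        os.remove(REQUEST_FILE)  # consume the request
        with open(RESPONSE_FILE, "w") as f:
            for smi in smiles_list:
                f.write(f"{smi}\t{score(smi)}\n")
```

The trainer side mirrors this pattern: it writes one request file per listener, waits for the corresponding response files, and aggregates the returned scores into the grammar's training signal.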
## Use DEG

Download and unzip the log and checkpoint files from this [link](https://drive.google.com/file/d/12g28WNAgRGzaLtuG6ESg25W-uzlNrpLQ/view?usp=sharing). See ``visualization.ipynb`` for more details.

## Acknowledgements

The implementation of DEG is partly based on [Molecular Optimization Using Molecular Hypergraph Grammar](https://github.com/ibm-research-tokyo/graph_grammar) and [Hierarchical Generation of Molecular Graphs using Structural Motifs](https://github.com/wengong-jin/hgraph2graph).

## Citation

If you find the idea or code useful for your research, please cite [our paper](https://openreview.net/forum?id=l4IHywGq6a):

```bib
@inproceedings{guo2021data,
  title={Data-Efficient Graph Grammar Learning for Molecular Generation},
  author={Guo, Minghao and Thost, Veronika and Li, Beichen and Das, Payel and Chen, Jie and Matusik, Wojciech},
  booktitle={International Conference on Learning Representations},
  year={2021}
}
```

## Contact

Please contact guomh2014@gmail.com if you have any questions. Enjoy!