# Multi-Objective Molecule Generation using Interpretable Substructures

This is the implementation of our ICML 2020 paper: https://arxiv.org/abs/2002.03244

## Property Predictors
The property predictors for GSK3 and JNK3 are provided in `data/gsk3/gsk3.pkl` and `data/jnk3/jnk3.pkl`. For example, to predict properties of given molecules, run
```
python properties.py --prop jnk3 < data/jnk3/rationales.txt
python properties.py --prop gsk3,jnk3 < data/dual_gsk3_jnk3/rationales.txt
```

## Rationale Extraction
The rationale extraction module will produce a list of triplets `(molecule, rationale, score)`, where `molecule` is an active compound, `rationale` is a subgraph that explains the property and `score` is its predicted score. The following script uses 4 CPU cores (can be adjusted with `--ncpu` argument):
```
python mcts.py --data data/jnk3/actives.txt --prop jnk3 --ncpu 4 > jnk3_rationales.txt
python mcts.py --data data/gsk3/actives.txt --prop gsk3 --ncpu 4 > gsk3_rationales.txt
```
To construct multi-property rationales, we can merge the single-property rationales for GSK3 and JNK3:
```
python merge_rationale.py --rationale1 data/gsk3/rationales.txt --rationale2 data/jnk3/rationales.txt > gsk3_jnk3.txt
```

## Generative Model Pre-training
The molecule completion model is pre-trained on the ChEMBL dataset.
To construct the training set, run
```
python preprocess.py --train data/chembl/all.txt --ncpu 4
mkdir chembl-processed
mv tensor-* chembl-processed
```
To train the molecule completion model, run
```
python gnn_train.py --train chembl-processed --save_dir ckpt/chembl-molgen
```

## GSK3 + JNK3 + QED + SA Molecule Design
This task seeks to design dual inhibitors against GSK3 and JNK3 with drug-likeness and synthetic accessibility constraints. We have already computed multi-property rationales in `data/gsk3_jnk3_qed_sa/rationales.txt`. It is a subset of GSK3-JNK3 rationales with QED > 0.6 and SA < 4.0.

### Step 1: Fine-tuning with Policy Gradient
Given a set of rationales, the model learns to complete them into full molecules. The molecule completion model has been pre-trained on ChEMBL, and it needs to be fine-tuned so that the generated molecules satisfy all the property constraints.
To fine-tune the model on the GSK3 + JNK3 + QED + SA task, run
```
python finetune.py \
  --init_model ckpt/chembl-h400beta0.3/model.20 --save_dir ckpt/tmp/ \
  --rationale data/gsk3_jnk3_qed_sa/rationales.txt --num_decode 200 --prop gsk3,jnk3,qed,sa --epoch 30 --alpha 0.5
```

### Step 2: Molecule Generation
The molecule generation script expands the extracted rationales into full molecules. The output is a list of pairs `(rationale, molecule)`, where `molecule` is the completion of `rationale`. In the following example, each rationale is completed 100 times, each time with a different sampled latent vector z.
```
python decode.py --model ckpt/gsk3_jnk3_qed_sa/model.final > outputs.txt
```

### Step 3: Evaluation
To evaluate the outputs for the four-property-constraint task, run
```
python properties.py --prop gsk3,jnk3,qed,sa < outputs.txt | python scripts/qed_sa_dual_eval.py --ref_path data/dual_gsk3_jnk3/actives.txt
```
Here `--ref_path` points to the file of reference molecules, which is used for computing the novelty score.
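
For a quick sanity check of scored outputs, the following is a minimal sketch of a success-rate calculation over the four constraints. It is not the repository's evaluation script. It assumes each scored line ends with four whitespace-separated numbers in the order `gsk3 jnk3 qed sa`, which is an assumption about the `properties.py` output format. The function name `success_rate` is hypothetical, and the 0.5 activity cutoff is an illustrative choice; only QED > 0.6 and SA < 4.0 come from the task description above. The sample lines are toy values, not real results.

```python
from typing import Iterable


def success_rate(lines: Iterable[str], act_min: float = 0.5,
                 qed_min: float = 0.6, sa_max: float = 4.0) -> float:
    """Fraction of scored molecules that satisfy all four constraints.

    Assumed line format (hypothetical): any leading columns, followed by
    four scores in the order gsk3, jnk3, qed, sa.
    """
    total = 0
    hits = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        try:
            gsk3, jnk3, qed, sa = (float(x) for x in fields[-4:])
        except ValueError:
            continue  # skip lines whose score columns are not numeric
        total += 1
        if gsk3 >= act_min and jnk3 >= act_min and qed > qed_min and sa < sa_max:
            hits += 1
    return hits / total if total else 0.0


# Toy input for illustration only (made-up scores).
sample = [
    "CCO 0.8 0.7 0.65 3.0",
    "CCN 0.9 0.2 0.70 2.5",
    "CCC 0.6 0.6 0.50 3.5",
    "c1ccccc1 0.55 0.9 0.9 1.5",
]
print(f"success rate: {success_rate(sample):.2f}")
```

To apply the same check to real data, pass the lines of a saved scores file to `success_rate` after confirming the column order against the actual output of `properties.py`.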