# moses

**Repository Path**: greitzmann/moses

## Basic Information

- **Project Name**: moses
- **Description**: Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-01-16
- **Last Updated**: 2021-01-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Molecular Sets (MOSES): A benchmarking platform for molecular generation models

[Build status](https://travis-ci.com/molecularsets/moses) [PyPI version](https://badge.fury.io/py/molsets)

Deep generative models are rapidly becoming popular for the discovery of new molecules and materials. Such models learn on a large collection of molecular structures and produce novel compounds. In this work, we introduce Molecular Sets (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and provides a set of metrics to evaluate the quality and diversity of generated molecules. With MOSES, we aim to standardize the research on molecular generation and facilitate the sharing and comparison of new models.

__For more details, please refer to the [paper](https://arxiv.org/abs/1811.12823).__

If you are using MOSES in your research paper, please cite us as

```
@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}
```

## Dataset

We propose [a benchmarking dataset](https://media.githubusercontent.com/media/molecularsets/moses/master/data/dataset_v1.csv) refined from the ZINC database. The set is based on the ZINC Clean Leads collection, which contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms, atoms other than C, N, S, O, F, Cl, Br, and H, or cycles longer than 8 atoms. The remaining molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters. The resulting dataset contains 1,936,962 molecular structures.

For experiments, we split the dataset into training, test, and scaffold test sets containing around 1.6M, 176k, and 176k molecules, respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that were not present in the training and test sets. We use this set to assess how well the model can generate previously unobserved scaffolds.

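The property-based part of the dataset filtering can be sketched as a simple predicate. This is a minimal illustration, not the code used to build the dataset: the `MolRecord` container and its field names are hypothetical, the descriptors are assumed to be precomputed by a cheminformatics toolkit, the molecular weight range is assumed to be inclusive, and the MCF and PAINS filters are not covered.

```python
from dataclasses import dataclass

# Element symbols permitted by the dataset description.
ALLOWED_ATOMS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MolRecord:
    """Hypothetical container for descriptors computed elsewhere."""
    name: str                # illustrative identifier, not a real SMILES string
    mol_weight: float        # Daltons
    rotatable_bonds: int
    xlogp: float
    atoms: frozenset         # element symbols present in the molecule
    has_charged_atoms: bool
    largest_ring_size: int   # 0 for acyclic molecules


def passes_property_filters(mol: MolRecord) -> bool:
    """Apply the weight, flexibility, lipophilicity, atom, charge and ring criteria."""
    return (
        250 <= mol.mol_weight <= 350          # assumed inclusive bounds
        and mol.rotatable_bonds <= 7
        and mol.xlogp <= 3.5
        and not mol.has_charged_atoms
        and mol.atoms <= ALLOWED_ATOMS        # subset test: no other elements
        and mol.largest_ring_size <= 8
    )


# Illustrative records with made-up descriptor values (not real measurements).
candidates = [
    MolRecord("mol-a", 300.0, 5, 2.1, frozenset({"C", "N", "O", "H"}), False, 6),
    MolRecord("mol-b", 420.0, 5, 2.1, frozenset({"C", "O", "H"}), False, 6),  # too heavy
    MolRecord("mol-c", 300.0, 5, 2.1, frozenset({"C", "P", "H"}), False, 6),  # contains P
]
kept = [m.name for m in candidates if passes_property_filters(m)]
```

Keeping the criteria in one predicate makes it easy to check each threshold against the description above.
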
## Models

* [Character-level Recurrent Neural Network (CharRNN)](./moses/char_rnn/README.md)
* [Variational Autoencoder (VAE)](./moses/vae/README.md)
* [Adversarial Autoencoder (AAE)](./moses/aae/README.md)
* [Junction Tree Variational Autoencoder (JTN-VAE)](https://github.com/wengong-jin/icml18-jtnn/tree/master/fast_molvae)
* [Latent Generative Adversarial Network (LatentGAN)](./moses/latentgan/README.md)

## Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to assess the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaf) are cosine similarities between the vectors of fragment or scaffold frequencies, respectively, of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of each generated molecule to its nearest molecule in the test set. Internal diversity (IntDiv) is one minus the average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference between the distributions of last-layer activations of ChemNet for the generated and test sets. Novelty is the fraction of unique valid generated molecules not present in the training set.

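The set-based metrics can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the MOSES implementation: molecules are assumed to arrive as already validated, canonical SMILES strings; Unique@k is assumed to be the fraction of distinct molecules among the first k valid ones; and Frag and Scaf are computed as a cosine similarity, since the results table marks them as higher-is-better. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def unique_at_k(valid_smiles, k):
    """Fraction of distinct strings among the first k valid molecules."""
    head = valid_smiles[:k]
    if not head:
        return 0.0
    return len(set(head)) / len(head)


def novelty(valid_smiles, train_smiles):
    """Fraction of unique valid generated molecules absent from the training set."""
    unique = set(valid_smiles)
    if not unique:
        return 0.0
    return len(unique - set(train_smiles)) / len(unique)


def cosine_similarity(ref_counts, gen_counts):
    """Cosine similarity between two frequency vectors stored as mappings."""
    keys = set(ref_counts) | set(gen_counts)
    dot = sum(ref_counts.get(key, 0) * gen_counts.get(key, 0) for key in keys)
    norm_ref = sqrt(sum(v * v for v in ref_counts.values()))
    norm_gen = sqrt(sum(v * v for v in gen_counts.values()))
    if norm_ref == 0 or norm_gen == 0:
        return 0.0
    return dot / (norm_ref * norm_gen)


# Toy inputs: four generated strings, one of which repeats.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO"]

# Fragment (or scaffold) frequencies would come from a cheminformatics toolkit;
# here they are hand-written placeholder counts.
test_fragments = Counter({"frag_a": 1, "frag_b": 2})
generated_fragments = Counter({"frag_a": 2, "frag_b": 4})
```

With these inputs, `unique_at_k(generated, 4)` counts three distinct strings out of four, and `novelty(generated, training)` counts two of the three unique strings as new.
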
Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | FCD/Test (↓) | FCD/TestSF (↓) | SNN/Test (↑) | SNN/TestSF (↑) | Frag/Test (↑) | Frag/TestSF (↑) | Scaf/Test (↑) | Scaf/TestSF (↑) | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Train | 1.0 | 1.0 | 1.0 | 0.008 | 0.4755 | 0.6419 | 0.5859 | 1.0 | 0.9986 | 0.9907 | 0.0 | 0.8567 | 0.8508 | 1.0 | 1.0 |
HMM | 0.076±0.0322 | 0.623±0.1224 | 0.5671±0.1424 | 24.4661±2.5251 | 25.4312±2.5599 | 0.3876±0.0107 | 0.3795±0.0107 | 0.5754±0.1224 | 0.5681±0.1218 | 0.2065±0.0481 | 0.049±0.018 | 0.8466±0.0403 | 0.8104±0.0507 | 0.9024±0.0489 | 0.9994±0.001 |
NGram | 0.2376±0.0025 | 0.974±0.0108 | 0.9217±0.0019 | 5.5069±0.1027 | 6.2306±0.0966 | 0.5209±0.001 | 0.4997±0.0005 | 0.9846±0.0012 | 0.9815±0.0012 | 0.5302±0.0163 | 0.0977±0.0142 | 0.8738±0.0002 | 0.8644±0.0002 | 0.9582±0.001 | 0.9694±0.001 |
Combinatorial | 1.0±0.0 | 0.9983±0.0015 | 0.9909±0.0009 | 4.2375±0.037 | 4.5113±0.0274 | 0.4514±0.0003 | 0.4388±0.0002 | 0.9912±0.0004 | 0.9904±0.0003 | 0.4445±0.0056 | 0.0865±0.0027 | 0.8732±0.0002 | 0.8666±0.0002 | 0.9557±0.0018 | 0.9878±0.0008 |
CharRNN | 0.9748±0.0264 | 1.0±0.0 | 0.9994±0.0003 | 0.0732±0.0247 | 0.5204±0.0379 | 0.6015±0.0206 | 0.5649±0.0142 | 0.9998±0.0002 | 0.9983±0.0003 | 0.9242±0.0058 | 0.1101±0.0081 | 0.8562±0.0005 | 0.8503±0.0005 | 0.9943±0.0034 | 0.8419±0.0509 |
AAE | 0.9368±0.0341 | 1.0±0.0 | 0.9973±0.002 | 0.5555±0.2033 | 1.0572±0.2375 | 0.6081±0.0043 | 0.5677±0.0045 | 0.991±0.0051 | 0.9905±0.0039 | 0.9022±0.0375 | 0.0789±0.009 | 0.8557±0.0031 | 0.8499±0.003 | 0.996±0.0006 | 0.7931±0.0285 |
VAE | 0.9767±0.0012 | 1.0±0.0 | 0.9984±0.0005 | 0.099±0.0125 | 0.567±0.0338 | 0.6257±0.0005 | 0.5783±0.0008 | 0.9994±0.0001 | 0.9984±0.0003 | 0.9386±0.0021 | 0.0588±0.0095 | 0.8558±0.0004 | 0.8498±0.0004 | 0.997±0.0002 | 0.6949±0.0069 |
JTN-VAE | 1.0±0.0 | 1.0±0.0 | 0.9996±0.0003 | 0.3954±0.0234 | 0.9382±0.0531 | 0.5477±0.0076 | 0.5194±0.007 | 0.9965±0.0003 | 0.9947±0.0002 | 0.8964±0.0039 | 0.1009±0.0105 | 0.8551±0.0034 | 0.8493±0.0035 | 0.976±0.0016 | 0.9143±0.0058 |
LatentGAN | 0.8966±0.0029 | 1.0±0.0 | 0.9968±0.0002 | 0.2968±0.0087 | 0.8281±0.0117 | 0.5371±0.0004 | 0.5132±0.0002 | 0.9986±0.0004 | 0.9972±0.0007 | 0.8867±0.0009 | 0.1072±0.0098 | 0.8565±0.0007 | 0.8505±0.0006 | 0.9735±0.0006 | 0.9498±0.0006 |
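
The similarity-based metrics in the table, SNN and IntDiv, can be sketched in the same spirit. This illustration makes several assumptions: fingerprints are represented as plain Python sets of on-bit indices (a stand-in for real molecular fingerprints), similarity is the Tanimoto coefficient, and IntDiv is taken as one minus the mean similarity over distinct pairs, which is a simplification chosen for readability rather than the exact MOSES formula.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def snn(gen_fps, ref_fps):
    """Mean, over generated molecules, of the similarity to the nearest reference molecule."""
    nearest = [max(tanimoto(g, r) for r in ref_fps) for g in gen_fps]
    return sum(nearest) / len(nearest)


def internal_diversity(gen_fps):
    """One minus the mean Tanimoto similarity over distinct pairs of generated molecules."""
    pairs = list(combinations(gen_fps, 2))
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Hand-written placeholder fingerprints (sets of on-bit indices).
generated_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
reference_fps = [frozenset({1, 2, 3}), frozenset({7, 8})]
```

Both functions assume non-empty inputs, and `internal_diversity` needs at least two fingerprints; a production implementation would guard those cases and vectorize the pairwise comparisons.
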