# DECIMER-Image-to-SMILES

- **Repository Path**: evani/DECIMER-Image-to-SMILES
- **Default Branch**: master
- **Created**: 2021-04-19
- **Last Updated**: 2024-11-28

This repository contains the network and the related scripts for autoencoder-based chemical image recognition.

### The project contains code which was written throughout the project (continuously updated)

#### Top-level directory layout
```bash
  ├── Network/                          # Main model and evaluator scripts
  +   ├ ─ Trainer_Image2Smiles.py       # Main training script - further could be modified for training
  +   ├ ─ I2S_Data.py                   # Data reader module for training
  +   ├ ─ I2S_Model.py                  # Autoencoder network
  +   ├ ─ Evaluate.py                   # To load trained model and evaluate an image (predicts SMILES)
  +   └ ─ I2S_evalData.py               # To load the tokenizer and the images for evaluation
  +
  ├── Utils/                            # Utilities used to generate the text data
  +   ├ ─ Deepsmiles_Encoder.py         # Used for encoding SMILES to DeepSMILES
  +   ├ ─ Deepsmiles_Decoder.py         # Used for decoding DeepSMILES to SMILES
  +   ├ ─ Smilesto_selfies.py           # Used for encoding SMILES to SELFIES
  +   ├ ─ Smilesto_selfies.py           # Used for decoding SELFIES to SMILES
  +   └ ─ Tanimoto_Calculator_Rdkit.py  # Calculates Tanimoto similarity on original vs. predicted SMILES
  +
  ├── LICENSE
  ├── Python_Requirements               # Python requirements needed to run the scripts without error
  └── README.md
```

## Installation of required dependencies:

### Installation of TensorFlow
- TensorFlow can be installed using pip; check the [TensorFlow](https://www.tensorflow.org/install) website for the installation guide. DECIMER can run on both CPU and GPU platforms.
- To install TensorFlow with GPU support, follow this [guide](https://www.tensorflow.org/install/gpu).

### Requirements
- matplotlib
- sklearn (installed via pip as scikit-learn)
- pillow
- deepsmiles

## How to set up the directories:
- Directories can be easily specified inside the scripts.
- The path to the SMILES data is specified in I2S_Data.py
- The path to the image data is specified in Trainer_Image2Smiles.py
- The checkpoints directory is created in the same folder as your Trainer script. If you would like to use a different path, modify it in Trainer_Image2Smiles.py.

#### Recommended layout of the directory
```bash
  ├── Image2SMILES/
  +   ├ ─ checkpoints/
  +   ├ ─ Trainer_Image2Smiles.py
  +   ├ ─ I2S_Data.py
  +   ├ ─ I2S_Model.py
  +   ├ ─ Evaluate.py
  +   └ ─ I2S_evalData.py
  +
  ├── Data/
  +   ├ ─ Train_Images/
  +   └ ─ DeepSMILES.txt
  +
  └── Predictions/
      └ ─ Utils/
```

## How to generate data and train Image2SMILES:

- Generating image data:
  - You can generate your images from SDF or SMILES files. The [DECIMER](https://github.com/Kohulan/DECIMER/tree/master/src/org/openscience/decimer) Java repository contains the scripts that were used to generate the training images in our case. Clone the repository, get the [CDK](https://cdk.github.io) libraries, and use them as referenced libraries to compile the scripts you want to use.
```bash
# e.g.:
javac -cp cdk-2.3.jar:. SmilesDepictor.java  # Compile the script in your local directory.
java -cp cdk-2.3.jar:. SmilesDepictor        # Run the compiled script.
```
  - The generated images should be placed under /Image2SMILES/Data/Train_Images/

- Generating text data:
  - Use the corresponding SDF or SMILES file to generate the text data. Here, the text data is [DeepSMILES](https://github.com/baoilleach/deepsmiles) strings. The DeepSMILES can be generated using Deepsmiles_Encoder.py under Utils. Split the DeepSMILES strings appropriately after generating them.
  - Place the DeepSMILES data under /Image2SMILES/Data/

### Training Image2SMILES
- After specifying the paths to the data correctly, you can train the Image2SMILES network on a GPU-enabled machine (training on a CPU can be much slower for a large number of images).
```bash
$ python3 Image2SMILES.py &> log.txt &
```
- After the training is finished, you can test the trained model on your own images using Evaluate.py. To generate a completely new set of test data, follow the same steps described above for generating the training data.

### Predicting using the trained model
- To use the trained model provided in the repository, follow these steps:
- The model is also available here: [Trained Model](https://storage.googleapis.com/decimer_weights/Trained_Models.zip) and should be placed under the Trained_Models directory
- Clone the repository
```
git clone https://github.com/Kohulan/DECIMER-Image-to-SMILES.git
```
- Change directory to the Network folder
```
cd DECIMER-Image-to-SMILES/Network
```
- Copy a sample image to the Network folder, check the path to the model inside Predictor.py and run
```
python3 Predictor.py --input sample.png
```

## License:
- This project is licensed under the MIT License - see the [LICENSE](https://github.com/Kohulan/Decimer-Python/blob/master/LICENSE) file for details

## Citation
- Use this BibTeX to cite
```
@article{Rajan2020,
abstract = {The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of Deep lEarning for Chemical ImagE Recognition (DECIMER), a deep learning method based on existing show-and-tell deep neural networks, which makes very few assumptions about the structure of the underlying problem. It translates a bitmap image of a molecule, as found in publications, into a SMILES. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach a comparable detection power with sufficient training time. Training success of DECIMER depends on the input data representation: DeepSMILES are superior over SMILES and we have a preliminary indication that the recently reported SELFIES outperform DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve near-accurate prediction with 50 to 100 million training structures. This work is entirely based on open-source software and open data and is available to the general public for any purpose.},
author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
doi = {10.1186/s13321-020-00469-w},
issn = {1758-2946},
journal = {Journal of Cheminformatics},
month = {dec},
number = {1},
pages = {65},
title = {{DECIMER: towards deep learning for chemical image recognition}},
url = {https://doi.org/10.1186/s13321-020-00469-w https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00469-w},
volume = {12},
year = {2020}
}
```

## Author:
- [Kohulan](https://github.com/Kohulan)

[![GitHub Logo](https://github.com/Kohulan/DECIMER-Image-to-SMILES/raw/master/assets/DECIMER.gif)](https://kohulan.github.io/Decimer-Official-Site/)

## Project Website
- [DECIMER](https://kohulan.github.io/Decimer-Official-Site/)

## Research Group
- [Website](https://cheminf.uni-jena.de)

![GitHub Logo](/assets/CheminfGit.png)
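
## Example: Tanimoto similarity

Utils/Tanimoto_Calculator_Rdkit.py compares original and predicted SMILES by Tanimoto similarity. The block below is a minimal sketch of the measure itself, not of that script: it computes the Tanimoto coefficient between two fingerprints represented as plain Python sets of on-bit indices. The function name `tanimoto`, the example bit indices, and the handling of two empty fingerprints are assumptions made for illustration; in practice the fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / union


if __name__ == "__main__":
    # Hypothetical on-bit indices standing in for the fingerprints of an
    # original structure and a predicted structure.
    original = {1, 4, 7, 9, 12}
    predicted = {1, 4, 7, 10}
    # 3 shared bits out of 6 distinct bits
    print(f"Tanimoto similarity: {tanimoto(original, predicted):.2f}")  # → 0.50
```

A value of 1.0 means the two fingerprints have exactly the same on-bits, and 0.0 means they share none.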