# RSscore

**Repository Path**: xu-chenyang/rsscore

## Basic Information

- **Project Name**: RSscore
- **Description**: RSscore for chemical reaction evaluation
- **Primary Language**: Unknown
- **License**: GPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-08-17
- **Last Updated**: 2024-04-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# RSscore: Reaction Superiority Learned from Reaction Mapping Hypergraph

RSscore evaluates the superiority of a reaction while accounting for the influence of reaction agents. It predicts the probability that a reaction is superior by constructing a reaction mapping graph from atom-mapping relationships.

## Environment
For the GIN-AIR and GIN-AIR-pre-train models:

- python == 3.8 or 3.9, dgl == 0.8.2, torch == 1.10.1, numpy == 1.24.3, umap-learn == 0.5.3, seaborn == 0.12.2, rdkit == 2023.9.4

For the other GNN models:

- python == 3.8 or 3.9, dgl == 0.9.1, torch == 1.10.1, numpy == 1.24.3, umap-learn == 0.5.3, seaborn == 0.12.2, rdkit == 2023.9.4

For the XGBoost, AdaBoost, and random forest models:

- python == 3.8 or 3.9, scikit-image == 0.18.3, scikit-learn-intelex == 2021.5.0, xgboost
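
Because the GNN environments differ only in the dgl version, it can help to check the active environment before training. The following is a minimal sketch, not part of the repository: the `GIN_AIR_PINS` dict and the `check_pins` helper are hypothetical names, and the sketch assumes each package is installed under the distribution name listed above (CUDA builds of dgl, for example, may be registered under a different name).

```python
from importlib import metadata

# Pins copied from the GIN-AIR environment listed above.
# The dict name is illustrative and does not come from the repository.
GIN_AIR_PINS = {
    "dgl": "0.8.2",
    "torch": "1.10.1",
    "numpy": "1.24.3",
    "umap-learn": "0.5.3",
    "seaborn": "0.12.2",
    "rdkit": "2023.9.4",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not met.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Wheels such as torch may carry a local suffix like "+cu113"; ignore it.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(GIN_AIR_PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

To check the environment for the other GNN models, the same helper can be called with `dgl` pinned to `0.9.1`.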

## Data Preparation
The data preparation code is mainly stored in the `.\Data_Prepare&Augmentation` directory. Data preparation first extracts the reaction time, temperature, and yield from the Open Reaction Database (https://open-reaction-database.org). The code for extracting reaction information can be found in ord-schema (https://github.com/open-reaction-database/ord-schema). Since most reaction yields are recorded in the reaction details part of ORD, `\RTAAM\extract_ORD\ORD2excel.py` is used to extract the reaction information.
`\Label_smoothing\select_good.py` and `\Label_smoothing\select_bad.py` are used to filter reactions and assign the corresponding labels to them.
The total reaction atom-atom mapping features of all extracted reactions are then completed with `\RTAAM\RAAM_generate\generate_reaction_fully_mapping.py`, and the reaction labels are smoothed with `\Label_smoothing\label_smoothing.py`.
This study obtained 2390k unlabelled reactions, 20k labelled reactions, and 1400 reactions with labels but without reaction atom mapping features. The results for the 20k labelled reactions are in `\RTAAM\smoothing_total.xlsx`.
The 1400 labelled reactions without atom-atom mapping relationships were completed manually to refine the data and to validate the necessity of unsupervised learning; the results are in `\outsider_test\The_necessity_of_contrastive_learning\outsider.xlsx`.
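
For readers unfamiliar with label smoothing, the sketch below shows the standard uniform variant for a binary label. It is a generic illustration only: the scheme actually implemented in the repository's label smoothing script may differ, and the function name, the `epsilon` default, and the encoding (1 = superior, 0 = inferior) are assumptions made here.

```python
def smooth_binary_label(label, epsilon=0.1):
    """Uniform label smoothing for a binary label.

    Assumed encoding: 1 = superior reaction, 0 = inferior reaction.
    Returns the soft target for the 'superior' class:
        y_smooth = y * (1 - epsilon) + epsilon / 2
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    return label * (1.0 - epsilon) + epsilon / 2.0
```

With `epsilon = 0`, the hard labels are returned unchanged; larger values pull both classes toward 0.5, which makes training less sensitive to noisy labels.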

## Model Training & Evaluations
The model training and evaluation code is mainly stored in the `./Models` directory. The training parameters for the Morgan fingerprint models are in `/MorganFP+RF_Adaboost_Xgboost/RF.py`. The training parameters for the GNN models are in `\{Sub_Model_name}\configs\simclr_Rxn.yaml` in each model subdirectory.
For pre-training, the SimCLR contrastive learning method is used to pre-train the initial model parameters; the code is in `\GINE-AIR-with_pretrain\main.py`.
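
SimCLR trains an encoder with the NT-Xent (normalized temperature-scaled cross-entropy) loss, which pulls the embeddings of two augmented views of the same sample together and pushes all other embeddings in the batch away. The sketch below is a plain-Python illustration of that loss and is not the repository's implementation; the function names, the list-of-lists embedding format, and the `temperature` default are assumptions.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over N positive pairs (view_a[i], view_b[i]).

    view_a and view_b are lists of N embedding vectors, one per augmented view.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)  # 2N embeddings in one batch
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of sample i
        positive = _cosine(z[i], z[j]) / temperature
        # Similarities to every other sample in the batch (self excluded).
        logits = [_cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / (2 * n)
```

For reaction graphs, `view_a[i]` and `view_b[i]` would be the encoder outputs for two augmented versions of the same reaction graph. A lower `temperature` sharpens the contrast between the positive pair and the rest of the batch.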
For the ablation models and the fine-tuned model, the training code is in `\{Sub_Model_name}\linear_fintune.py`, and the training records and model evaluation results are in `\{Sub_Model_name}\LOG\eval_data.log`.
The detailed results are listed in the table below. GINE-AIR with pre-training achieves the highest accuracy, F1-score, and AUC of all models compared, which indicates that the model can distinguish superior from inferior reactions.

| Model      | Accuracy | F1-Score | AUC |
| ----------- | ----------- | ----------- | ----------- |
| Morgan FP-RF | 0.692 | 0.733 | 0.779 |
| Morgan FP-Adaboost | 0.749 | 0.753 | 0.824 |
| Morgan FP-Xgboost | 0.779 | 0.788 | 0.861 |
| GCN-None | 0.853 | 0.848 | 0.927 |
| GAT-None | 0.864 | 0.861 | 0.912 |
| SAGE-None | 0.865 | 0.865 | 0.936 |
| GINE-None | 0.884 | 0.885 | 0.940 |
| GCN-AIR | 0.876 | 0.878 | 0.920 |
| GAT-AIR | 0.871 | 0.873 | 0.942 |
| SAGE-AIR | 0.877 | 0.880 | 0.942 |
| GINE-AIR | 0.897 | 0.898 | 0.951 |
| GINE-AIR with pre-training | 0.903 | 0.903 | 0.965 |
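
The three metrics in the table can be computed from true labels and model outputs as sketched below. This is an illustrative plain-Python version under stated assumptions: the labels are binary with 1 = superior, the F1-score is the binary F1 of the positive class, and the function names are hypothetical. The repository's evaluation code may use a library implementation or a different averaging mode.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Accuracy and F1 depend on the threshold used to turn probabilities into labels, while AUC is computed directly from the predicted probabilities and is threshold-independent.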

## How to use
For the reaction evaluation task, run:
```
python ./reaction_evaluation/model_test.py
```
For the route evaluation task, run:
```
python ./synthetic_route_analysis/calculate_Route_RS.py
```

## Reaction graph augmentation 
The reaction graph augmentation is designed to avoid a problem of traditional graph data augmentation methods, which destroy the reaction or molecule structure and the rationality of the representation. The details of the reaction graph augmentation algorithm are in `./Data_Prepare&Augmentation/Data_Augmentation`.