# sAMPpred-GAT

The implementation of the paper ***sAMPpred-GAT: Prediction of Antimicrobial Peptide by Graph Attention Network and Predicted Peptide Structure***.

## Requirements

The major dependencies used in this project are as follows:

```
python 3.7
numpy 1.21.6
tqdm 4.64.1
pyyaml 6.0
scikit-learn 1.0.2
torch 1.11.0+cu113
torch-cluster 1.6.0
torch-scatter 2.0.9
torch-sparse 0.6.15
torch-geometric 1.7.2
tensorflow 1.14.0
tensorboardX 2.5.1
```

The full list of Python libraries used in this project is given in `requirements.txt`. Check your GPU device and install PyTorch and PyG (torch-cluster, torch-scatter, torch-sparse, torch-geometric) according to your CUDA version.

> **Note** that torch-geometric 1.7.2 and tensorflow 1.14.0 are required, because our trained model does not support higher versions of `torch-geometric`, and the model from trRosetta does not support higher versions of `tensorflow`.

> The installed PyG packages (torch-cluster, torch-scatter, torch-sparse, torch-geometric) must be GPU builds matching your CUDA version. If you install a wrong version, you will get unexpected errors such as https://github.com/rusty1s/pytorch_scatter/issues/248 and https://github.com/pyg-team/pytorch_geometric/issues/2040.

We provide the installation commands for PyTorch and PyG in our environment for reference:

```
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

```
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==1.7.2 -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
```
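To confirm that the installed wheels actually match your CUDA toolkit (and to avoid the issues linked above), a small sanity check along the following lines may help. This is a minimal sketch, not part of the repository; the expected version numbers are simply the ones pinned in the requirements.

```python
# Sanity check for the pinned environment above (sketch, not part of the repository).
import torch
import torch_geometric
import torch_scatter

print("torch:", torch.__version__)                       # expected 1.11.0+cu113
print("CUDA available:", torch.cuda.is_available())      # should be True on a GPU machine
print("CUDA build:", torch.version.cuda)                 # expected 11.3
print("torch-geometric:", torch_geometric.__version__)   # expected 1.7.2

# A CPU-only torch-scatter wheel is a common cause of the linked errors:
# a scatter op on CUDA tensors should only succeed with the GPU build installed.
if torch.cuda.is_available():
    src = torch.ones(4, device="cuda")
    index = torch.tensor([0, 0, 1, 1], device="cuda")
    print(torch_scatter.scatter_add(src, index))          # tensor([2., 2.], device='cuda:0')
```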
## Tools

Two multiple sequence alignment tools and three databases are required:

```
psi-blast 2.12.0
hhblits 3.3.0
```

Databases:

```
nrdb90 (http://bliulab.net/sAMPpred-GAT/static/download/nrdb90.tar.gz)
NR (https://ftp.ncbi.nlm.nih.gov/blast/db/)
uniclust30_2018_08 (https://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz)
```

**nrdb90**: We have supplied the nrdb90 database on our webserver. Put it into the `utils/psiblast/` directory and decompress it.

**NR**: You can download the NR database from `https://ftp.ncbi.nlm.nih.gov/blast/db/`. Note that only the files matching `nr.*` are needed. Download them and put them into the `utils/psiblast/nr/` directory. The `utils/psiblast/nr/` folder should contain `nr.00.psq`, `nr.00.ppi`, ..., `nr.54.phd`, etc.

**uniclust30_2018_08**: You can download this database from `https://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz`. Decompress it in the `utils/hhblits/` directory and rename the database folder to `uniclust30_2018_08`.

**trRosetta**: The structures are predicted by trRosetta (https://github.com/gjoni/trRosetta). Download the trRosetta pretrained model (https://files.ipd.uw.edu/pub/trRosetta/model2019_07.tar.bz2) and decompress it into `utils/trRosetta/`.

> **Note** that all the default paths of the tools and databases are shown in `config.yaml`.

You can change the paths of the tools and databases by configuring `config.yaml` as you need. `psi-blast` and `hhblits` are recommended to be added to the system environment path. You can follow these steps to install them:

### How to install psiblast

Download:

```
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.12.0/ncbi-blast-2.12.0+-x64-linux.tar.gz
tar zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
```

Add the path to the system environment in `~/.bashrc`:

```
export BLAST_HOME={your_path}/ncbi-blast-2.12.0+
export PATH=$PATH:$BLAST_HOME/bin
```

Finally, reload the system environment and check the psiblast command:

```
source ~/.bashrc
psiblast -h
```

### How to install hhblits

You can download and install hhblits quickly through `conda`:

```
conda install -c conda-forge -c bioconda hhsuite==3.3.0
```

Check the installation:

```
hhblits -h
```

## Feature extraction

`generate_features.py` is the entry point of the feature extraction process. A usage example is shown in `generate_features_example.sh`. Run the example by:

```
chmod +x generate_features_example.sh
./generate_features_example.sh
```

The features of the examples will be generated if your tools and databases are configured correctly. Some common errors:

+ `BLAST Database error` means that nrdb90 or NR could not be found.
+ `ERROR: could not open file ... uniclust30_2018_08_cs219.ffdata` means that uniclust30_2018_08 could not be found.

If you want to generate features for your own file in fasta format, just follow `generate_features_example.sh` and change the paths to yours.

## Usage

It takes 3 steps to train/test our model:

(1) Copy the train/test source files in fasta format, which are supplied in the `datasets` folder, into the `data` folder.

(2) Generate the features, including the predicted structures and the sequential features.

(3) Train / test.

`train.py` and `test.py` are used for training and testing, respectively. Run `python train.py -h` and `python test.py -h` to learn the meaning of each parameter.

The input folder should look like:

```
-positive/
 XXX(name of the positive file).fasta
--pssm/
---output/
----A.pssm
----B.pssm
---- ...
--hhm/
---output/
----A.hhm
----B.hhm
---- ...
--npz/
---A.npz
---B.npz
-negative/
 XXX(name of the negative file).fasta
--pssm/
---output/
----C.pssm
----D.pssm
---- ...
--hhm/
---output/
----C.hhm
----D.hhm
---- ...
--npz/
---C.npz
---D.npz
```

The script `generate_features_example.sh` generates exactly this folder structure; just follow the example to build the input folder (a quick layout check is sketched below).

> **Note** that before you train and test the model, you must successfully run `generate_features_example.sh`.
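If you assemble the input folder yourself instead of relying on `generate_features_example.sh`, a quick check of the layout against the tree above can save a failed run. The snippet below is a minimal sketch; the root path `data/train_data` is only an example and should point at whichever input folder you use.

```python
# Sketch: verify an input folder matches the layout shown above.
# The root path is an example; point it at your own train/test input folder.
import glob
import os

root = "data/train_data"
for label in ("positive", "negative"):
    base = os.path.join(root, label)
    fastas = glob.glob(os.path.join(base, "*.fasta"))
    pssms = glob.glob(os.path.join(base, "pssm", "output", "*.pssm"))
    hhms = glob.glob(os.path.join(base, "hhm", "output", "*.hhm"))
    npzs = glob.glob(os.path.join(base, "npz", "*.npz"))
    print(f"{label}: {len(fastas)} fasta, {len(pssms)} pssm, "
          f"{len(hhms)} hhm, {len(npzs)} npz")
    # Each sequence should have exactly one .pssm, one .hhm and one .npz file.
    if not (len(pssms) == len(hhms) == len(npzs)):
        print(f"  warning: feature counts differ under {base}")
```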
### Test

A trained model for XUAMP is supplied in `saved_models/samp.model` as an example. Run `test.py` to predict the example sequences:

```
python test.py
```

If you want to test a specific dataset, for example XUAMP, copy the corresponding fasta files from the `datasets/independent test datasets/` directory into `data/test_data/positive/` and `data/test_data/negative/`, and set the ***args*** according to your inputs. An example is given by `test.sh`:

```
chmod +x test.sh
./test.sh
```

### Train

If you want to train a model on a specific dataset, for example XUAMP, copy the fasta files from the `datasets/train datasets/` directory into `data/train_data/positive/` and `data/train_data/negative/`, and set the ***args*** according to your inputs. An example is given by `train.sh`:

```
chmod +x train.sh
./train.sh
```

When the training process finishes, `saved_models/auc_XU_final.model` will be the model optimized by AUC, as introduced in the paper. (We have supplied a well-trained model of this kind and renamed it to `samp.model`.)
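Since the saved model is the one selected by AUC, it can be useful to recompute the metric from the labels and prediction scores of a test run. The snippet below is only an illustration with placeholder arrays; it uses scikit-learn, which is already listed in the requirements.

```python
# Illustration: computing AUC with scikit-learn (placeholder labels and scores).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 0, 0, 1, 0])                      # 1 = AMP, 0 = non-AMP (placeholder labels)
y_score = np.array([0.92, 0.71, 0.33, 0.08, 0.64, 0.41])   # placeholder predicted scores

print("AUC:", roc_auc_score(y_true, y_score))
```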