# sAMPpred-GAT

The implementation of the paper ***sAMPpred-GAT: Prediction of Antimicrobial Peptide by Graph Attention Network and Predicted Peptide Structure***.

## Requirements

The major dependencies used in this project are as follows:

```
python 3.7
numpy 1.21.6
tqdm 4.64.1
pyyaml 6.0
scikit-learn 1.0.2
torch 1.11.0+cu113
torch-cluster 1.6.0
torch-scatter 2.0.9
torch-sparse 0.6.15
torch-geometric 1.7.2
tensorflow 1.14.0
tensorboardX 2.5.1
```

The full list of Python libraries used in this project is given in `requirements.txt`. Check your GPU device and install PyTorch and PyG (torch-cluster, torch-scatter, torch-sparse, torch-geometric) according to your CUDA version.

> **Note** that torch-geometric 1.7.2 and tensorflow 1.14.0 are required, because our trained model does not support higher versions of `torch-geometric`, and the model from trRosetta does not support higher versions of `tensorflow`.

> The installed PyG packages (torch-cluster, torch-scatter, torch-sparse, torch-geometric) must be GPU builds matching your CUDA version. If you install a wrong version, you will get unexpected errors such as https://github.com/rusty1s/pytorch_scatter/issues/248 and https://github.com/pyg-team/pytorch_geometric/issues/2040.

We provide the installation commands for PyTorch and PyG in our environment for reference:

```
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

```
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==1.7.2 -f https://data.pyg.org/whl/torch-1.11.0+cu113.html
```
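To confirm that the installed wheels actually match your CUDA toolkit (and to avoid the issues linked above), a small sanity check along the following lines may help. This is a minimal sketch, not part of the repository; the expected version numbers are simply the ones pinned in the requirements.

```python
# Sanity check for the pinned environment above (sketch, not part of the repository).
import torch
import torch_geometric
import torch_scatter

print("torch:", torch.__version__)                       # expected 1.11.0+cu113
print("CUDA available:", torch.cuda.is_available())      # should be True on a GPU machine
print("CUDA build:", torch.version.cuda)                 # expected 11.3
print("torch-geometric:", torch_geometric.__version__)   # expected 1.7.2

# A CPU-only torch-scatter wheel is a common cause of the linked errors:
# a scatter op on CUDA tensors should only succeed with the GPU build installed.
if torch.cuda.is_available():
    src = torch.ones(4, device="cuda")
    index = torch.tensor([0, 0, 1, 1], device="cuda")
    print(torch_scatter.scatter_add(src, index))          # tensor([2., 2.], device='cuda:0')
```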
## Tools

Two multiple sequence alignment tools and three databases are required:

```
psi-blast 2.12.0
hhblits 3.3.0
```

Databases:

```
nrdb90 (http://bliulab.net/sAMPpred-GAT/static/download/nrdb90.tar.gz)
NR (https://ftp.ncbi.nlm.nih.gov/blast/db/)
uniclust30_2018_08 (https://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz)
```

**nrdb90**: We have supplied the nrdb90 database on our webserver. Put it into the `utils/psiblast/` directory and decompress it.

**NR**: You can download the NR database from `https://ftp.ncbi.nlm.nih.gov/blast/db/`. Note that only the files matching `nr.*` are needed. Download them and put them into the `utils/psiblast/nr/` directory. The `utils/psiblast/nr/` folder should contain `nr.00.psq`, `nr.00.ppi`, ..., `nr.54.phd`, etc.

**uniclust30_2018_08**: You can download this database from `https://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz`. Decompress it in the `utils/hhblits/` directory and rename the database folder to `uniclust30_2018_08`.

**trRosetta**: The structures are predicted by trRosetta (https://github.com/gjoni/trRosetta). Download the trRosetta pretrained model (https://files.ipd.uw.edu/pub/trRosetta/model2019_07.tar.bz2) and decompress it into `utils/trRosetta/`.

> **Note** that all the default paths of the tools and databases are shown in `config.yaml`.

You can change the paths of the tools and databases by configuring `config.yaml` as you need. `psi-blast` and `hhblits` are recommended to be added to the system environment path. You can follow these steps to install them:

### How to install psiblast

Download:

```
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.12.0/ncbi-blast-2.12.0+-x64-linux.tar.gz
tar zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
```

Add the path to the system environment in `~/.bashrc`:

```
export BLAST_HOME={your_path}/ncbi-blast-2.12.0+
export PATH=$PATH:$BLAST_HOME/bin
```

Finally, reload the system environment and check the psiblast command:

```
source ~/.bashrc
psiblast -h
```

### How to install hhblits

You can download and install hhblits quickly through `conda`:

```
conda install -c conda-forge -c bioconda hhsuite==3.3.0
```

Check the installation:

```
hhblits -h
```

## Feature extraction

`generate_features.py` is the entry point of the feature extraction process. A usage example is shown in `generate_features_example.sh`. Run the example by:

```
chmod +x generate_features_example.sh
./generate_features_example.sh
```

The features of the examples will be generated if your tools and databases are configured correctly. Some common errors:

+ `BLAST Database error` means that nrdb90 or NR could not be found.
+ `ERROR: could not open file ... uniclust30_2018_08_cs219.ffdata` means that uniclust30_2018_08 could not be found.

If you want to generate features for your own file in fasta format, just follow `generate_features_example.sh` and change the paths to yours.

## Usage

It takes 3 steps to train/test our model:

(1) Copy the train/test source files in fasta format, which are supplied in the `datasets` folder, into the `data` folder.

(2) Generate the features, including the predicted structures and the sequential features.

(3) Train / test.

`train.py` and `test.py` are used for training and testing, respectively. Run `python train.py -h` and `python test.py -h` to learn the meaning of each parameter.

The input folder should look like:

```
-positive/
 XXX(name of the positive file).fasta
--pssm/
---output/
----A.pssm
----B.pssm
---- ...
--hhm/
---output/
----A.hhm
----B.hhm
---- ...
--npz/
---A.npz
---B.npz
-negative/
 XXX(name of the negative file).fasta
--pssm/
---output/
----C.pssm
----D.pssm
---- ...
--hhm/
---output/
----C.hhm
----D.hhm
---- ...
--npz/
---C.npz
---D.npz
```

The script `generate_features_example.sh` generates exactly this folder structure; just follow the example to build the input folder (a quick layout check is sketched below).

> **Note** that before you train and test the model, you must successfully run `generate_features_example.sh`.
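If you assemble the input folder yourself instead of relying on `generate_features_example.sh`, a quick check of the layout against the tree above can save a failed run. The snippet below is a minimal sketch; the root path `data/train_data` is only an example and should point at whichever input folder you use.

```python
# Sketch: verify an input folder matches the layout shown above.
# The root path is an example; point it at your own train/test input folder.
import glob
import os

root = "data/train_data"
for label in ("positive", "negative"):
    base = os.path.join(root, label)
    fastas = glob.glob(os.path.join(base, "*.fasta"))
    pssms = glob.glob(os.path.join(base, "pssm", "output", "*.pssm"))
    hhms = glob.glob(os.path.join(base, "hhm", "output", "*.hhm"))
    npzs = glob.glob(os.path.join(base, "npz", "*.npz"))
    print(f"{label}: {len(fastas)} fasta, {len(pssms)} pssm, "
          f"{len(hhms)} hhm, {len(npzs)} npz")
    # Each sequence should have exactly one .pssm, one .hhm and one .npz file.
    if not (len(pssms) == len(hhms) == len(npzs)):
        print(f"  warning: feature counts differ under {base}")
```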
### Test

A trained model for XUAMP is supplied in `saved_models/samp.model` as an example. Run `test.py` to predict the example sequences:

```
python test.py
```

If you want to test a specific dataset, for example XUAMP, copy the corresponding fasta files from the `datasets/independent test datasets/` directory into `data/test_data/positive/` and `data/test_data/negative/`, and set the ***args*** according to your inputs. An example is given by `test.sh`:

```
chmod +x test.sh
./test.sh
```

### Train

If you want to train a model on a specific dataset, for example XUAMP, copy the fasta files from the `datasets/train datasets/` directory into `data/train_data/positive/` and `data/train_data/negative/`, and set the ***args*** according to your inputs. An example is given by `train.sh`:

```
chmod +x train.sh
./train.sh
```

When the training process finishes, `saved_models/auc_XU_final.model` will be the model optimized by AUC, as introduced in the paper. (We have supplied a well-trained model of this kind and renamed it to `samp.model`.)
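Since the saved model is the one selected by AUC, it can be useful to recompute the metric from the labels and prediction scores of a test run. The snippet below is only an illustration with placeholder arrays; it uses scikit-learn, which is already listed in the requirements.

```python
# Illustration: computing AUC with scikit-learn (placeholder labels and scores).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 0, 0, 1, 0])                      # 1 = AMP, 0 = non-AMP (placeholder labels)
y_score = np.array([0.92, 0.71, 0.33, 0.08, 0.64, 0.41])   # placeholder predicted scores

print("AUC:", roc_auc_score(y_true, y_score))
```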