# PSSpred

**Repository Path**: nickcafferry/PSSpred

## Basic Information

- **Project Name**: PSSpred
- **Description**: PSSpred (Protein Secondary Structure prediction) is a simple neural network training algorithm for accurate protein secondary structure prediction. It first collects multiple sequence alignments using PSI-BLAST. Amino-acid frequency and log-odds data with Henikoff weights are then used to train secondary structure, separately, based on the Rumelhart error backpropagation method. The final secondary structure prediction result is a combination of 7 neural network predictors from different profile
- **Primary Language**: Perl
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 1
- **Created**: 2020-09-22
- **Last Updated**: 2024-03-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

`PSSpred `_
===============

|Documentation Status| |Appveyor| |Workflow| |Licence| |Travis| |Codecov| |Gitter| |Circleci|

.. |Workflow| image:: https://github.com/nickcafferry/PSSpred/workflows/PSSpred/badge.svg
   :target: https://github.com/nickcafferry/PSSpred/actions/runs/263139727

.. |Licence| image:: https://img.shields.io/badge/license-MIT-blue.svg?style=flat
   :target: http://choosealicense.com/licenses/mit/

.. |Travis| image:: https://travis-ci.com/nickcafferry/PSSpred.svg?branch=master
   :target: https://travis-ci.com/nickcafferry/PSSpred

.. |Codecov| image:: https://codecov.io/gh/nickcafferry/PSSpred/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/nickcafferry/PSSpred

.. |Appveyor| image:: https://ci.appveyor.com/api/projects/status/j5e243jmixcnqpy2?svg=true
   :target: https://ci.appveyor.com/project/nickcafferry/psspred

.. |Gitter| image:: https://badges.gitter.im/PSSpred/community.svg
   :target: https://gitter.im/PSSpred/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge

.. |Circleci| image:: https://circleci.com/gh/nickcafferry/PSSpred.svg?style=svg
   :target: https://circleci.com/gh/nickcafferry/PSSpred

.. |Documentation Status| image:: https://readthedocs.org/projects/psspred/badge/?version=latest
   :target: https://psspred.readthedocs.io/en/latest/?badge=latest

Copyright |copy| Wei MEI, |MLMS (TM)| |---| all rights reserved. |bamboo|

.. |copy| unicode:: 0xA9 .. copyright sign
.. |MLMS (TM)| unicode:: MLMS U+2122
   .. with trademark sign
.. |---| unicode:: U+02014 .. em dash
   :trim:
.. |bamboo| unicode:: 0x1F024 .. bamboo

A simple `neural network training algorithm `_ for accurate `protein secondary structure `_ prediction (`PSSpred `_ )! See `documentation `_ for more details.

PSSpred (`Protein Secondary Structure `_ prediction) is a simple `neural network training algorithm `_ for accurate `protein secondary structure `_ prediction. It first collects multiple sequence alignments using `PSI-BLAST `_. Amino-acid frequency and log-odds data with `Henikoff weights `_ are then used to train secondary structure, separately, based on the `Rumelhart error backpropagation method `_. The final secondary structure prediction result is a combination of 7 neural network predictors from different profile data and parameters. The program is freely downloadable on this page.

We have a community chat at `Gitter `_. Feel free to ask us anything there. We have a very welcoming and helpful community.

Installation
------------

No installation is needed! Simply fork this project and, in your own repository, edit the file ``seq.fasta`` (file path: src/PSSpred_v4/seq.fasta), which must be in ``FASTA format``. The `github workflow `_ then produces the outputs in about 8 minutes, and you can download them via the `artifacts link `_. The output contains two result files: ``seq.dat`` (the PSSpred prediction in I-TASSER format) and ``seq.dat.ss`` (the original confidence file). If you want to keep more results, edit the GitHub workflow file `PSSPred.yml `_:

.. image:: https://avatars3.githubusercontent.com/in/15368?s=64&v=4
   :target: https://github.com/features/actions

Github-Actions
^^^^^^^^^^^^^^

.. code:: yaml

   name: PSSpred

   on:
     push:
       branches:
         - master

   jobs:
     build_docs_and_deploy:
       runs-on: ubuntu-latest
       name: running PSSpred
       steps:
       - name: Checkout
         uses: actions/checkout@master
       - name: running perl
         run: |
           echo "Initializing the program....................."
           echo "---------------------------------------------"
           # Create the output directory next to the checkout
           cd ../
           mkdir output
           echo "output file already created!"
           echo "---------------------------------------------"
           # Download and unpack the nr sequence library into src/nr/
           cd PSSpred/
           cd src/
           mkdir nr
           cd nr/
           wget -O nr.tar.gz https://zhanglab.ccmb.med.umich.edu/cgi-bin/download_ftp.cgi?ID=nr.tar.gz
           tar -zxvf nr.tar.gz
           echo "nr.tar.gz already unpacked!"
           echo "Show the path of this file: "
           pwd
           # Run the prediction on the input sequence
           cd ../
           cd PSSpred_v4/
           ./PSSpred.pl seq.fasta
           # Collect the result files for upload
           cp seq.dat /home/runner/work/PSSpred/output/
           cp seq.dat.ss /home/runner/work/PSSpred/output/
           cp blast.out /home/runner/work/PSSpred/output/
           cd /home/runner/work/PSSpred/output/
           ls
           pwd
       - uses: actions/upload-artifact@v2
         with:
           name: output results
           path: /home/runner/work/PSSpred/output/

Not familiar with ``FASTA format``? Don't panic, this project is very user-friendly.
You can simply type a protein sequence such as the following::

   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQLELGAMNKAFRKDIAAKYKELGYQG

into ``seq_1.txt`` and upload it to the directory src/PSSpred_v4/. Wait about 8 minutes (check the Appveyor build status: pending, failing, or passing), then download the `output files `_ when the job is done.

.. image:: https://avatars3.githubusercontent.com/ml/11?s=62&v=4
   :target: https://www.appveyor.com/

Appveyor
^^^^^^^^

.. code:: yaml

   image: Ubuntu

   install:
     - sh: cd src/
     - sh: mkdir nr
     - sh: cd nr/
     - sh: wget -O nr.tar.gz https://zhanglab.ccmb.med.umich.edu/cgi-bin/download_ftp.cgi?ID=nr.tar.gz
     - sh: tar -zxvf nr.tar.gz
     - sh: cd ../PSSpred_v4/
     - sh: ./PSSpred.pl seq_1.txt
     - sh: pwd

   # Skip project specific build phase.
   build: off

   test_script:
     - "ls"
     - "pwd"

   artifacts:
     - path: src\PSSpred_v4\seq.dat
       name: seq.dat
     - path: src\PSSpred_v4\seq.dat.ss
       name: seq.dat.ss
     - path: src\PSSpred_v4\protein.fasta
       name: protein.fasta

If you prefer CircleCI to Appveyor, edit ``seq_2.txt`` (file path: src/PSSpred_v4/seq_2.txt) and commit; you can put in any protein sequence and generate the secondary structure prediction on your own. If you upload an input file with a different name, change ``./PSSpred.pl seq_2.txt`` to ``./PSSpred.pl XXX.txt`` in the following ``config.yml`` file.

.. image:: https://avatars3.githubusercontent.com/ml/7?s=62&v=4
   :target: https://circleci.com/

CircleCI (file path: .circleci/config.yml)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

   version: 2
   jobs:
     build: # name of your job
       machine: # executor type
         image: ubuntu-1604:201903-01 # recommended linux image - includes Ubuntu 16.04, docker 18.09.3, docker-compose 1.23.1
       steps:
         - checkout
         - run: |
             cd src/
             mkdir nr
             cd nr/
             wget -O nr.tar.gz https://zhanglab.ccmb.med.umich.edu/cgi-bin/download_ftp.cgi?ID=nr.tar.gz
             tar -zxvf nr.tar.gz
             echo "nr.tar.gz already unpacked!"
             echo "Show the path of this file:"
             pwd
             cd ../
             cd PSSpred_v4/
             ./PSSpred.pl seq_2.txt
             ls
         - store_artifacts:
             path: src/PSSpred_v4/seq.dat
             destination: seq.dat
         - store_artifacts:
             path: src/PSSpred_v4/seq.dat.ss
             destination: seq.dat.ss
         - store_artifacts:
             path: src/PSSpred_v4/protein.fasta
             destination: protein.fasta

Download
--------

To get the git version, run:

.. code:: sh

   $ git clone https://github.com/nickcafferry/PSSpred.git

Or clone the repository using the official GitHub CLI:

.. code:: sh

   $ gh repo clone nickcafferry/PSSpred

You can also click `here `_ to download PSSpred package version 4, and `v3 `_, `v2 `_, `v1 `_. Also, you can download the whole package by clicking `source code.zip `_ or `source code.tar.gz `_.

Usage
-----

Simply edit the file ``seq.fasta``, ``seq_1.txt`` or ``seq_2.txt``, or upload your own sequence file and change the workflow file (PSSPred.yml, appveyor.yml, config.yml) correspondingly.

About Protein Sequence
^^^^^^^^^^^^^^^^^^^^^^

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:

- lower-case letters are accepted and are mapped into upper-case;
- a single hyphen or dash can be used to represent a gap of indeterminate length;
- in amino acid sequences, U and * are acceptable letters (see below);
- any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

The nucleic acid codes are:

.. code:: text

   A --> adenosine           M --> A C (amino)
   C --> cytidine            S --> G C (strong)
   G --> guanine             W --> A T (weak)
   T --> thymidine           B --> G T C
   U --> uridine             D --> G A T
   R --> G A (purine)        H --> A C T
   Y --> T C (pyrimidine)    V --> G C A
   K --> G T (keto)          N --> A G C T (any)
                             -  gap of indeterminate length

The accepted amino acid codes are:

.. code:: text

   A ALA alanine                         P PRO proline
   B ASX aspartate or asparagine         Q GLN glutamine
   C CYS cysteine                        R ARG arginine
   D ASP aspartate                       S SER serine
   E GLU glutamate                       T THR threonine
   F PHE phenylalanine                   U     selenocysteine
   G GLY glycine                         V VAL valine
   H HIS histidine                       W TRP tryptophan
   I ILE isoleucine                      Y TYR tyrosine
   K LYS lysine                          Z GLX glutamate or glutamine
   L LEU leucine                         X     any
   M MET methionine                      *     translation stop
   N ASN asparagine                      -     gap of indeterminate length

Notes
^^^^^

- ``seq.txt`` is the FASTA file in the current directory (the only input file). If you are familiar with `FASTA format`_, you can always use that format.
- output files::

     seq.dat
     seq.dat.ss

- ``PSSpred.pl`` consists of three steps::

     a. prepare and run PSI-BLAST
     b. prepare mtx, pssm.txt, profw, freqccw, freqccwG
     c. run PSSpred and generate output files

Example input file
^^^^^^^^^^^^^^^^^^

Input file: ``seq_1.txt`` (src/PSSpred_v4/seq_1.txt)

.. code:: text

   MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLS
   EARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAP
   HGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRK
   VLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQEN
   WNTKHSSGVTRELMRELNGG

Snapshot of seq.dat
^^^^^^^^^^^^^^^^^^^

.. code:: text

       1   MET   1   9   # column 1: residue number, in order
       2   GLU   1   9   # column 2: three-letter amino acid code (see `About Protein Sequence` for more details)
       3   SER   1   8   # column 3: secondary structure code: 1 = coil, 2 = helix, 4 = strand
       4   LEU   1   8   # column 4: confidence score: 1-9
       5   VAL   1   8
       6   PRO   1   8
       7   GLY   1   8
       8   PHE   1   7
       9   ASN   1   6
      10   GLU   1   3
      11   LYS   1   1
      12   THR   4   3
      13   HIS   4   6
      14   VAL   4   8
      15   GLN   4   9
      16   LEU   4   9
      17   SER   4   8
      18   LEU   4   6
      19   PRO   4   5
      20   VAL   4   5

Snapshot of seq.dat.ss
^^^^^^^^^^^^^^^^^^^^^^

.. code:: text

    180   coil  helix   beta   # 180: sequence length; then the column order of the three probabilities
      1 M C  0.958  0.024  0.012   # column 1: residue number
      2 E C  0.900  0.043  0.046   # column 2: residue of the input sequence (one-letter code)
      3 S C  0.871  0.072  0.061   # column 3: most probable secondary structure (C = coil, H = helix, E = strand)
      4 L C  0.872  0.064  0.067   # columns 4-6: probabilities of coil, helix and beta (strand)
      5 V C  0.891  0.053  0.062
      6 P C  0.902  0.042  0.061
      7 G C  0.886  0.046  0.070
      8 F C  0.808  0.086  0.096
      9 N C  0.715  0.124  0.154
     10 E C  0.620  0.124  0.272
     11 K C  0.546  0.053  0.416
     12 T E  0.364  0.013  0.636
     13 H E  0.220  0.007  0.782
     14 V E  0.105  0.005  0.902
     15 Q E  0.069  0.004  0.936
     16 L E  0.076  0.005  0.928
     17 S E  0.112  0.005  0.895
     18 L E  0.204  0.005  0.800
     19 P E  0.230  0.008  0.760
     20 V E  0.229  0.012  0.760

FASTA format
------------

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

An example sequence in FASTA format is:

.. code:: text

   >gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
   MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
   AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
   QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
   LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK

Contributing
------------

This project welcomes contributions and suggestions. Most contributions require you to agree to an `MIT LICENCE `_ (MIT LIC) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit `Code of Conduct `_.

Reference
---------

Renxiang Yan, Dong Xu, Jianyi Yang, Sara Walker, Yang Zhang. `A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction `_. Scientific Reports, 3: 2619 (2013).
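
Reading seq.dat.ss programmatically
-----------------------------------

The ``seq.dat.ss`` layout shown in the snapshot can be read with a few lines of Python. The following is a minimal sketch and not part of PSSpred: the names ``Residue``, ``parse_ss``, ``most_probable`` and ``ss_string`` are hypothetical, and it assumes the file has exactly the layout of the snapshot (one header line, then one row per residue), treating anything after ``#`` as an annotation.

```python
from typing import List, NamedTuple, Tuple


class Residue(NamedTuple):
    index: int      # residue number (column 1)
    aa: str         # one-letter amino acid code (column 2)
    state: str      # predicted state as printed: C, H or E (column 3)
    p_coil: float   # columns 4-6, in the order given by the header
    p_helix: float
    p_beta: float


def parse_ss(text: str) -> Tuple[int, List[Residue]]:
    """Return (declared sequence length, residues) from seq.dat.ss text."""
    # Assumption: "#" starts an annotation, as in the README snapshot.
    rows = [ln.split("#", 1)[0].split() for ln in text.splitlines()]
    rows = [fields for fields in rows if fields]  # drop empty lines
    header, body = rows[0], rows[1:]
    if header[1:] != ["coil", "helix", "beta"]:
        raise ValueError(f"unexpected header: {header}")
    residues = [
        Residue(int(f[0]), f[1], f[2], float(f[3]), float(f[4]), float(f[5]))
        for f in body
    ]
    return int(header[0]), residues


def most_probable(r: Residue) -> str:
    """Recompute the state letter from the three probabilities."""
    probs = (r.p_coil, r.p_helix, r.p_beta)
    return "CHE"[probs.index(max(probs))]


def ss_string(residues: List[Residue]) -> str:
    """Concatenate the per-residue states into one string."""
    return "".join(r.state for r in residues)


# Four rows copied from the snapshot (residues 10-13), with its header.
SAMPLE = """\
180 coil helix beta
10 E C 0.620 0.124 0.272
11 K C 0.546 0.053 0.416
12 T E 0.364 0.013 0.636
13 H E 0.220 0.007 0.782
"""

if __name__ == "__main__":
    length, residues = parse_ss(SAMPLE)
    print(length, ss_string(residues))
    for r in residues:
        print(r.index, r.aa, r.state, most_probable(r))
```

On the four sample rows, ``ss_string`` returns ``CCEE``, and the state recomputed by ``most_probable`` agrees with the printed third column. To try it on a real run, pass in the text of the downloaded file, for example ``parse_ss(open("seq.dat.ss").read())``.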