# ProtHint **Repository Path**: CHANyp/ProtHint ## Basic Information - **Project Name**: ProtHint - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-10-20 - **Last Updated**: 2022-10-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # ProtHint Tomas Bruna, Alexandre Lomsadze, Mark Borodovsky Georgia Institute of Technology, Atlanta, Georgia, USA Reference: [GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins](https://academic.oup.com/nargab/article/2/2/lqaa026/5836691) # Overview ProtHint is a pipeline for predicting and scoring hints (in the form of introns, start and stop codons) in the genome of interest by mapping and spliced aligning predicted genes to a database of reference protein sequences. # Contents * [Installation](#installation) * [Perl dependencies](#perl-dependencies) * [Python dependencies](#python-dependencies) * [GeneMark-ES](#genemark-es) * [DIAMOND](#diamond) * [Spaln](#spaln) * [Spaln boundary scorer](#spaln-boundary-scorer) * [Usage](#usage) * [Input](#input) * [Protein Database Preparation](#protein-database-preparation) * [Running ProtHint](#running-prothint) * [Output](#output) * [About](#about) # Installation To install, copy the content of this distribution to desired location. To verify the installation, run ProtHint with the sample input located in the [example](example) folder. Running ProtHint requires a Linux system with Bash. The following dependencies need to be satisfied. ### Perl dependencies Perl 5.10 or higher is required. The following non-Core Perl modules are required: * `MCE::Mutex` * `threads` * `YAML` * `Math::Utils` Core module `Thread::Queue` needs to be updated to a version `3.11` or higher. These modules are available at CPAN and can be installed/updated with cpan MCE::Mutex threads YAML Thread::Queue Math::Utils ### Python dependencies Python 3.3 or higher is required. No libraries outside of the Python Standard Library are required. ### GeneMark-ES There are two ways of using GeneMark-ES in ProtHint: 1. Run ProtHint with `--geneMarkGtf genemark.gtf` option which specifies the path to a file with GeneMark-ES predictions. If this option is used, GeneMark-ES does not need to be installed as a part of ProtHint. 2. Install GeneMark-ES, ProtHint will run it automatically as a part of the pipeline. Download and extract the contents of the GeneMark-ES suite (versions 4.30 and up) into the `ProtHint/dependencies/GeneMarkES` folder. GeneMark-ES suite is available at http://exon.gatech.edu/GeneMark/license_download.cgi To verify that GeneMark-ES is installed correctly, run the following command: `ProtHint/dependencies/GeneMarkES/check_install.bash`. ### DIAMOND DIAMOND local sequence aligner (available at https://github.com/bbuchfink/diamond) is included in this distribution package. In case the included version is not working, install DIAMOND from https://github.com/bbuchfink/diamond and replace the `diamond` binary in ProtHint/dependencies folder. ### Spaln Spaln, space-efficient spliced alignment program (available at https://github.com/ogotoh/spaln), is included in this distribution package. In case the included version is not working, install Spaln from https://github.com/ogotoh/spaln and replace the `spaln` binary in ProtHint/dependencies folder. #### Spaln boundary scorer Binary for parsing and scoring hints from Spaln's alignment output is included in this distribution package. In case the included binary is not working, compile it from source at https://github.com/gatech-genemark/spaln-boundary-scorer and replace the `spaln_boundary_scorer` binary in ProtHint/dependencies folder. # Usage ## Input ProtHint inputs consist of: * Genomic sequence from the target species in multi-FASTA format * Reference protein sequences in multi-FASTA format The tool is applicable to complete as well as draft genome assemblies. Every sequence in each multi-FASTA input needs to have a unique ID (first word of a FASTA header is used for ID). Examples of valid FASTA headers: >contig10 ID: contig10 > seq3 genome Z ID: seq3 >IV contig 25 ID: IV ## Protein Database Preparation We recommend to use a relevant portion of OrthoDB protein database as the source of reference protein sequences. For example, if your genome of interest is an insect, download arthropoda proteins: wget https://v100.orthodb.org/download/odb10_arthropoda_fasta.tar.gz tar xvf odb10_arthropoda_fasta.tar.gz and concatenate proteins from all species into a single file: cat arthropoda/Rawdata/* > proteins.fasta For other genomes of interest, you can select the most specific OrthoDB section from the list below and repeat the procedure desribed above. * **Fungi**: https://v100.orthodb.org/download/odb10_fungi_fasta.tar.gz * **Metazoa**: https://v100.orthodb.org/download/odb10_metazoa_fasta.tar.gz * **Arthropoda**: https://v100.orthodb.org/download/odb10_arthropoda_fasta.tar.gz * **Vertebrata**: https://v100.orthodb.org/download/odb10_vertebrata_fasta.tar.gz * **Protozoa**: https://v100.orthodb.org/download/odb10_protozoa_fasta.tar.gz * **Viridiplantae**: https://v100.orthodb.org/download/odb10_plants_fasta.tar.gz ## Running ProtHint To run ProtHint, use the following command: prothint.py genome.fasta proteins.fasta See the [example](example) folder for a sample input and output. To display a list of all available options, use: prothint.py --help Frequently used options are: --workdir WORKDIR Folder for results and temporary files. If not specified, current directory is used --geneMarkGtf GENEMARKGTF File with GeneMark-ES predictions in gtf format. If this file is provided, GeneMark-ES run is skipped. --diamondPairs DIAMONDPAIRS File with "seed gene-protein" hits generated by DIAMOND. If this file is provided, DIAMOND search for protein hits is skipped. ## Output ProtHint generates two main outputs: * `prothint.gff` Gff file with all reported hints (introns, starts and stops) * `evidence.gff` High confidence subset of `prothint.gff` which is, for instance, suitable for the GeneMark-EP Plus mode. This set is generated using default thresholds in `ProtHint/bin/print_high_confidence.py` script. If you wish to use different filtering criteria, re-run `print_high_confidence.py` script with custom thresholds. An output which is ready to be used in [BRAKER](https://github.com/Gaius-Augustus/BRAKER) and [AUGUSTUS](https://github.com/Gaius-Augustus/Augustus) is also generated: * `prothint_augustus.gff` # About ProtHint is developed by Tomas Bruna and Alexandre Lomsadze at [Dr. Mark Borodovsky's Bioinformatics Lab](http://exon.gatech.edu/GeneMark/), Georgia Institute of Technology, Atlanta, USA.