# genome-analysis

**Repository Path**: biozzc/genome_analysis

## Basic Information

- **Project Name**: genome-analysis
- **Description**: For genome assembly, annotation, and comparative genomics
- **Primary Language**: Python
- **License**: GPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-09-15
- **Last Updated**: 2025-06-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Genome

## README

# Genome analysis tools

Here I record tools for genome assembly, annotation, and comparative genomics analysis that I have used or collected during my own projects. Some useful scripts will also be added here later.

## Genome assembly
Tools for short-read assembly:
- Trinity (k-mer ≤ 32)
```
# Trinity default k-mer 25, maximum 32
$ Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G 
```

- SPAdes (Metagenome)
```
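# -k: comma-separated list of k-mer sizes (odd values, each below 128)
# -1 / -2: files with forward / reverse paired-end reads
# -o: output directory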
# --rna: this flag is required for RNA-Seq data
# --plasmid: runs plasmidSPAdes pipeline for plasmid detection
$ /bin/spades.py --rna -k 21,27,33,55,77,99,127 -1 R1.fastq -2 R2.fastq -o spades_out_dir/
```

- SOAPdenovo (developed by BGI, the Beijing Genomics Institute)
- Velvet (de Bruijn graph assembler)
```
$ velveth output 31,37,2 -shortPaired -fasta -separate left.fa right.fa
# With a k-mer range (31,37,2), velveth writes one directory per k-mer size, e.g. output_31/
$ velvetg output_31/ -min_contig_lgth 500 \
  -exp_cov auto -cov_cutoff auto \
  -clean yes -scaffolding yes -amos_file yes
```
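
The k-mer lists passed to the assemblers above are easy to get wrong. As a minimal sketch, the helper below checks a comma-separated list against the constraints SPAdes documents for `-k` (odd values, below 128, ascending order). The function name `check_kmer_list` is hypothetical and not part of any assembler; `max_k` is an assumption to adjust for other tools.

```python
def check_kmer_list(spec, max_k=127):
    """Validate a comma-separated k-mer list such as "21,33,55".

    Returns the k values as a list of integers, or raises ValueError.
    The limits mirror the SPAdes `-k` rules; max_k is an assumption
    that may differ for other assemblers.
    """
    ks = [int(tok) for tok in spec.split(",")]
    for k in ks:
        if k < 1:
            raise ValueError(f"k-mer size {k} must be positive")
        if k % 2 == 0:
            raise ValueError(f"k-mer size {k} is even; odd values are required")
        if k > max_k:
            raise ValueError(f"k-mer size {k} exceeds the maximum of {max_k}")
    if ks != sorted(set(ks)):
        raise ValueError("k-mer sizes must be unique and in ascending order")
    return ks


if __name__ == "__main__":
    # The list used in the SPAdes command above.
    print(check_kmer_list("21,27,33,55,77,99,127"))
```

Running such a check before submitting a long assembly job catches a mistyped list early instead of after the job has queued.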

Tools for PacBio long-read assembly:
- **MECAT2** (an ultra-fast and accurate mapping, error correction, and de novo assembly tool for single-molecule sequencing (SMRT) reads)*
```
$ mecat.pl config ecoli_config_file.txt
$ vi ecoli_config_file.txt
$ mecat.pl correct ecoli_config_file.txt
$ mecat.pl trim ecoli_config_file.txt
$ mecat.pl assemble ecoli_config_file.txt
# Run Pilon to polish local base-level errors
$ java -jar pilon-1.23.jar --genome reference_genome.fasta --bam mecat2_mapped_Illumina_sorted.bam --fix all --output pilon_corrected
```
- **FALCON** (PacBio Assembly Tool Suite: Reads in ⇨ Assembly out)*
- **[Flye](https://github.com/fenderglass/Flye/)** (using repeat graph as a core data structure)
- **[Canu](https://github.com/marbl/canu)** (no correction step for PacBio HiFi reads; use HiCanu via `-pacbio-hifi`)

Tools for PacBio HiFi reads assembly:
- hifiasm
- canu -pacbio-hifi
- LJA
- IPA

Tools for Nanopore long-read assembly:
- **NECAT** (an error correction and de novo assembly tool for noisy Nanopore long reads)*
```
$ necat.pl config ecoli_config.txt
$ necat.pl correct ecoli_config.txt
$ necat.pl assemble ecoli_config.txt
$ necat.pl bridge ecoli_config.txt
```
- **Shasta** (rapidly produces accurate assemblies from DNA reads generated by [Oxford Nanopore](https://nanoporetech.com) flow cells)
- **NextDenovo**

Tools for mixed assembly:
- MaSuRCA - whole-genome assembly software that combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. It can assemble data sets containing only Illumina short reads or a mixture of short reads and long reads (Sanger, 454, PacBio, and Nanopore).


## Post-assembly polish
Tools for correcting sequencing errors after assembly:
- **Pilon** (polish using Illumina short reads, three rounds of Pilon)*
- **Racon** (polish using PacBio long reads, two rounds of Racon)
- **NextPolish** (polish using short reads)
- **Medaka** (polish Nanopore assemblies using long reads, after at least one round of Racon)
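
Iterative polishing feeds the output of one round into the next. The sketch below only builds the Pilon command lines for several rounds, reusing the flags from the MECAT2 example above; it does not run anything. The read-mapping step that must happen before each round is left out, and the BAM file names, the round prefixes, and the helper `pilon_rounds` are all hypothetical.

```python
def pilon_rounds(assembly, rounds=3, jar="pilon-1.23.jar"):
    """Return one Pilon command (as an argument list) per polishing round.

    Pilon writes <output prefix>.fasta, so round N takes the FASTA
    produced by round N-1 as its --genome input.
    The BAM name per round is a placeholder: the short reads have to be
    re-mapped to the current assembly before each round.
    """
    commands = []
    current = assembly
    for i in range(1, rounds + 1):
        prefix = f"pilon_round{i}"
        commands.append([
            "java", "-jar", jar,
            "--genome", current,
            "--bam", f"round{i}_sorted.bam",  # hypothetical file name
            "--fix", "all",
            "--output", prefix,
        ])
        current = f"{prefix}.fasta"  # input for the next round
    return commands


if __name__ == "__main__":
    for cmd in pilon_rounds("assembly.fasta"):
        print(" ".join(cmd))
```

Returning argument lists rather than one shell string keeps the commands safe to pass to `subprocess.run` later without extra quoting.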

Tools for assembling heterozygous/polymorphic genomes:
- **redundans** (return scaffolded homozygous genome assembly)
- **purge_haplotigs** (for heterozygous diploid genome assemblies, e.g. those assembled with FALCON or FALCON-unzip)
- **purge_dups** (remove haplotigs and contig overlaps in a de novo assembly based on read depth)
- **HaploMerger2** (reconstruct a highly-polymorphic diploid assembly, and then output two separated haploid sub-assemblies)
- NOVOheter
- Platanus
- MSR-CA
- Platanus-allee
- Canu (`batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50`)
- MECAT2
- MaSuRCA
- FALCON
- Flye
- wtdbg2

## Advanced assembly

- **npScarf**: scaffolds and completes draft genome assemblies in real time with Oxford Nanopore sequencing.
- **SSPACE**: SSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data.
- **bcgsc / LINKS**: LINKS is a genomics application for scaffolding genome assemblies with long reads, such as those produced by Oxford Nanopore Technologies Ltd. It can be used to scaffold high-quality draft genome assemblies with any long sequences (e.g., ONT reads, PacBio reads, other draft genomes). It is also used to scaffold contig pairs linked by ARCS/ARKS.
- **bcgsc / ntJoin**: Scaffolding draft assemblies using reference assemblies and minimizer graphs
- **quickmerge**: A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
- **FinisherSC**: a repeat-aware and scalable tool for upgrading de novo assembly using long reads. Experiments with real data suggest that FinisherSC can provide longer and higher quality contigs than existing tools while maintaining high concordance.

### Chromosome-level assembly
Chromosome-level assembly using Hi-C data:
- [SALSA](https://github.com/marbl/SALSA) (scaffold long read assemblies with Hi-C data, GFA file)
- [HiCAssembler](https://github.com/maxplanck-ie/HiCAssembler) (assemble scaffolds into chromosomes using Hi-C data)
- [3D-DNA](https://github.com/theaidenlab/3d-dna) (3D-DNA pipeline, better with Juicebox tools)
- [ALLHiC](https://github.com/tangerzhang/ALLHiC) (phasing and scaffolding polyploid genomes based on Hi-C data)
- [LACHESIS](https://github.com/shendurelab/LACHESIS) (de novo assemblies using the locations of the Hi-C reads)

Chromosome-level assembly using optical mapping:
- [Bionano](https://bionanogenomics.com/support/software-downloads/) (web server for experiment and data analysis of Bionano data)
- [OMSV](http://yiplab.cse.cuhk.edu.hk/omsv/) (identify large structural variations from nanochannel optical maps)
- [OMTools](https://github.com/TF-Chan-Lab/OMTools) (optical mapping data processing, analysis and visualization)

Chromosome-level assembly based on a previous or homologous genome:
*   [chromosomer](https://github.com/gtamazian/chromosomer) (a reference-assisted assembly for draft chromosome)
*   [RaGoo](https://github.com/malonge/RaGOO) (fast reference-guided scaffolding of genome assembly contigs)
*   [Ragout](https://github.com/fenderglass/Ragout) (chromosome-level scaffolding using multiple references)

Chromosome-level assembly integrating genetic, optical, and Hi-C maps:
- **[ALLMAPS](https://github.com/tanghaibao/jcvi/) (robust scaffold ordering based on multiple maps)**

## Quality assessment
Tools for assessing the completeness of the genome assembly:
- [BUSCO](https://busco-archive.ezlab.org/v2/) (assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs)
- [Transrate](https://github.com/Blahah/transrate) (give you a score which represents proportion of input reads that provide positive support for the assembly)
- [CEGMA](http://korflab.ucdavis.edu/datasets/cegma/) (Core Eukaryotic Genes Mapping Approach, 458 core proteins are highly conserved)
- LTRharvest / LTR_retriever - LAI (LTR Assembly Index, included in the LTR_retriever package)

Tools for inferring the properties of a genome from unassembled sequencing data:
- [GenomeScope](https://github.com/schatzlab/genomescope) (infer the global properties of a genome from unassembled short reads)
- [smudgeplot](https://github.com/KamilSJaron/smudgeplot) (inference of ploidy and heterozygosity structure using whole genome sequencing data)
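
GenomeScope works from a k-mer frequency histogram of the raw reads, normally produced by a dedicated k-mer counter such as Jellyfish. To illustrate what that histogram contains, here is a minimal pure-Python sketch that counts k-mers in a few toy reads and tabulates how many distinct k-mers occur at each frequency. It ignores reverse complements and is far too slow for real data; the function names and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across the reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map frequency -> number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "CGTACGTA", "ACGTNCGT"]
    hist = kmer_histogram(count_kmers(toy_reads, k=4))
    # Two columns (frequency, number of distinct k-mers), one row per frequency.
    for freq, n_kmers in hist.items():
        print(freq, n_kmers)
```

In real data the main peak of this histogram sits near the sequencing coverage, and heterozygous genomes show an additional peak at roughly half that frequency, which is the signal these tools model.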

## Genome annotation
Tools for annotating repetitive sequences:
*   [RepeatModeler](https://github.com/Dfam-consortium/RepeatModeler) (a *de novo* transposable element (TE) family identification and modeling package that combines RECON, RepeatScout, and LTRharvest/LTR_retriever)
*   [RepeatMasker](https://github.com/rmhubley/RepeatMasker) (a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences)
*   [EDTA](https://github.com/oushujun/EDTA) (The Extensive *de novo* TE Annotator: automated whole-genome *de novo* TE annotation and benchmarking of the annotation performance of TE libraries)
*   [LTRpred](https://hajkd.github.io/LTRpred/) (It focuses particularly on [LTR retrotransposons](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC463057/) and aims to annotate only functional and potentially mobile elements)
*   [LTR_retriever](https://github.com/oushujun/LTR_retriever) (LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (**LAI**) is also included in this package)

Tools for annotating gene structure:

- [AUGUSTUS](https://github.com/Gaius-Augustus/Augustus) (can be self-trained for rare species, and can also incorporate hints on the gene structure from extrinsic sources such as EST, MS/MS, protein alignments, and syntenic genomic alignments)
- SNAP
- GeneMark-ES/ET (for *de novo* eukaryotic genomes)
- Exonerate
- TransDecoder

Some **integrated pipelines** for gene structure annotation:
- [MAKER/MAKER-P](http://www.yandell-lab.org/software/maker.html) (*de novo* annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or just to combine annotations, evidence, and quality control statistics for use with other GMOD programs) | [MAKER current protocols](http://www.yandell-lab.org/publications/pdf/maker_current_protocols.pdf)
```
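# Generate the MAKER control files, including maker_opts.ctl
$ maker -CTL
# Edit maker_opts.ctl to set the genome and evidence files
$ vi maker_opts.ctl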
```
- [BRAKER2](https://github.com/Gaius-Augustus/BRAKER) (a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET and AUGUSTUS)
- [Funannotate](https://github.com/nextgenusfs/funannotate) (a pipeline for genome annotation built specifically for fungi, but will also work with higher eukaryotes)
- [FunGAP](https://github.com/biogeeker/FunGAP) (FunGAP uses three gene prediction tools: Augustus, Braker, and Maker. The outcomes of predictions are stored in GFF3 and FASTA files for the next set of evidence score calculations)
```
$ fungap.py -g <genome_assembly> -12UA <trans_read_files> -o <output_dir> -a <augustus_species> -b <busco_dataset> -s <sister_proteome>
```
- [EVM](https://github.com/EVidenceModeler/EVidenceModeler) (framework for combining diverse evidence types into a single automated gene structure annotation system)
```
# weights.txt
ABINITIO_PREDICTION      augustus       1
ABINITIO_PREDICTION      twinscan       1
ABINITIO_PREDICTION      glimmerHMM      1
PROTEIN         spliced_protein_alignments 1
PROTEIN         genewise_protein_alignments   5
TRANSCRIPT      spliced_transcript_alignments      1
TRANSCRIPT      PASA_transcript_assemblies      10
```
- [PASA](https://github.com/PASApipeline/PASApipeline) (exploits spliced alignments of expressed transcript sequences to automatically model gene structures. Note: my installation attempt failed; it needs MySQL)
- [LoReAn](https://github.com/lfaino/LoReAn) (the key improvement is the incorporation of single-molecule cDNA sequencing data and reduces the requirements for time-consuming manual annotation)
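
The EVM `weights.txt` example above has three whitespace-separated columns: evidence class, evidence source name, and an integer weight. A small sketch for loading and sanity-checking such a file before a run could look like the following. The function `parse_evm_weights` is hypothetical and not part of EVM, and skipping lines that start with `#` is a convenience of this sketch rather than documented EVM behaviour.

```python
def parse_evm_weights(text):
    """Parse EVM-style weights into {(evidence_class, source): weight}.

    Blank lines and lines starting with '#' are skipped (an assumption of
    this sketch). Every other line must have exactly three
    whitespace-separated fields, the last one an integer.
    """
    weights = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        ev_class, source, weight = fields
        weights[(ev_class, source)] = int(weight)  # ValueError if not an integer
    return weights


# A shortened copy of the example weights file above.
EXAMPLE = """\
# weights.txt
ABINITIO_PREDICTION      augustus       1
PROTEIN         genewise_protein_alignments   5
TRANSCRIPT      PASA_transcript_assemblies      10
"""

if __name__ == "__main__":
    for (ev_class, source), weight in parse_evm_weights(EXAMPLE).items():
        print(ev_class, source, weight)
```

Keying the result by `(class, source)` makes it easy to confirm that every source name appearing in the evidence GFF3 files has a matching weight.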

Tools for **Nanopore or PacBio** gene structure annotation:

*   [FLAIR](https://github.com/BrooksLabUCSC/flair) (Full-Length Alternative Isoform analysis of RNA: correction, isoform definition, and alternative splicing analysis of noisy reads; primarily used for nanopore cDNA, native RNA, and PacBio sequencing reads)
*   [SQANTI](https://bitbucket.org/ConesaLab/sqanti/src/master/) (Structural and Quality Annotation of Novel Transcript Isoforms)
*   [Pinfish](https://github.com/nanoporetech/pinfish)/[pipeline-nanopore-ref-isoforms](https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms) (Pipeline for annotating genomes using long read transcriptomics data with stringtie and other tools)

Tools for aligning long genomic sequences (whole-genome alignment):
- LAST
- [LASTZ](https://www.geneious.com/plugins/lastz-plugin/) (Large-Scale Genome Alignment Tool, a fast and powerful tool for pairwise alignment of genomic DNA sequences)
- MUMmer - nucmer
- MashMap - align with default parameters, then plot the syntenic regions with `generateDotPlot.pl`
- [minimap2](https://github.com/lh3/minimap2) (a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database)*
- [GMAP/GSNAP](http://research-pub.gene.com/gmap/) (Genomic Mapping and Alignment Program for mRNA and EST Sequences, Genomic Short-read Nucleotide Alignment Program)
```
$ gmap_build -d ps2019 Psojae2019.fasta
$ gmap -t 20 -d ps2019 -f gff3_gene cds.fa > cds_gene.gff3
```

- D-Genies Website
- Mauve GUI
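
The `gmap -f gff3_gene` command above writes GFF3, a nine-column tab-separated format. As a quick check of how many features of each type a run produced, a minimal sketch follows; the function name and the sample records are made up for illustration and are not real GMAP output.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count GFF3 features by the value in column 3 (feature type)."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[cols[2]] += 1
    return counts


# Hypothetical records, built with explicit tabs so the columns are unambiguous.
SAMPLE = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", "example", "gene", "1", "900", ".", "+", ".", "ID=g1"]),
    "\t".join(["chr1", "example", "mRNA", "1", "900", ".", "+", ".", "ID=g1.t1;Parent=g1"]),
    "\t".join(["chr1", "example", "CDS", "1", "900", ".", "+", "0", "Parent=g1.t1"]),
])

if __name__ == "__main__":
    for feature_type, n in count_feature_types(SAMPLE).items():
        print(feature_type, n)
```

For a real file, the text could be read with `open("cds_gene.gff3").read()` and passed to the same function.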


## Comparative genomics

- **[JCVI](https://github.com/tanghaibao/jcvi)** (Python library to facilitate genome assembly, annotation, and comparative genomics)*
- [wgd](https://github.com/arzwa/wgd) (Python package and CLI for whole-genome duplication related analyses)
- WGDI


## Synteny plot
- JCVI
- shinysyn
- TBtools
- RectChr

## Population structure
*   STRUCTURE (link unknown)
*   [Admixture](http://software.genetics.ucla.edu/admixture/) (link expired)
*   [lostruct](https://github.com/petrelharp/local_pca) (local PCA for population structure; shows how the effect of population structure differs along the genome)

## Ancestral genome reconstruction
- AGORA
- MGRA
- inferCAR

## Ancestral sequence inference
- TreeTime
- MEGA


## Reference
[1]. [xuzhougeng](https://github.com/xuzhougeng) / **[Genomic_tools](https://github.com/xuzhougeng/Genomic_tools)**

--- ***End*** ---