# wengan

**Repository Path**: xiekunwhy/wengan

## Basic Information

- **Project Name**: wengan
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: AGPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-03-15
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[![HitCount](http://hits.dwyl.io/adigenova/wengan.svg)](http://hits.dwyl.io/adigenova/wengan) [![GitHub Downloads](https://img.shields.io/github/downloads/adigenova/wengan/total.svg?style=social&logo=github&label=Download)](https://github.com/adigenova/wengan/releases)

# Wengan

An accurate and ultra-fast genome assembler

Table of Contents
=================

* [SYNOPSIS](#synopsis)
* [Description](#description)
* [Short-read assembly](#short-read-assembly)
  * [WenganM [M]](#wenganm-m)
  * [WenganA [A]](#wengana-a)
  * [WenganD [D]](#wengand-d)
* [Long-read presets](#long-read-presets)
  * [ontlon](#ontlon)
  * [ontraw](#ontraw)
  * [pacraw](#pacraw)
  * [pacccs (experimental)](#pacccs-experimental)
* [Wengan demo](#wengan-demo)
* [Wengan benchmark](#wengan-benchmark)
* [Wengan components](#wengan-components)
* [Getting the latest source code](#getting-the-latest-source-code)
  * [Instructions](#instructions)
    * [Building Wengan from source](#building-wengan-from-source)
      * [Requirements](#requirements)
* [Limitations](#limitations)
* [About the name](#about-the-name)
* [Citation](#citation)

# SYNOPSIS

    # Assembling Oxford Nanopore and Illumina reads with WenganM
    wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

    # Assembling PacBio reads and Illumina reads with WenganA
    wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

    # Assembling ultra-long Nanopore reads and BGI reads with WenganM
    wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

    # Non-hybrid assembly of PacBio Circular Consensus Sequence data with WenganM
    wengan.pl -x pacccs -a M -l ccs.fastq.gz -p asm4 -t 20 -g 3000

    # Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, ~600Gb of RAM)
    wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

    # Assembling raw PacBio reads with pre-assembled short-read contigs from Minia3
    wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

    # Assembling raw PacBio reads with pre-assembled short-read contigs from Abyss
    wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

    # Assembling raw PacBio reads with pre-assembled short-read contigs from DiscovarDenovo
    wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

# Description

**Wengan** is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind **Wengan** is that long-read alignments can be **inferred by building paths** on a sequence graph. To achieve this, **Wengan** builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges.

Another distinct feature of **Wengan** is that it performs **self-validation** by following the read information. **Wengan** identifies misassemblies at different steps of the assembly process.

For more information about the algorithmic ideas behind **Wengan**, please read the preprint available on bioRxiv.

# Short-read assembly

**Wengan** uses a de Bruijn graph assembler to build the assembly backbone from short-read data.
Currently, **Wengan** can use **Minia3**, **Abyss2** or **DiscovarDenovo**. The recommended short-read coverage is **50-60X** of 2 x 150bp or 2 x 250bp reads.

## WenganM \[M\]

This **Wengan** mode uses the **Minia3** short-read assembler. This is the fastest mode of **Wengan** and can assemble a complete human genome in less than 210 CPU hours (~50Gb of RAM).

## WenganA \[A\]

This **Wengan** mode uses the **Abyss2** short-read assembler. This is the lowest-memory mode of **Wengan** and can assemble a complete human genome in less than 40Gb of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

## WenganD \[D\]

This **Wengan** mode uses the **DiscovarDenovo** short-read assembler. This is the most memory-hungry mode of **Wengan**: assembling a complete human genome needs about 600Gb of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

# Long-read presets

The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.

## ontlon

Preset for raw ultra-long reads from Oxford Nanopore, typically with an N50 > 50kb.

## ontraw

Preset for raw Nanopore reads, typically with an N50 ~\[15kb-40kb\].

## pacraw

Preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 ~\[8kb-60kb\].

## pacccs (experimental)

Preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 ~\[15kb\]. This type of data is not fully supported yet.

# Wengan demo

The repository [wengan_demo](https://github.com/adigenova/wengan_demo) contains a small dataset and instructions to test [Wengan v0.1](https://github.com/adigenova/wengan/releases/tag/0.1).
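
When moving from the demo to your own data, it can help to check read sets against the coverage recommendations given in the sections above (50-60X for short reads, 30X for long reads). The following is a minimal Python sketch of such a check and is not part of Wengan. The helper names `total_bases` and `coverage` are hypothetical, and the code assumes standard 4-line FASTQ records (no wrapped sequence lines).

```python
import gzip


def total_bases(fastq_path):
    """Sum the sequence lengths in a FASTQ file (plain or gzip-compressed).

    Assumes 4-line FASTQ records, so the sequence is the 2nd line of each record.
    """
    opener = gzip.open if fastq_path.endswith(".gz") else open
    bases = 0
    with opener(fastq_path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:
                bases += len(line.strip())
    return bases


def coverage(fastq_paths, genome_size_bp):
    """Estimate coverage as total sequenced bases divided by genome size."""
    return sum(total_bases(path) for path in fastq_paths) / genome_size_bp
```

For example, `coverage(["lib1.fwd.fastq.gz", "lib1.rev.fastq.gz"], 3_000_000_000)` would estimate the short-read coverage of a paired-end library, assuming a genome of about 3 Gb; the result can then be compared with the recommended range.
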

```
#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git
```

# Wengan benchmark

| Genome | Long reads | Short reads | Wengan mode | NG50 (Mb) | CPU (h) | RAM (Gb) | Fasta file |
|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| NA12878 | ONT 35X ([rel5][rel5]) | 2x150bp 50X (GIAB: [rs1][g1], [rs2][g2]) | WenganA | 23.08 | 671 | 45 | [asm][NA12878.WenganA.ONT-ul-rel5.fa.gz] |
| NA12878 | ONT 35X ([rel5][rel5]) | 2x150bp 50X (GIAB: [rs1][g1], [rs2][g2]) | WenganM | 16.67 | 185 | 53 | [asm][NA12878.WenganM.ONT-ul-rel5.fa.gz] |
| NA12878 | ONT 35X ([rel5][rel5]) | 2x250bp 60X (ENA: [rs1][wdsna1], [rs2][wdsna2]) | WenganD | 33.13 | 550 | 622 | [asm][NA12878.WenganD.ONT-ul-rel5.fa.gz] |
| HG00733 | PAC 90X (ENA: [rl1][ena]) | 2x250bp 63X (ENA: [rs1][wdhg1], [rs2][wdhg2]) | WenganD | 29.2 | 800 | 644 | [asm][HG00733.WenganD.PAC-SequelI.fa.gz] |
| NA24385 | ONT 60X (GIAB: [rl1][giab]) | 2x250bp 70X (GIAB: [rs1][g3]) | WenganD | 48.8 | 910 | 650 | [asm][NA24385.WenganD.ONT-ul-final.fa.gz] |
| CHM13 | ONT 50X (T2T: [rel2][t2t]) | 2x250bp 66X (ENA: [rs1][wdch1], [rs2][wdch2]) | WenganD | 57.4 | 1027 | 647 | [asm][CHM13.WenganD.ONT-T2T-rel2.fa.gz] |

[rel5]: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel5.md
[t2t]: https://github.com/nanopore-wgs-consortium/CHM13
[ena]: https://www.ebi.ac.uk/ena/data/view/SRX4480530
[giab]: https://cutt.ly/BekQW1l
[wdsna1]: https://www.ebi.ac.uk/ena/data/view/SRR891258
[wdsna2]: https://www.ebi.ac.uk/ena/data/view/SRR891259
[wdch1]: https://www.ebi.ac.uk/ena/data/view/SRR3189742
[wdch2]: https://www.ebi.ac.uk/ena/data/view/SRR3189741
[wdhg1]: https://www.ebi.ac.uk/ena/data/view/SRR5534476
[wdhg2]: https://www.ebi.ac.uk/ena/data/view/SRR5534475
[g1]: https://cutt.ly/iekQmOn
[g2]: https://cutt.ly/EekQWrz
[g3]: https://cutt.ly/lekQWmZ
[NA12878.WenganA.ONT-ul-rel5.fa.gz]: https://zenodo.org/record/2598666/files/NA12878.WenganA.ONT-ul-rel5.fa.gz?download=1
[NA12878.WenganM.ONT-ul-rel5.fa.gz]: https://zenodo.org/record/2598666/files/NA12878.WenganM.ONT-ul-rel5.fa.gz?download=1
[NA12878.WenganD.ONT-ul-rel5.fa.gz]: https://zenodo.org/record/2598666/files/NA12878.WenganD.ONT-ul-rel5.fa.gz?download=1
[CHM13.WenganD.ONT-T2T-rel2.fa.gz]: https://zenodo.org/record/2598666/files/CHM13.WenganD.ONT-T2T-rel2.fa.gz?download=1
[NA24385.WenganD.ONT-ul-final.fa.gz]: https://zenodo.org/record/2598666/files/NA24385.WenganD.ONT-ul-final.fa.gz?download=1
[HG00733.WenganD.PAC-SequelI.fa.gz]: https://zenodo.org/record/2598666/files/HG00733.WenganD.PAC-SequelI.fa.gz?download=1

The assemblies generated using Wengan can be downloaded from [Zenodo](https://zenodo.org/record/2598666).

All the assemblies were run as described in the Wengan preprint. NG50 was computed using a genome size of 3.14Gb.

# Wengan components

+ A de Bruijn graph assembler ([Minia](https://github.com/GATB/minia), [Abyss](https://github.com/bcgsc/abyss) or [DiscovarDenovo](https://software.broadinstitute.org/software/discovar/blog/))
+ [FastMIN-SG](https://github.com/adigenova/fastmin-sg)
+ [IntervalMiss](https://github.com/adigenova/intervalmiss)
+ [Liger](https://github.com/adigenova/liger)

# Getting the latest source code

## Instructions

It is recommended to download and use the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases

### Building Wengan from source

To build Wengan, first fetch the source code and its components:

```bash
#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan
```

Each Wengan component has its own build instructions. After compilation, copy the binaries to wengan-dir/bin.

#### Requirements

+ A C++ compiler; compilation was tested with gcc version GCC/7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX).
+ cmake 3.2+.

#### Specific component source code versions used to build Wengan v0.1

1. abyss commit [d4b4b5d](https://github.com/bcgsc/abyss/tree/d4b4b5d3091d90a4967180d987bd7168dbf04585)
2. discovarexp-51885 commit [f827bab](https://github.com/adigenova/discovarexp-51885/tree/f827bab9bd0e328fee3dd57b7fefebfeebd92be4)
3. minia commit [017d23e](https://github.com/GATB/minia/tree/017d23e60d56db183c499bb2241345e95514ebbe)
4. fastmin-sg commit [710aea0](https://github.com/adigenova/fastmin-sg/tree/710aea0b970fa6c7482499e5479927fabbad34fe)
5. intervalmiss commit [bb884c4](https://github.com/adigenova/intervalmiss/tree/bb884c4bf408880fd3acb6621e148e57ad6f695d)
6. liger commit [82658bc](https://github.com/adigenova/liger/tree/82658bcc2adde729d05b90faf17d0a22c500189c)
7. seqtk commit [2efd0c8](https://github.com/adigenova/seqtk/tree/2efd0c85767b2e8ae2366d7ea7edb8041adb0eb1)

# Limitations

1. Genomes larger than 4Gb are not supported yet.

# About the name

**Wengan** is a [Mapudungun](https://en.wikipedia.org/wiki/Mapuche_language) word. Mapudungun is the language of the [**Mapuche**](https://en.wikipedia.org/wiki/Mapuche) people, the largest group of indigenous inhabitants of south-central Chile. **Wengan** means "***Making the path***".

# Citation

Alex Di Genova, Elena Buena-Atienza, Stephan Ossowski, Marie-France Sagot. **Wengan: Efficient and high quality hybrid de novo assembly of human genomes.** bioRxiv, [link](https://www.biorxiv.org/content/10.1101/840447v1)
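
As a supplementary note on the benchmark figures: NG50 is commonly defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of an assumed genome size (3.14Gb in the benchmark above). The sketch below illustrates that standard definition in Python; it is not the evaluation code used for the Wengan benchmark, and the function name `ng50` is hypothetical.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly for a given (assumed) genome size.

    Contigs are visited from longest to shortest; the NG50 is the length of
    the contig at which the running total first reaches half the genome size.
    """
    half_genome = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half_genome:
            return length
    return 0  # the assembly covers less than half of the genome size
```

Unlike N50, which uses the total assembly length as the reference, NG50 uses a fixed genome size, so values computed against the same genome size are comparable across assemblies.
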