## Installation
We have docker support for both `MarginPolish` and `HELEN`. Users can install `MarginPolish` and `HELEN` on `Ubuntu 18.04` or any other Linux-based system by following the instructions from our [Installation Guide](docs/installation.md).
If you have installed `MarginPolish-HELEN` locally, please follow the [Local Install Usage Guide](docs/usage_local_install.md).
## Usage
`MarginPolish` requires a draft assembly and a mapping of reads to that draft assembly. We recommend using `Shasta` as the initial assembler and `minimap2` for the mapping.
#### Step 1: Generate an initial assembly
Although any assembler can be used to generate the initial assembly, we highly recommend using [Shasta](https://github.com/chanzuckerberg/shasta).
Please see the [quick start documentation](https://chanzuckerberg.github.io/shasta/QuickStart.html) to learn how to use Shasta. Note that Shasta assembly is memory-intensive.
> For a human-sized assembly, an AWS instance of type x1.32xlarge is recommended. It is usually available for around $4/hour on the AWS spot market and should complete a human-sized assembly in a few hours at around 60x coverage.
An assembly can be generated by running:
```bash
# you may need to convert the fastq to a fasta file
./shasta-Linux-0.1.0 --input <reads.fasta> --output <output_directory>
```
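As the comment above notes, you may need FASTA input. One way to convert FASTQ to FASTA with `awk` (a sketch; the filenames and the demo record are examples, not part of the pipeline):

```shell
# demo input (replace with your real reads.fq)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > reads.fq

# FASTQ records are 4 lines each; keep the header (rewritten as >) and the sequence
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print}' reads.fq > reads.fa
```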
#### Step 2: Create an alignment between reads and shasta assembly
We recommend using `MiniMap2` to generate the mapping between the reads and the assembly.
```bash
# we recommend using FASTQ as marginPolish uses quality values
# this command runs minimap2 with 32 threads; adjust the thread count as needed
minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools sort -@ 32 | samtools view -hb -F 0x104 > reads_2_assembly.bam
samtools index -@32 reads_2_assembly.bam
# the -F 0x104 flag removes unaligned and secondary sequences
```
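The `-F 0x104` filter above combines two standard SAM flag bits: `0x4` (read unmapped) and `0x100` (secondary alignment). A quick sketch to confirm the arithmetic:

```shell
# 0x104 = 0x004 (read unmapped) + 0x100 (secondary alignment);
# samtools view -F excludes any read with either bit set
printf 'UNMAPPED=%d SECONDARY=%d COMBINED=%d\n' 0x4 0x100 0x104
```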
#### Step 3: Generate images using MarginPolish
##### Run MarginPolish using docker
`MarginPolish` can be used in a docker container. You can get the image from:
```bash
docker pull tpesout/margin_polish:latest
docker run tpesout/margin_polish:latest --help
```
To generate images with the `MarginPolish` docker image, first collect all your input files (`shasta_assembly.fa`, `reads_2_assembly.bam`, `allParams.np.human.guppy-ff-235.json`) into a single directory, e.g. `<local_directory>`.
Then please run:
```bash
docker run -v <local_directory>:/data tpesout/margin_polish:latest reads_2_assembly.bam \
shasta_assembly.fa \
allParams.np.human.guppy-ff-235.json \
-t <number_of_threads> \
-o output/marginpolish_images \
-f
```
You can find the parameter file at `path/to/marginpolish/params/allParams.np.human.guppy-ff-235.json`.
#### Step 4: Run HELEN
##### Download Model
Before running `call_consensus.py`, please download the model appropriate for your data. Please read our [model guideline](#models) to understand which model to pick.
##### Get docker images (GPU)
Please install `CUDA 10.0` to run the GPU-enabled docker image of `HELEN`.
```bash
sudo apt-get install nvidia-docker2
sudo docker pull kishwars/helen:0.0.1.gpu
sudo nvidia-docker run kishwars/helen:0.0.1.gpu call_consensus.py -h
```
###### Run call_consensus.py
Please gather all your data into an input directory, then run `call_consensus.py` using the following command:
```bash
sudo nvidia-docker run -v <local_directory>:/data kishwars/helen:0.0.1.gpu call_consensus.py \
-i <marginpolish_image_dir> \
-b 512 \
-m <helen_model.pkl> \
-o <output_dir> \
-w 0 \
-t 1 \
-g
```
Arguments:
```
-i IMAGE_FILE, --image_file IMAGE_FILE
[REQUIRED] Path to a directory where all MarginPolish
generated images are.
-m MODEL_PATH, --model_path MODEL_PATH
[REQUIRED] Path to a trained model (pkl file). Please
see our github page to see options.
-b BATCH_SIZE, --batch_size BATCH_SIZE
Batch size for testing, default is 512. Please set to
512 or 1024 for a balanced execution time.
-w NUM_WORKERS, --num_workers NUM_WORKERS
Number of workers to assign to the dataloader.
FOR THE DOCKER GPU IT HAS TO BE 0.
-t THREADS, --threads THREADS
Number of PyTorch threads to use, default is 1. This
is helpful during CPU-only inference.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Path to the output directory.
-g, --gpu_mode If set then PyTorch will use GPUs for inference.
```
###### Run stitch.py
Finally you can run `stitch.py` to get a consensus sequence:
```bash
sudo nvidia-docker run -v <local_directory>:/data kishwars/helen:0.0.1.gpu \
stitch.py \
-i <consensus_sequence.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>
```
Arguments:
```
-i INPUT_HDF, --input_hdf INPUT_HDF
[REQUIRED] Path to a HDF5 file that was generated
using call consensus.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
[REQUIRED] Path to the output directory.
-t THREADS, --threads THREADS
[REQUIRED] Number of threads.
-p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix for the output file. Default is: HELEN_consensus
```
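After stitching, a quick way to sanity-check the polished FASTA (a sketch; `HELEN_consensus.fa` assumes the default output prefix, and the demo file below just stands in for the real HELEN output):

```shell
# demo file standing in for the real HELEN output (remove for real use)
printf '>contig_1\nACGTACGT\n>contig_2\nGGTTGGTT\n' > HELEN_consensus.fa

# count contigs and total bases in the polished assembly
grep -c '^>' HELEN_consensus.fa
awk '!/^>/ {bases += length($0)} END {print bases " bases"}' HELEN_consensus.fa
```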
##### Get docker images (CPU) (not recommended)
If you want to run the inference on CPU, pull the CPU image:
```bash
sudo docker pull kishwars/helen:0.0.1.cpu
sudo docker run kishwars/helen:0.0.1.cpu call_consensus.py -h
```
##### Run call_consensus.py (CPU)
Please gather all your data into an input directory, then run `call_consensus.py` using the following command:
```bash
sudo docker run -v <local_directory>:/data kishwars/helen:0.0.1.cpu call_consensus.py \
-i <marginpolish_image_dir> \
-b 512 \
-m <helen_model.pkl> \
-o <output_dir> \
-w <number_of_workers> \
-t <number_of_threads>
# make sure (number_of_workers + number_of_threads < total available CPU threads) is TRUE.
```
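The constraint in the comment above can be checked before launching the container (a sketch; the `workers` and `threads` values are examples):

```shell
# keep number_of_workers + number_of_threads below the available CPU threads
total=$(nproc)
workers=4
threads=4
if [ $((workers + threads)) -lt "$total" ]; then
    echo "OK: ${workers} workers + ${threads} threads fit in ${total} CPU threads"
else
    echo "Reduce workers or threads: only ${total} CPU threads available"
fi
```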
##### Run stitch.py
Finally you can run `stitch.py` to get a consensus sequence:
```bash
docker run -v /data:/data kishwars/helen:0.0.1.cpu \
stitch.py \
-i <consensus_sequence.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>
```
## Models
#### Released models
Changes in the basecaller algorithm directly affect the outcome of HELEN. We will release trained models for new basecallers as they come out.
The current model is trained on the autosomes of HG002, excluding chromosome 20, basecalled with Guppy 2.3.5.
We have seen significant differences in homopolymer base-calls between basecallers, so it is important to pick the model matching your basecaller for the best polishing results.
(Figure: confusion matrix of Guppy 2.3.1 on CHM13 chromosome X.)
#### Model Schema
HELEN implements a recurrent neural network (RNN) based multi-task learning model with hard parameter sharing. It uses a sliding-window method, moving through the input sequence in chunks. Because each chunk is evaluated independently, HELEN can use mini-batches during training and inference.
## Runtime and Cost
`MarginPolish-HELEN` ensures runtime consistency and cost efficiency. We have tested our pipeline on `Amazon Web Services (AWS)` and `Google Cloud Platform (GCP)` to ensure scalability.
We studied several samples of 50-60x coverage and created a suggestion framework for running the polishing pipeline. Please be advised that these are cost-optimized suggestions. For better run-time performance you can use more resources.
#### Google Cloud Platform (GCP)
For `MarginPolish`, please use an n1-standard-64 (64 vCPUs, 240GB RAM) instance.
Our estimated run-time is: 12 hours
Estimated cost for `MarginPolish`: $33
For `HELEN`, our suggested instance type is:
* Instance type: n1-standard-32 (32 vCPUs, 120GB RAM)
* GPUs: 2 x NVIDIA Tesla P100
* Disk: 2TB SSD
* Cost: $4.65/hour
The estimated runtime with this instance type is 4 hours.
The estimated cost for `HELEN` is $28.
Total estimated run-time for polishing: 18 hours.
Total estimated cost for polishing: $61
#### Amazon Web Services (AWS)
For `MarginPolish` we recommend a c5.18xlarge (72 vCPUs, 144GiB RAM) instance.
Our estimated run-time is: 12 hours
Estimated cost for `MarginPolish`: $39
We recommend using `p2.8xlarge` instance type for `HELEN`. The configuration is as follows:
* Instance type: p2.8xlarge (32 vCPUs, 488GB RAM)
* GPUs: 8 x NVIDIA Tesla K80
* Disk: 2TB SSD
* Cost: $7.20/hour
* Suggested AMI: Deep Learning AMI (Ubuntu) Version 23.0
The estimated runtime with this instance type: 4 hours
The estimated cost for `HELEN` is: $36
Total estimated run-time for polishing: 16 hours.
Total estimated cost for polishing: $75
Please see our detailed [run-time case study](docs/runtime_cost.md) documentation for better insight.
We also see a significant improvement in runtime over other available polishing algorithms.
## Results
We compared `Medaka` and `HELEN` as polishing pipelines on Shasta assemblies using the `assess_assembly` module from `Pomoxis`. A summary of the quality we produce:
We also see that `MarginPolish-HELEN` performs consistently across multiple assemblers.
## Eleven high-quality assemblies
We have sequenced, assembled, and polished 11 human genomes at the University of California, Santa Cruz with our pipeline. They can be downloaded from our [google bucket](https://console.cloud.google.com/storage/browser/kishwar-helen/polished_genomes/london_calling_2019/).
For quick access, copy a link from the table below and run `wget` to download the file:
```bash
wget <download_link>
```
The eleven assemblies with their download links: