## Installation
We have docker support for both `MarginPolish` and `HELEN`. Users can install `MarginPolish` and `HELEN` on `Ubuntu 18.04` or any other Linux-based system by following the instructions from our [Installation Guide](docs/installation.md).
If you have installed `MarginPolish-HELEN` locally, please follow the [Local Install Usage Guide](docs/usage_local_install.md).
## Usage
`MarginPolish` requires a draft assembly and a mapping of reads to that draft assembly. We recommend using `Shasta` as the initial assembler and `minimap2` for the mapping.
#### Step 1: Generate an initial assembly
Although any assembler can be used to generate the initial assembly, we highly recommend using [Shasta](https://github.com/chanzuckerberg/shasta).
Please see the [quick start documentation](https://chanzuckerberg.github.io/shasta/QuickStart.html) to learn how to use Shasta. Note that Shasta assembly is memory-intensive.
> For a human-sized assembly, an AWS instance of type x1.32xlarge is recommended. It is usually available for around $4/hour on the AWS spot market and should complete a human-sized assembly in a few hours at around 60x coverage.
An assembly can be generated by running:
```bash
# you may need to convert the fastq to a fasta file
./shasta-Linux-0.1.0 --input <reads.fasta> --output <output_directory>
```
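As the comment above notes, you may need FASTA input. One way to convert FASTQ to FASTA with `awk` (a sketch; the filenames and the demo record are examples, not part of the pipeline):

```shell
# demo input (replace with your real reads.fq)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > reads.fq

# FASTQ records are 4 lines each; keep the header (rewritten as >) and the sequence
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print}' reads.fq > reads.fa
```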
#### Step 2: Create an alignment between reads and shasta assembly
We recommend using `MiniMap2` to generate the mapping between the reads and the assembly.
```bash
# we recommend using FASTQ as marginPolish uses quality values
# this command runs minimap2 with 32 threads; adjust the thread count as needed
minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools sort -@ 32 | samtools view -hb -F 0x104 > reads_2_assembly.bam
samtools index -@32 reads_2_assembly.bam
# the -F 0x104 flag removes unaligned and secondary sequences
```
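The `-F 0x104` filter above combines two standard SAM flag bits: `0x4` (read unmapped) and `0x100` (secondary alignment). A quick sketch to confirm the arithmetic:

```shell
# 0x104 = 0x004 (read unmapped) + 0x100 (secondary alignment);
# samtools view -F excludes any read with either bit set
printf 'UNMAPPED=%d SECONDARY=%d COMBINED=%d\n' 0x4 0x100 0x104
```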
#### Step 3: Generate images using MarginPolish
##### Run MarginPolish using docker
`MarginPolish` can be used in a docker container. You can get the image from:
```bash
docker pull tpesout/margin_polish:latest
docker run tpesout/margin_polish:latest --help
```
To generate images with the `MarginPolish` docker image, first collect all your input files (`shasta_assembly.fa`, `reads_2_assembly.bam`, `allParams.np.human.guppy-ff-235.json`) into a single directory, e.g. `<local_directory>`.
Then please run:
```bash
docker run -v <local_directory>:/data tpesout/margin_polish:latest reads_2_assembly.bam \
shasta_assembly.fa \
allParams.np.human.guppy-ff-235.json \
-t <number_of_threads> \
-o output/marginpolish_images \
-f
```
You can find the parameter file at `path/to/marginpolish/params/allParams.np.human.guppy-ff-235.json`.
#### Step 4: Run HELEN
##### Download Model
Before running `call_consensus.py`, please download the model appropriate for your data. Please read our [model guideline](#models) to understand which model to pick.
##### Get docker images (GPU)
Please install `CUDA 10.0` to run the GPU-enabled docker image of `HELEN`.
```bash
sudo apt-get install nvidia-docker2
sudo docker pull kishwars/helen:0.0.1.gpu
sudo nvidia-docker run kishwars/helen:0.0.1.gpu call_consensus.py -h
```
###### Run call_consensus.py
Please gather all your data into an input directory, then run `call_consensus.py` using the following command:
```bash
sudo nvidia-docker run -v <local_directory>:/data kishwars/helen:0.0.1.gpu call_consensus.py \
-i <marginpolish_image_dir> \
-b 512 \
-m <helen_model.pkl> \
-o <output_dir> \
-w 0 \
-t 1 \
-g
```
Arguments:
```
-i IMAGE_FILE, --image_file IMAGE_FILE
[REQUIRED] Path to a directory where all MarginPolish
generated images are.
-m MODEL_PATH, --model_path MODEL_PATH
[REQUIRED] Path to a trained model (pkl file). Please
see our github page to see options.
-b BATCH_SIZE, --batch_size BATCH_SIZE
Batch size for testing, default is 512. Please set to
512 or 1024 for a balanced execution time.
-w NUM_WORKERS, --num_workers NUM_WORKERS
Number of workers to assign to the dataloader.
FOR THE DOCKER GPU IT HAS TO BE 0.
-t THREADS, --threads THREADS
Number of PyTorch threads to use, default is 1. This
is helpful during CPU-only inference.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Path to the output directory.
-g, --gpu_mode If set then PyTorch will use GPUs for inference.
```
###### Run stitch.py
Finally you can run `stitch.py` to get a consensus sequence:
```bash
sudo nvidia-docker run -v <local_directory>:/data kishwars/helen:0.0.1.gpu \
stitch.py \
-i <consensus_sequence.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>
```
Arguments:
```
-i INPUT_HDF, --input_hdf INPUT_HDF
[REQUIRED] Path to a HDF5 file that was generated
using call consensus.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
[REQUIRED] Path to the output directory.
-t THREADS, --threads THREADS
[REQUIRED] Number of threads.
-p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix for the output file. Default is: HELEN_consensus
```
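After stitching, a quick way to sanity-check the polished FASTA (a sketch; `HELEN_consensus.fa` assumes the default output prefix, and the demo file below just stands in for the real HELEN output):

```shell
# demo file standing in for the real HELEN output (remove for real use)
printf '>contig_1\nACGTACGT\n>contig_2\nGGTTGGTT\n' > HELEN_consensus.fa

# count contigs and total bases in the polished assembly
grep -c '^>' HELEN_consensus.fa
awk '!/^>/ {bases += length($0)} END {print bases " bases"}' HELEN_consensus.fa
```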
##### Get docker images (CPU) (not recommended)
If you want to run the inference on CPU, pull the CPU image:
```bash
sudo docker pull kishwars/helen:0.0.1.cpu
sudo docker run kishwars/helen:0.0.1.cpu call_consensus.py -h
```
##### Run call_consensus.py (CPU)
Please gather all your data into an input directory, then run `call_consensus.py` using the following command:
```bash
sudo docker run -v <local_directory>:/data kishwars/helen:0.0.1.cpu call_consensus.py \
-i <marginpolish_image_dir> \
-b 512 \
-m <helen_model.pkl> \
-o <output_dir> \
-w <number_of_workers> \
-t <number_of_threads>
# make sure (number_of_workers + number_of_threads < total available CPU threads) is TRUE.
```
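The constraint in the comment above can be checked before launching the container (a sketch; the `workers` and `threads` values are examples):

```shell
# keep number_of_workers + number_of_threads below the available CPU threads
total=$(nproc)
workers=4
threads=4
if [ $((workers + threads)) -lt "$total" ]; then
    echo "OK: ${workers} workers + ${threads} threads fit in ${total} CPU threads"
else
    echo "Reduce workers or threads: only ${total} CPU threads available"
fi
```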
##### Run stitch.py
Finally you can run `stitch.py` to get a consensus sequence:
```bash
docker run -v /data:/data kishwars/helen:0.0.1.cpu \
stitch.py \
-i <consensus_sequence.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>
```
## Models
#### Released models
Changes in the basecaller algorithm directly affect the outcome of HELEN. We will release trained models for new basecallers as they come out.
The current model is trained on the autosomes of HG002, excluding chromosome 20, basecalled with Guppy 2.3.5.
We have seen significant differences in homopolymer base-calls between basecallers, so it is important to pick the model matching your basecaller for the best polishing results.
(Figure: confusion matrix of Guppy 2.3.1 on CHM13 chromosome X.)
#### Model Schema
HELEN implements a recurrent neural network (RNN) based multi-task learning model with hard parameter sharing. It uses a sliding-window method, moving through the input sequence in chunks. Because each chunk is evaluated independently, HELEN can use mini-batches during training and inference.
## Runtime and Cost
`MarginPolish-HELEN` ensures runtime consistency and cost efficiency. We have tested our pipeline on `Amazon Web Services (AWS)` and `Google Cloud Platform (GCP)` to ensure scalability.
We studied several samples of 50-60x coverage and created a suggestion framework for running the polishing pipeline. Please be advised that these are cost-optimized suggestions. For better run-time performance you can use more resources.
#### Google Cloud Platform (GCP)
For `MarginPolish`, please use an n1-standard-64 (64 vCPUs, 240GB RAM) instance.
Our estimated run-time is: 12 hours
Estimated cost for `MarginPolish`: $33
For `HELEN`, our suggested instance type is:
* Instance type: n1-standard-32 (32 vCPUs, 120GB RAM)
* GPUs: 2 x NVIDIA Tesla P100
* Disk: 2TB SSD
* Cost: $4.65/hour
The estimated runtime with this instance type is 4 hours.
The estimated cost for `HELEN` is $28.
Total estimated run-time for polishing: 18 hours.
Total estimated cost for polishing: $61
#### Amazon Web Services (AWS)
For `MarginPolish` we recommend a c5.18xlarge (72 vCPUs, 144GiB RAM) instance.
Our estimated run-time is: 12 hours
Estimated cost for `MarginPolish`: $39
We recommend using `p2.8xlarge` instance type for `HELEN`. The configuration is as follows:
* Instance type: p2.8xlarge (32 vCPUs, 488GB RAM)
* GPUs: 8 x NVIDIA Tesla K80
* Disk: 2TB SSD
* Cost: $7.20/hour
* Suggested AMI: Deep Learning AMI (Ubuntu) Version 23.0
The estimated runtime with this instance type: 4 hours
The estimated cost for `HELEN` is: $36
Total estimated run-time for polishing: 16 hours.
Total estimated cost for polishing: $75
Please see our detailed [run-time case study](docs/runtime_cost.md) documentation for better insight.
We also see a significant improvement in runtime over other available polishing algorithms.
## Results
We compared `Medaka` and `HELEN` as polishing pipelines on Shasta assemblies using the `assess_assembly` module from `Pomoxis`. A summary of the quality we produce:
We also see that `MarginPolish-HELEN` performs consistently across multiple assemblers.
## Eleven high-quality assemblies
We have sequenced, assembled, and polished 11 human genomes at the University of California, Santa Cruz with our pipeline. They can be downloaded from our [google bucket](https://console.cloud.google.com/storage/browser/kishwar-helen/polished_genomes/london_calling_2019/).
For quick access, copy a link from the table below and run `wget` to download the file:
```bash
wget <download_link>
```
The eleven assemblies with their download links: