# vcf_stuff

**Repository Path**: quanrd_admin/vcf_stuff

## Basic Information

- **Project Name**: vcf_stuff
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-11-13
- **Last Updated**: 2023-11-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

VCF Stuff
---------
📊Evaluating, filtering, comparing, and visualising genomic variants

![Build](https://github.com/umccr/vcf_stuff/workflows/CI/badge.svg) 
[![Anaconda-Server Badge](https://anaconda.org/umccr/vcf_stuff/badges/installer/conda.svg)](https://anaconda.org/umccr/vcf_stuff) 


## Installation

```
conda install -c vladsaveliev -c conda-forge -c bioconda vcf_stuff
```

If conda is not installed on your computer, preliminary run before the above:

```
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh   # Linux
# wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh  # macOS
bash miniconda.sh -b -p ./miniconda && rm miniconda.sh
. miniconda/etc/profile.d/conda.sh
```

## Development

Clone source and install on top of conda:

```
git clone https://github.com/umccr/vcf_stuff
cd vcf_stuff
pip install -e .
```

If you want to explore Jupyter notebooks, install additionally:

```
conda install -y jupyter matplotlib matplotlib-venn
```

## Testing
```
nosetests -s tests/test.py
```

## Variant calling evaluation

`eval_vcf` is a tool that can compare somatic VCFs to a gold standard and report metrics like sensitivity and specificity. The following commands compares `test-ensemble.vcf.gz` and `test-vardict.vcf.gz` to a gold standard `data/test-benchmark.vcf.gz`:

```
cd tests
eval_vcf data/test-benchmark.vcf.gz \
         data/test-ensemble.vcf.gz \
         data/test-vardict.vcf.gz \
         -g GRCh37 \
         -o results
```

The tool will normalize all input VCFs (see `norm_vcf` below), overlap calls in `test-ensemble.vcf.gz` and `test-vardict.vcf.gz` with the reference VCF `test-benchmark.vcf.gz`, evaluate statistics and print them to stdout:

```
       Sample  SNP                         INDEL                      
               TP FP FN    Prec  Recall    TP FP FN    Prec  Recall
test-ensemble   1  0  0 100.00% 100.00%     0  0  0   0.00%   0.00%
 test-vardict   1  5  0  16.67% 100.00%     0  0  0   0.00%   0.00%
        Truth   1  0  1 100.00% 100.00%     0  0  0 100.00% 100.00%
```

And also save them into `results/report.tsv` in a parsable TSV format. 
 
Internally, the tools uses `bcftools isec` to overlap VCFs. Intermediate results are saved into `results/eval/{sample}_bcftools_isec/` for futher analysis if needed:

```
eval/test-ensemble_bcftools_isec/0000.vcf	# FP (records private to test-ensemble.vcf.gz)
eval/test-ensemble_bcftools_isec/0001.vcf	# FN (records private to test-benchmark.vcf.gz)
eval/test-ensemble_bcftools_isec/0002.vcf	# TP (records from test-ensemble.vcf.gz shared by both)
eval/test-ensemble_bcftools_isec/0003.vcf	# TP (records from test-benchmark.vcf.gz shared by both)
```

Optionally, a BED file can be specified with `-r`:

```
eval_vcf benchmark.vcf sample.vcf -g GRCh37 -r callable_regions.bed
```

If provided, evaluation area will be restricted to those regions. I.e. both benchmark and target VCFs will be subset to those regions, after normalization, but before overlapping.

The tools finds the needed reference fasta file location on Spartan and Raijin, or if [umccrise](https://github.com/umccr/umccrise) is loaded in PATH. Otherwise, you can specify the reference fasta explicitly with `--ref-fasta`, e.g.

```
eval_vcf benchmark.vcf sample.vcf --ref-fasta /genomes/seq/hg38.fa -o results
```

On Spartan and Raijin, instead of feeding the truth VCF directly, one can use presets, e.g. `mb` stands for the [ICGC medulloblastoma study](https://www.nature.com/articles/ncomms10001) T/N somatic variant calling benchmark:

```
eval_vcf mb data/test-ensemble.vcf.gz data/test-vardict.vcf.gz -g GRCh37 -o results
```

See for available options at [https://github.com/umccr/reference_data/blob/master/reference_data/paths.yml#L132-L138](https://github.com/umccr/reference_data/blob/master/reference_data/paths.yml#L132-L138)), e.g. `mb`, `colo` ([COLO829 metastatic melanoma cell line](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4837349)), `giab` (GiaB NA12878 germline variants), `dream` (DREAM synthetic challenge 3). For `giab` and `dream`, additionally truth regions BED files are applied, which are merged automatically with `-r` regions if those are also provided.


## CNV evaluation

Similar to the VCF evaluation, you can compare CNV calls to the truth sets. The following callers are supported:

- cnvkit
- facets
- purple
- manta

The following truth sets are supported:

- HCC2218 exome
- COLO829 WGS (Hartwig's)
- COLO820 WGS ([Craig's study](https://www.nature.com/articles/ncomms10001))

Usage: 

```
eval_cnv -g GRCh37 -o results_eval_cnv \
    data/cnv/hcc2218/HCC2218_truthset_cnv_bcbio.tsv \
    data/cnv/hcc2218/HCC2218_cnvkit-call.cns \
    data/cnv/hcc2218/HCC2218_purple.cnv.tsv \
    data/cnv/hcc2218/HCC2218_facets_cncf.tsv
```

The tool produces 3 tables with standard performance statistics (true/false positive, false negatic, recall and precision rates):

```
Gene level comparison
                Sample    TP    FP    FN  Recall   Prec
0  HCC2218_cnvkit-call  6399  2631    13  99.80% 70.86%
1  HCC2218_facets_cncf  6399  5280    13  99.80% 54.79%
2        HCC2218_manta   606    89  5806   9.45% 87.19%
3   HCC2218_purple.cnv  4602  5924  1810  71.77% 43.72%

Event level comparison (Amp, Del)
                Sample    TP    FP    FN  Recall   Prec
0  HCC2218_cnvkit-call  6396  2638    20  99.69% 70.80%
1  HCC2218_facets_cncf  6398  5289    18  99.72% 54.74%
2        HCC2218_manta   317   378  6099   4.94% 45.61%
3   HCC2218_purple.cnv  4022  6508  2394  62.69% 38.20%

CN level comparison
                Sample    TP    FP    FN  Recall   Prec
0  HCC2218_cnvkit-call  5950  3125   494  92.33% 65.56%
1  HCC2218_facets_cncf  5821  5876   623  90.33% 49.76%
2        HCC2218_manta     0   695  6444   0.00%  0.00%
3   HCC2218_purple.cnv  3111  7423  3333  48.28% 29.53%
```

Each table represents the different level of comparison:

- `Gene level comparison` compares the sets of gene in which any event is occured. E.g. the truth set has a deletion in EGFR, and the sample has also any other event in EGFR (e.g. an amplification), it will count as a true positive.

- `Event level comparison (Amp, Del)` also requires the types of events per gene to be the same. It supports 2 types of events: Amp and Del. `DUP` for certain callers is authomatically translated into `Amp`, and `DEL` into `Del`. Callers that do not report event types but report CN values, CN>2 translates into `Amp`, CN<2 translates into `Del`, and CN=2 is ignored (we don't support copy-neutral LOHs).

- `CN level comparison` requires also the integer copy number estimation values to be the same. Only generated for callers and truth sets that contain CN values. For example, for WGS COLO829, it would looks like the following:

```
Gene level comparison
Sample                     TP    FP    FN  Recall  Prec
COLO_TGEN_bwa-cnvkit-call  6716  4680  6   99.91%  58.93%

Event level comparison (Amp, Del)
Sample                     TP    FP    FN  Recall  Prec
COLO_TGEN_bwa-cnvkit-call  6714  4700  8   99.88%  58.82%
```

In addition to that overall stats table, the tool will produce a per-gene table for details exploration, like the following:

```
                 truth         HCC2218_cnvkit-call  HCC2218_facets_cncf  HCC2218_manta  HCC2218_purple.cnv
1  AADACL3       Del:1         Del:1                Del:1                               Del:1
1  AADACL4       Del:1         Del:1                Del:1                               Del:1
1  ABCB10        Amp:6         Amp:6                Amp:6                               Amp:4
1  ABL2          Amp:3         Amp:3                Amp:3
1  AC004824.2    Del:1         Del:1                Del:1                               Del:1
1  AC092782.1    Amp:4         Amp:4                Amp:4                               Amp:3
1  AC092811.1    Amp:5         Amp:5                Amp:5                               Amp:4
...
```

The tools will write the report into `results_eval_cnv/report.tsv`, and the per-gene table into `results_eval_cnv/table.tsv`.

To consistently determine the genes affected by events, the tools re-annotates all events with the [bed_annotation](https://github.com/vladsaveliev/bed_annotation) package that assigns gene names to Ensembl genomic regions.


## Panel of normals

Removing variants detected as germline in a set of unrelated normal tissue samples helps to reduce the FP rate when it was caused by unbalanced coverage in matching regions normals.

Below showing stats for the evaluation of the ICGC MB T/N variant calling with 300x tumor coverage, and 50 normal coverage. The number in `vardict_n1` means how many heats in PoN we allow before we filter out the variant.

![Evaluation of the ICGC MB T/N variant calling with 300x tumor coverage, and 50 normal coverage](vcf_stuff/panel_of_normals/benchmark_50_300.png)

Annotate a VCF against a panel of normals:

```
pon_anno data/test-ensemble.vcf.gz -g GRCh37 -o test-ensemble.pon.vcf
```

This adds `PoN_CNT` `INFO` flags indicating the number of hits in the panel; writes output VCF to `test-ensemble.pon.vcf`

Annotate and soft-filter variants with at least 2 hits (adds `PoN` flag into `FILTER`):

```
pon_anno data/test-ensemble.vcf.gz -h 2 -g GRCh37 -o test-ensemble.pon.vcf
```

By default, `pon_anno` only checks the positions of the variants. If you want to compare exact alleles, use the `--check-allele` flag.

To process multiple samples with multiple threshold hits, you can use the `pon_pipeline` script:

```
pon_pipeline data/test-vardict.vcf.gz data/test-strelka.vcf.gz -o results_pon -h1,2,3 -g GRCh37
```

The filtered VCF files will be written to `results_pon/pon_filter/`.

Scripts only know about the panel location on Spartan and Raijin, so to work outside, provide the path to the panel with `--pon-dir`.

The script is also used withing a bigger [`anno_somatic_vcf`](https://github.com/umccr/vcf_stuff/blob/master/scripts/anno_somatic_vcf), which
annotates against various sources that then can be used by [`filter_somatic_vcf`](https://github.com/umccr/vcf_stuff/blob/master/scripts/filter_somatic_vcf)
to filter a somatic VCF. Both script are used in [Umccrise](https://github.com/umccr/umccrise).


### Building the panel

On Spartan: 

```
cd /data/cephfs/punim0010/extras/panel_of_normals
snakemake -s prep_normals.smk -p -j30
```

## VCF normalisation

Normalise VCF file:

```
norm_vcf data/test-ensemble.vcf.gz -g GRCh37 > test-ensemble.norm.vcf
```

This script does the following steps:

1. Split multi-allelic variants into single sample records.

For instance, split one record 
```
#CHROM  POS     ID      REF     ALT
1       10       .      A       T,C
```
Into 2 separate records
```
#CHROM  POS     ID      REF     ALT
1       10       .      A       T
1       10       .      A       C
```
For that, we are using [vt tools](https://github.com/atks/vt):
```
vt decompose -s vcf_file
```

2. Decompose biallelic block substitutions. 

For instance, split the following one records:

```
#CHROM  POS     ID      REF     ALT
1       20      .       AG      CT       
```

into 2 separate ones:

```
#CHROM  POS     ID      REF     ALT
1       20       .      A       C
1       20       .      G       T
```

We are using for that vcflib's [`vcfallelicprimitives`](https://github.com/vcflib/vcflib#vcfallelicprimitives):

```
vcfallelicprimitives -t DECOMPOSED --keep-geno vcf_file
```

3. Left-align and normalize indels, check if REF alleles match the reference.

For instance, given that the reference chromosome 1 starts with `GCTCCG`, split the following records

```
#CHROM  POS     ID      REF     ALT
1       2       .       CTCC    CCC,C,CCCC
```

into the following 3:

```
#CHROM  POS     ID      REF     ALT
1       1       .       GCTC    G
1       2       .       CT      C
1       3       .       T       C
```

These steps are applied to each input VCF for the `eval_vcf` pipeline above.


## INFO fields normalisation

VCFs coming from bcbio-nextgen are called with different callers, each using its own way to report quality, depth and allelic frequencies (if at all). To facilitate processing and reporting VCFs in [PCGR](https://github.com/sigven/pcgr), we prepared a script that calculates and populates `TUMOR_AF`, `NORMAL_AF`, `TUMOR_DP`, `NORMAL_DP` fields, this way standardizing output from Strelka2, Mutect2, Freebayes, GATK Haplotype Caller, and VarDict, and consequenctly, Ensemble calls.

```
pcgr_prep data/test-ensemble.vcf.gz -g GRCh37 > test-ensemble.pcgr_prep.vcf
```


## Playground

### Somatic filtering

This repository provides description, code and results for the approaches to somatic variant filtering in UMCCR.

We summarize the filtering ideas into a [spreadsheet](https://docs.google.com/spreadsheets/d/1Xbz4nW76mofKb9ym3C725W035qkA7JuUu8_FvYSCOT0/edit#gid=0), where for problem and idea we provide a corresponding solution (if available) used in [CaVEMan](https://github.com/cancerit/cgpCaVEManWrapper), [bcbio](http://bcb.io/2015/03/05/cancerval/), and [AZ VarDict pipeline](vardict/vardict_filtering.md). The basic factors and ideas are from Matt Eldridge's [slides](https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day3/somatic_snv_filtering.html).


### VarDict VCF filtering

Commands to filter VarDict VCF files.

Moves main sample AF and DP fields into INFO, for PCGR post-processing:
```
proc_vardict_vcf fix_fields vardict.vcf.gz > out.vcf
```

Applies special AF threshold filtering to homopolimers based on `MSI` `INFO` fields generated by VarDict. Writes `MSI_FILTER` into the `FILTER` field.
```
proc_vardict_vcf filter_low_af_msi vardict.vcf.gz > out.vcf
```