# genomics-pipelines

**Repository Path**: mirrors_databricks/genomics-pipelines

## Basic Information

- **Project Name**: genomics-pipelines
- **Description**: secondary analysis pipelines parallelized with apache spark
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-03-05
- **Last Updated**: 2025-12-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Genomics pipelines. At scale. With Spark and Glow. :exploding_head:

# What's inside?

Spark based pipelines for:
- Variant calling (built on GATK's HaplotypeCaller)
- Somatic variant calling (built on MuTect2)
- Joint genotyping (built on GenotypeGVCFs)

# Building and testing

1. Clone the repo
2. Unpack the big test files archive located in the project root
  - `tar -xf big-files.tar.gz` 
3. `sbt test`

# Running on a Databricks cluster

1. Create an init script to download the reference genome from cloud storage (see `hls.sh` or
   `prepare_reference.py` for inspiration.
2. Build an uber jar (`sbt assembly`)
3. Create a cluster with the init script from step 1 and attach the assembly jar.
4. Run the desired pipeline using one of the attached notebooks.

# License

[Apache 2.0](LICENSE)

## Disclaimer

This is not an official Databricks product. This project is released without an expectation of continued development or maintenance.