# nfl-demo
**Repository Path**: zhixingfeng/nfl-demo
## Basic Information
- **Project Name**: nfl-demo
- **Description**: The demo of NanoFreeLunch
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 1
- **Created**: 2025-01-14
- **Last Updated**: 2025-08-12
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# **Install tools**
## **Install `Julia`**
1. Download and install Julia from https://julialang.org/.
2. Type `Julia`. A console should be launched if it is correctly installed.
## **Install NanoFreeLunch as a command line tool**
1. Install NanoFreeLunch according to the document in https://gitee.com/zhixingfeng/NanoFreeLunch.jl .
## **Install NanoFreeLunch as a Julia Package**
1. Download and install Julia from https://julialang.org/.
2. Lauch Julia and type `using Pkg; Pkg.activate("./tools/env")`.
3. Assuming NanoFreeLunch is unzipped into folder `./tools/NanoFreeLunch` Type `Pkg.develop(path="./tools/NanoFreeLunch")`.
4. Type `Pkg.status("NanoFreeLunch")` to check if the package is correctly installed.
## **Install third-party Julia packages**
1. Launch `Julia` and type `using Pkg; Pkg.activate("./tools/env")`.
2. Type `]` and `add GenomicFeatures DataFrames CSV StatsBase Plotly Random Statistics`.
# **Prepare the reference data (ENCODE, ICR, CpG island, and reference genome)**
1. The histone modification, DNase hypersensitive regions, the ICRs, and CpG islands are pre-downloaded in `ref`.
2. Install `awscli` from https://aws.amazon.com/cn/cli/ or https://anaconda.org/conda-forge/awscli.
3. Enter the `ref` folder and download the reference genome (hg38) according to `download.sh`.
# **Test NanoFreeLunch on the R9 human pangenome data (PromethION)**
Enter the `r9_pangenome` folder by typing `cd r9_pangenome`
## **Data downloading and preparation**
### Download the raw FAST5 files and pre-basecalled FASTQ files (Basecalled using Guppy 4.2.2 and Guppy 2.3.5)
1. Enter the `data` folder by typing `pushd data`.
2. Download the data with `awscli` according to `download.sh`.
**Warning:** the data are extremely large (> 20Tb).
3. The downloaded data have two folder with the names like `HG*`, which represent the sample names. In each folder, for example `HG01109`, the `.fast5.tar` or `.fast5.tar.gz` files in the `nanopore` folder are the raw FAST5 files. Create a folder named `fast5` and unzip the `.fast5.tar` or `.fast5.tar.gz` files into `fast5` (**Warning:** time consuming).
4. The `fastq.gz` files under the `nanopore` folder are the pre-basecalled FASTQ files using Guppy 2.3.5.
5. The `fastq.gz` files under the `nanopore/Guppy_4.2.2/` folder are the pre-basecalled FASTQ files using Guppy 4.2.2.
6. Type `popd` to get back.
### Obtain the overlapped regions between DNase hypersensitive regions and H3K4me3 regions as well as the H3K9me3 regions in autochromosomes.
1. Enter the `script` folder.
2. Type `julia histonecode_vs_nfl_getregions.jl`.
3. The overlapped regions between DNase hypersensitive regions and H3K4me3 regions is in the file `../results/encode/GM12878/dnase_h3k4me3_signalvalue_5.tsv`.
4. H3K9me3 regions in autochromosome is `../results/encode/GM12878/h3k9me3_signalvalue_5.tsv`.
## **Basecalling and alignment of the FAST5 files using Guppy 6.3.8**
1. Install `Guppy`.
2. Use Guppy to basecall and align the FAST5 files to hg38 according to `basecall_guppy6.3.8.slurm`. Type `sbatch basecall_guppy6.3.8.slurm` if you are using `SLURM` as the job management system in your cluster. Otherwise, adapt the `SLURM` code to shell or the script compatible with your job management system. `SLURM` code is basically linux shell code with some special variables to control computational resources such as queue and number of CPUs/GPUs to use.
**Warning:** basecalling might take a couple of days using modern GPUs and is impossible to finish using CPUs.
3. The default outputs of Guppy are a lot of FASTQ files or aligned BAM files. Merge them into a single BAM file according to `basecall_guppy6.3.8_mergebam.slurm`.
4. Filter the unaligned or multiple aligned reads according to `basecall_guppy6.3.8_uniq.slurm`.
5. The output basecalled and aligned reads are in `../results/basecall/guppy/6.3.8/dna_r9.4.1_450bps_hac/HG*`.
## **Align the FASTQs files pre-basecalled using Guppy 4.2.2 and Guppy 2.3.5**
The human pangenome project has provided pre-basecalled FASTQ files using Guppy 4.2.2 and Guppy 2.3.5, so in-house basecalling is not needed. Just align the reads to hg38.
1. Install `minimap2` and `samtools`.
2. Align the filter multiple aligned reads according to `minimap2_guppy4.2.2.slurm` and `minimap2_guppy2.3.5.slurm` respectively.
3. The output aligned BAM files are in `../results/alignment/minimap2/guppy4.2.2` and `../results/alignment/minimap2/guppy2.3.5` respectively.
## **Raw-signal-based methylation calling using Guppy 6.3.8**
1. Install `Guppy`, `tabix`, and `modbam2bed`.
2. Similar to basecalling, use Guppy to modcall and align the FAST5 files to hg38 according to `modcall_guppy.slurm`. Type `sbatch modcall_guppy.slurm` if you are using `SLURM` as the job management system in your cluster.
3. Merge the modcalled and aligned BAM files into a single BAM file according to `modcall_guppy_mergebam.slurm`.
4. Calculate the DNA methylation level of each CpG sites using `modbam2bed` according to `modcall_guppy_modbam2bed.slurm` and `modcall_guppy_modbam2bed_tabix.slurm`.
5. The final output is the indexed BED file in `../results/modcall/guppy/6.3.8/dna_r9.4.1_450bps_modbases_5mc_cg_hac/HG*/bonito.cpg.bed.gz`.
## **Run NanoFreeLunch**
### Feature extraction using NanoFreeLunch (Guppy 6.3.8)
NanoFreeLunch takes the aligned and filtered BAM files at `../results/basecall/guppy/6.3.8/dna_r9.4.1_450bps_hac/HG*/merged.uniq.bam` as the input. The feature extraction module gets features for each CpG site listed in `locifile`, which is at `../results/locifiles/HG*/chr*_s800c10.tsv` and obtained according to `genlocifile.slurm` and `genlocifile_merge.sh`.
1. Run NanoFreeLunch for feature extraction according to `nfl_guppy6.3.8_prepdata_s800c10.slurm`.
2. The output is under folder `../results/nfl/guppy6.3.8/s800c10/prepdata/HG*`.
**Info:** feature extraction takes several hours to complete with 20 threads for each chromosome. The feature folders are large, delete them if Prediction or Training is finished.
### Feature extraction using NanoFreeLunch (Guppy 4.2.2 or 2.3.5)
The input is the alignment results of minimap2 in `../results/alignment/minimap2/guppy4.2.2/*.bam` and `../results/alignment/minimap2/guppy2.3.5/*.bam`.
1. Run NanoFreeLunch for feature extraction according to `nfl_guppy4.2.2_prepdata_s800c10.slurm` and `nfl_guppy2.3.5_prepdata_s800c10.slurm`.
2. The output is under folder `../results/nfl/guppy4.2.2/s800c10/prepdata/HG*` and `../results/nfl/guppy2.3.5/s800c10/prepdata/HG*`.
**Info:** feature extraction takes several hours to complete with 20 threads for each chromosome. The feature folders are large, delete them if Prediction is finished.
### Predict methylation level using NanoFreeLunch
1. Predict DNA methylation level for each CpG site according to `nfl_guppy6.3.8_predict_s800c10.slurm`, `nfl_guppy4.2.2_predict_s800c10.slurm`, and `nfl_guppy2.3.5_predict_s800c10.slurm`.
2. By default, NanoFreeLunch predicts DNA methylation level for each chromosome separately. To merge them, use `nfl_guppy6.3.8_predict_s800c10_merge.sh`, `nfl_guppy4.2.2_predict_s800c10_merge.sh`, and `nfl_guppy2.3.5_predict_s800c10_merge.sh`.
3. The output is in `../results/nfl/guppy6.3.8/s800c10/predict_merge/*.bed`, `../results/nfl/guppy4.2.2/s800c10/predict_merge/*.bed`, and `../results/nfl/guppy2.3.5/s800c10/predict_merge/*.bed`.
4. The output format is defined in https://github.com/epi2me-labs/modbam2bed?tab=readme-ov-file#method-and-output-format.
## **Evalution of NanoFreeLunch**
### Evaluate the locus-level accuracy of NanoFreeLunch
1. Calculate the PCCs and heatmaps to compare NanoFreeLunch and Guppy according to `check_pcc_locus_level_guppy6.3.8.jl`, `check_pcc_locus_level_guppy4.2.2.jl`, and `check_pcc_locus_level_guppy2.3.5.jl`.
2. The figures are in `../results/nfl/guppy6.3.8/s800c10/eval/scatterplot/`, `../results/nfl/guppy4.2.2/s800c10/eval/scatterplot/`, and `../results/nfl/guppy2.3.5/s800c10/eval/scatterplot/`.
3. The PCC matrices are in `../results/nfl/guppy6.3.8/s800c10/eval/pccmat.tsv`, `../results/nfl/guppy4.2.2/s800c10/eval/pccmat.tsv`, and `../results/nfl/guppy2.3.5/s800c10/eval/pccmat.tsv`.
### Evaluate the region-level accuracy of NanoFreeLunch
1. Get .bed file from modcall of Guppy 6.3.8 by filtering loci with depth lower than 10 or score lower than 800 according to `genbedfile.slurm` and `genbedfile_merge.sh`. The output is in `../results/modbedfiles/HG*/all_s800c10.bed`
2. Get average methylation level of CpG islands predicted by Guppy according to `modcall_guppy_cpg.slurm`.
3. Get average methylation level of CpG islands predicted by NanoFreeLunch according to `nfl_guppy6.3.8_cpg_s800c10.slurm`, `nfl_guppy4.2.2_cpg_s800c10.slurm`, and `nfl_guppy2.3.5_cpg_s800c10.slurm`.
4. Calculate the PCCs and heatmaps to compare average methylation level of CpG islands predcited by NanoFreeLunch and Guppy according to `check_pcc_cgi_level_guppy6.3.8.jl`, `check_pcc_cgi_level_guppy4.2.2.jl`, and `check_pcc_cgi_level_guppy2.3.5.jl`.
5. The outputs are in `../results/nfl/guppy6.3.8/s800c10/eval/scatterplot_cgi/`, `../results/nfl/guppy4.2.2/s800c10/eval/scatterplot_cgi/`, and `../results/nfl/guppy2.3.5/s800c10/eval/scatterplot_cgi/`.
### Evaluate NanoFreeLunch in ICR
1. Plot boxplots of methylation level in ICR predicted by NanoFreeLunch according to `plot_ICR_modlevel-fb.jl`. Change `version` in the `.jl` file to make plots for different results using different Guppy versions for basecalling. The results are in `../results/nfl/guppy*/s800c10/imprinting/`.
2. Plot boxplots of methylation level in ICR predicted by Guppy according to `guppy_imprinting.slurm` and `guppy_imprinting_viz.jl`. The results are in `../results/guppy_imprinting_viz`.
### Compare DNA methylation predicted by NanoFreeLunch with histone modifications and DNase sensitivity
1. Get average methylation level of DNase hypersensitive regions marked by H3K4me3 and H3K9me3 regions predicted by NanoFreeLunch according to `histonecode_vs_nfl_modlevel.sh`.
2. Plot boxplots according to `histonecode_vs_nfl_viz.jl`. The results are in `../results/nfl/guppy*/s800c10/encode/GM12878/`.
3. Get average methylation level of DNase hypersensitive regions marked by H3K4me3 and H3K9me3 regions predicted by Guppy according to `guppy_histonecode.slurm`.
4. Plot boxplots of Guppy according to `guppy_histonecode_viz.jl`. The results are in `../results/guppy_histonecode_viz/`.
# **Test NanoFreeLunch on the R9 HG002 data (MinION)**
Enter the `r9_hg002`.
## **Data downloading and preparation**
1. Enter the `data` folder by typing `pushd data`.
2. Download the data with `awscli` according to `download.sh`.
3. The raw-signal FAST5 files are in `gm24385_mod_2021.09/flowcells/` folder.
4. `gm24385_mod_2021.09/extra_analysis/bonito_remora/bonito.cpg.bed.gz` is the pre-calculated raw-signal-based DNA methylation level for each CpG site.
5. Type `popd` to get back.
## **Basecalling and alignment of the FAST5 files using Guppy 6.3.8**
1. Install `Guppy`.
2. Use Guppy to basecall and align the FAST5 files to hg38 according to `basecall_guppy6.3.8.slurm`. Type `sbatch basecall_guppy6.3.8.slurm` if you are using `SLURM` as the job management system in your cluster. Otherwise, adapt the `SLURM` code to shell or the script compatible with your job management system. `SLURM` code is basically linux shell code with some special variables to control computational resources such as queue and number of CPUs/GPUs to use.
**Warning:** basecalling might take a couple of days using modern GPUs and is impossible to finish using CPUs.
3. The default outputs of Guppy are a lot of FASTQ files or aligned BAM files. Merge them into a single BAM file according to `basecall_guppy6.3.8_mergebam.slurm`.
4. Filter the unaligned or multiple aligned reads according to `basecall_guppy6.3.8_uniq.slurm`.
5. The final result is `../results/basecall/guppy/6.3.8/dna_r9.4.1_450bps_hac/gm24385_mod_2021.09/merged.uniq.bam`.
## **Run NanoFreeLunch**
### Feature extraction
The feature extraction module gets features for each CpG site listed in `locifile`, which is at `../results/locifiles/guppy/6.3.8/dna_r9.4.1_450bps_hac/gm24385_mod_2021.09/` and obtained according to `genlocifile.slurm` and `genlocifile_merge.sh`.
1. Run the feature extraction module according to `nfl_guppy6.3.8_prepdata_s800c10.slurm`.
2. The output is under folder `../results/nfl/guppy6.3.8/s800c10/prepdata/gm24385_mod_2021.09/`.
### Predict methylation level using NanoFreeLunch
1. Predict DNA methylation level for each CpG site according to `nfl_guppy6.3.8_predict_s800c10.slurm`.
2. By default, NanoFreeLunch predicts DNA methylation level for each chromosome separately. To merge them, use `nfl_guppy6.3.8_predict_s800c10_merge.sh`.
3. The output is in `../results/nfl/guppy6.3.8/s800c10/predict_merge/*.bed`.
4. The output format is defined in https://github.com/epi2me-labs/modbam2bed?tab=readme-ov-file#method-and-output-format.
## **Evalution of NanoFreeLunch**
### Evaluate the locus-level accuracy of NanoFreeLunch
1. Calculate the PCCs and heatmaps to compare NanoFreeLunch and Guppy according to `check_pcc_locus_level_guppy6.3.8.jl`. The heatmap-based scatter plot is in `../results/nfl/guppy6.3.8/s800c10/eval/scatterplot/` and The PCC matrix is in `../results/nfl/guppy6.3.8/s800c10/eval/pccmat.tsv`.
2. Calculate the PCCs and heatmaps to compare NanoFreeLunch and bisulphite sequencing according to `nfl_vs_bisulphite_guppy6.3.8.jl`. The heatmap-based scatter plot is in `../results/nfl/guppy6.3.8/s800c10/eval/nfl_vs_bisulphite/`.
### Evaluate the region-level accuracy of NanoFreeLunch
1. Get average methylation level of CpG islands predicted by Guppy according to `modcall_guppy_cpg.slurm`.
2. Get average methylation level of CpG islands predicted by bisulfite sequencing according to `bisulfite_cpg.sh`.
3. Get average methylation level of CpG islands predicted by NanoFreeLunch sequencing according to `nfl_guppy6.3.8_cpg_s800c10.slurm`.
4. Calculate the PCCs and heatmaps to compare average methylation level of CpG islands predicted by NanoFreeLunch and Guppy according to `check_pcc_cgi_level_guppy6.3.8.jl`. The result is in `../results/nfl/guppy6.3.8/s800c10/eval/scatterplot_cgi/`.
5. Calculate the PCCs and heatmaps to compare average methylation level of CpG islands predicted by NanoFreeLunch and bisulfite sequencing according to `check_pcc_cgi_level_guppy6.3.8_bisulfite.jl`. The result is in `../results/nfl/guppy6.3.8/s800c10/eval/scatterplot_cgi_bisulfite/`.
# **Test NanoFreeLunch on the R10 Ashkenazim trio data (5kHz)**
Enter the `r10_ashkenazim_trio/5khz/`. Install `Dorado` and the POD5 API `pod5-file-format`.
## Data downloading and preparation
1. Enter folder `data/flowcells` by typing `pushd data/flowcells` and download the raw-signal POD5 files according to `download.sh`. The downloaded three folder `hg002`, `hg003`, and `hg004` have the POD5/FAST5 files for each sample. Most of the raw-signal files are in POD5 format, but some files in `HG004` are in FAST5 format.
2. Type `popd` to get back.
## Basecall the alignment of the POD5 files
1. Merge the POD5 files of each sample according to `modcall-merge-rawfiles.sh` using the POD5 API.
2. Basecall and align the merged POD5 files according to `basecall.slurm` using `Dorado`. The output aligned BAM files are in `../results/basecall`.
3. Filter, sort, and index the BAM files according to `basecall-index.slurm`. The final results are `../results/basecall/*.uniq.bam`.
## Raw-signal-based DNA methylation detection using Dorado
1. Modcall the POD5 files using the base-modification mode of `Dorado` according to `modcall.slurm`.
2. Filter, sort, and index the aligned BAM files with modification tags according to `modcall-index.slurm`. The final results are `../results/modcall/*.uniq.bam`.
3. Calculate the DNA methylation level of each CpG site according to `modcall-modbam2bed.slurm`. The results are indexed BED files in `../results/modcall/*.cpg.bed.gz`, which can be randomly accessed with `tabix`.
## Run NanoFreeLunch
### Feature extraction using NanoFreeLunch
NanoFreeLunch takes the aligned and filtered BAM files at `../results/basecall/*.uniq.bam` as the input. The feature extraction module gets features for each CpG site listed in `locifile`, which is obtained according to `get_locifile.sh`.
1. Run NanoFreeLunch for feature extraction according to `nfl-prepdata.slurm`.
2. The results are in `../results/nfl/prepdata/*/hac/chr*`.
### Predict methylation level using NanoFreeLunch
1. Predict DNA methylation level for each CpG site according to `nfl-predict.slurm`.
2. By default, NanoFreeLunch predicts DNA methylation level for each chromosome separately. To merge them, use `nfl-predict-merge.slurm`.
3. The output is in `../results/nfl/predict_merge/*.bed`. The output format is defined in https://github.com/epi2me-labs/modbam2bed?tab=readme-ov-file#method-and-output-format.
## Evaluate NanoFreeLunch
### Evaluate the locus-level accuracy of NanoFreeLunch
1. Calculate the PCCs and heatmaps to compare NanoFreeLunch and Guppy according to `eval.jl`.
2. The figures are in `../results/nfl/eval/scatterplot/`. The PCC matrices are in `../results/nfl/eval/pccmat.tsv`.
### Evaluate the region-level accuracy of NanoFreeLunch
1. Filter out CpG sites with depth lower than 10x or score lower than 800 according to `get_dorado_prediction.sh`.
2. Calculate average DNA methylation level for each CpG island predicted by Dorado according to `dorado-cpg.slurm`.
3. Calculate average DNA methylation level for each CpG island predicted by NanoFreeLunch according to `nfl-cpg.slurm`.
4. Calculate the PCCs and heatmaps to compare average methylation level of CpG islands predcited by NanoFreeLunch and Dorado according to `eval-cpg-fb.jl`. The output is in `../results/nfl/eval/scatterplot_cpg/`.
# **Test NanoFreeLunch on the R10 Ashkenazim trio data (4kHz)**
Enter the `r10_ashkenazim_trio/4khz/`. Install `Dorado` and the POD5 API `pod5-file-format`.
## Data downloading and preparation
1. Enter folder `data/flowcells` by typing `pushd data/flowcells` and download the raw-signal POD5 files according to `download.sh`. The downloaded three folder `hg002`, `hg003`, and `hg004` have the FAST5 files for each sample.
2. Type `popd` to get back.
## Basecall the alignment of the POD5 files
1. Merge the FAST5 files of each sample and convert them to POD5 files according to `modcall-merge-rawfiles.sh` using the POD5 API.
2. Basecall and align the merged POD5 files according to `basecall.slurm` using `Dorado`. The output aligned BAM files are in `../results/basecall`.
3. Filter, sort, and index the BAM files according to `basecall-index.slurm`. The final results are `../results/basecall/*.uniq.bam`.
## Raw-signal-based DNA methylation detection using Dorado
1. Modcall the POD5 files using the base-modification mode of `Dorado` according to `modcall.slurm`.
2. Filter, sort, and index the aligned BAM files with modification tags according to `modcall-index.slurm`. The final results are `../results/modcall/*.uniq.bam`.
3. Calculate the DNA methylation level of each CpG site according to `modcall-modbam2bed.slurm`. The results are indexed BED files in `../results/modcall/*.cpg.bed.gz`, which can be randomly accessed with `tabix`.
## Run NanoFreeLunch
### Feature extraction using NanoFreeLunch
NanoFreeLunch takes the aligned and filtered BAM files at `../results/basecall/*.uniq.bam` as the input. The feature extraction module gets features for each CpG site listed in `locifile`, which is obtained according to `get_locifile.sh`.
1. Run NanoFreeLunch for feature extraction according to `nfl-prepdata.slurm`.
2. The results are in `../results/nfl/prepdata/*/chr*`.
### Predict methylation level using NanoFreeLunch
1. Predict DNA methylation level for each CpG site according to `nfl-predict.slurm`.
2. By default, NanoFreeLunch predicts DNA methylation level for each chromosome separately. To merge them, use `nfl-predict-merge.slurm`.
3. The output is in `../results/nfl/predict_merge/*.bed`. The output format is defined in https://github.com/epi2me-labs/modbam2bed?tab=readme-ov-file#method-and-output-format.
## Evaluate NanoFreeLunch
### Evaluate the locus-level accuracy of NanoFreeLunch
1. Calculate the PCCs and heatmaps to compare NanoFreeLunch and Guppy according to `eval.jl`.
2. The figures are in `../results/nfl/eval/scatterplot/`. The PCC matrices are in `../results/nfl/eval/pccmat.tsv`.
### Evaluate the region-level accuracy of NanoFreeLunch
1. Filter out CpG sites with depth lower than 10x or score lower than 800 according to `get_dorado_prediction.sh`.
2. Calculate average DNA methylation level for each CpG island predicted by Dorado according to `dorado-cpg.slurm`.
3. Calculate average DNA methylation level for each CpG island predicted by NanoFreeLunch according to `nfl-cpg.slurm`.
4. Calculate the PCCs and heatmaps to compare average methylation level of CpG islands predcited by NanoFreeLunch and Dorado according to `eval-cpg-fb.jl`. The output is in `../results/nfl/eval/scatterplot_cpg/`.
### Evaluate NanoFreeLunch in ICR (including the 4kHz or 5kHz data)
1. Plot boxplots of methylation level in ICR predicted by NanoFreeLunch according to `eval-imprinting-4k5k-fb.jl`. The results are in `../results/nfl/imprinting-4k5k/`.
2. Plot boxplots of methylation level in ICR predicted by Dorado according to `dorado-imprinting.slurm` and `dorado-imprinting-viz-4k5k.jl`.
### Compare DNA methylation predicted by NanoFreeLunch with histone modifications and DNase sensitivity (including the 4kHz or 5kHz data)
1. Plot boxplots of methylation level in DNase hypersensitive regions marked by H3K4me3 and H3K9me3 regions predicted by NanoFreeLunch according to `eval-imprinting-4k5k-fb.jl`. The results are in `../results/nfl/encode-4k5k/boxplot_6_samples.fb.html`.
2. Plot boxplots of methylation level in ICR predicted by Dorado according to `dorado-histonecode.slurm` and `dorado-histonecode-viz-4k5k.jl`. The results are in `../results/dorado_prediction_histonecode-4k5k/`.
# License
The software is under GPL license and the `BED` & `TSV` files in `ref` is under CC0 public domain waiver.