# GLORI-tools

**Repository Path**: sqhan/GLORI-tools

## Basic Information

- **Project Name**: GLORI-tools
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-04-10
- **Last Updated**: 2024-04-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# GLORI-tools v1.0
**GLORI-tools currently works, but is still being optimized for a better user experience.**

*🔥NEW🔥 Many users have run into problems while building the STAR index. Please make sure this step is run with enough memory. You can also restrict the genome to chromosomes 1-22, X, Y and M of hg38. Please ensure that the sequence names in your genome begin with "chr", as we have only tested GLORI on human and mouse genomes.*

## News
* As of June 15th, 2023, we have updated the format of the .totalCR.txt file to make it more user-friendly and easier to read. Please see Section 5.3 for details.
* As of June 23rd, 2023, we have added citation information for GLORI-tools.
* As of June 23rd, 2023, an additional set of run parameters has been added to obtain a comprehensive list of all annotated A sites. Please refer to Section 4.5 for details.

## Table of contents
* Background
* Pre-processing the raw sequencing data
* Installation and Requirement
* Example and Usage
* Maintainers and Contributing
* License

## Background
### GLORI
We developed an absolute m6A quantification method ("GLORI") that is conceptually similar to bisulfite sequencing-based quantification of DNA 5-methylcytosine.
GLORI relies on glyoxal and nitrite-mediated deamination of unmethylated adenosines while keeping m6A intact, thereby achieving specific and efficient m6A detection.

## Pre-processing the raw sequencing data
### Annotations
* GLORI-tools exclusively accepts single-end sequencing reads. Prior to use, it is critical to ensure that the input reads are A-to-G converted reads. Therefore, it is important to carefully verify the orientation of the reads before processing them.
* Before inputting data into GLORI-tools, data cleaning is necessary. This involves removing sequencing adapters, low-quality bases, PCR duplicates based on unique molecular identifiers (UMIs), and finally, removing the UMIs themselves.
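
One quick way to sanity-check read orientation is base composition: if the reads are A-to-G converted in the expected orientation, A should be strongly depleted, whereas reverse-complemented reads would instead be depleted of T. The sketch below illustrates the idea only; the function names and the 5% threshold are assumptions and are not part of GLORI-tools.

```python
from collections import Counter

def base_fractions(sequences):
    """Return the fraction of A, C, G and T across the given read sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: counts[b] / total for b in "ACGT"}

def guess_orientation(sequences, depleted_below=0.05):
    """Guess read orientation from the depleted base (hypothetical threshold)."""
    frac = base_fractions(sequences)
    if frac["A"] < depleted_below <= frac["T"]:
        return "A-to-G converted (expected orientation)"
    if frac["T"] < depleted_below <= frac["A"]:
        return "T-to-C converted (reads look reverse-complemented)"
    return "unclear"
```

Running this on the first few thousand reads of a FASTQ file is usually enough to see which base is depleted.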
### Installation
* trim_galore
* seqkit
* FASTX-Toolkit (including fastx_trimmer)
### Example code
* ```trim_galore -q 20 --stringency 1 -e 0.3 --length {length(5’UMI) +25nt} --path_to_cutadapt {cutadapter} --dont_gzip -o {output_dir1} {inputfile}```
* ```seqkit rmdup -j {Thread} -s -D {output_dupname} {filtered_file} > {filter_file2}```
* ```fastx_trimmer -Q 33 -f {length(5′UMI) +1nt} -i {filter_file2} -o {filter_file3}```
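
The three commands above can also be assembled programmatically once the 5' UMI length is known. The following is a minimal sketch, not part of GLORI-tools: the function name and parameter names are hypothetical and mirror the placeholders used above, and seqkit's `-o` option is used in place of the shell redirect.

```python
def build_preprocess_commands(inputfile, output_dir1, filtered_file, filter_file2,
                              filter_file3, output_dupname, cutadapt_path,
                              umi_len, threads=1):
    """Return the three cleaning commands as argument lists; nothing is executed."""
    trim = ["trim_galore", "-q", "20", "--stringency", "1", "-e", "0.3",
            "--length", str(umi_len + 25),   # 5' UMI length + 25 nt
            "--path_to_cutadapt", cutadapt_path,
            "--dont_gzip", "-o", output_dir1, inputfile]
    dedup = ["seqkit", "rmdup", "-j", str(threads), "-s",
             "-D", output_dupname, filtered_file,
             "-o", filter_file2]             # -o instead of "> file"
    strip_umi = ["fastx_trimmer", "-Q", "33",
                 "-f", str(umi_len + 1),     # first base to keep, after the UMI
                 "-i", filter_file2, "-o", filter_file3]
    return [trim, dedup, strip_umi]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order.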

## GLORI-tools

GLORI-tools is a bioinformatics pipeline tailored for the analysis of high-throughput sequencing data generated by GLORI.
![GLORI pipeline]( https://github.com/liucongcas/GLORI-tools/blob/main/GLORI-pipeline.jpg " GLORI pipeline ")

## Installation and Requirement
### Installation
GLORI-tools is written in Python3 and is executed from the command line. To install GLORI-tools, simply download the code and extract the files into a GLORI-tools installation folder.

### GLORI-tools needs the following tools to be installed and ideally available in the PATH environment:
* STAR ≥ v2.7.5c
* bowtie (bowtie1) version ≥ v1.3.0
* samtools ≥ v1.10
* python ≥ v3.8.3
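
Before running the pipeline, it can help to confirm that the external tools are discoverable. The sketch below only checks that the executables are on the PATH, not their versions; the executable names are assumed to be `STAR`, `bowtie` and `samtools`.

```python
import shutil

REQUIRED_TOOLS = ["STAR", "bowtie", "samtools"]  # assumed executable names

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    print("all tools found" if not missing else "missing: " + ", ".join(missing))
```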

### GLORI-tools needs the following Python packages to be installed:
pysam, pandas, argparse, time, collections, os, sys, re, subprocess, multiprocessing, copy, numpy, scipy, math, sqlite3, Bio, statsmodels, itertools, heapq, glob, signal
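
Most of the modules listed above ship with Python itself; only pysam, pandas, numpy, scipy, Bio (provided by Biopython) and statsmodels need to be installed separately. A small check such as the following sketch (the function name is illustrative) reports which of them are missing from the current environment.

```python
import importlib.util

THIRD_PARTY = ["pysam", "pandas", "numpy", "scipy", "Bio", "statsmodels"]

def missing_modules(modules=THIRD_PARTY):
    """Return the module names that cannot be imported in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules()
    print("all packages found" if not missing else "missing: " + ", ".join(missing))
```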

## Example and Usage:

### 1. Generate annotation files (required)
#### 1.1 download files for annotation (required, using hg38 as example): 
* ``` wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt ```
* ``` wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz ```

#### 1.2 download reference genome and transcriptome
* ``` wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz ```

* ```wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_rna.fna.gz```

#### 1.3 Unify chromosome naming in GTF file and genome file:

Decompress the downloaded `.gz` files first (for example with `gunzip`); the command below expects the uncompressed GTF.

* ```python ./get_anno/change_UCSCgtf.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf -j GCF_000001405.39_GRCh38.p13_assembly_report.txt -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens ```
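
To illustrate what unifying the names involves: the NCBI assembly report is a tab-delimited table whose seventh column holds the RefSeq accession and whose tenth column holds the UCSC-style name. The sketch below builds that mapping and applies it to the first column of a GTF line. It is an assumption about the general approach, not the code in `change_UCSCgtf.py`, and the function names are hypothetical.

```python
def refseq_to_ucsc(report_lines):
    """Map RefSeq accessions (column 7) to UCSC-style names (column 10)."""
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 10:
            continue
        refseq, ucsc = fields[6], fields[9]
        if refseq != "na" and ucsc != "na":
            mapping[refseq] = ucsc
    return mapping

def rename_gtf_line(line, mapping):
    """Replace the sequence name in the first GTF column if a mapping exists."""
    if line.startswith("#"):
        return line
    fields = line.split("\t", 1)
    fields[0] = mapping.get(fields[0], fields[0])
    return "\t".join(fields)
```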

### 2. get reference for reads alignment (required)

#### 2.1 build genome index using STAR

* ``` python ./pipelines/build_genome_index.py -f $genome_fastafile -p 20 -pre hg38 ```

You will get:
* hg38.rvsCom.fa
* hg38.AG_conversion.fa
* the corresponding index from STAR
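
As a conceptual illustration of what these two FASTA files represent, the sketch below assumes that "AG_conversion" means replacing every A with G in the reference and that "rvsCom" means the reverse complement. This is inferred from the file names only; it is not the code in `build_genome_index.py`, which may handle details differently.

```python
# Translation table for complementing DNA while preserving case and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def ag_convert(seq):
    """Replace every A with G, mimicking full A-to-G conversion in silico."""
    return seq.replace("A", "G").replace("a", "g")
```

Aligning converted reads against a converted reference lets reads from unmethylated regions map without mismatches at every converted position.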

#### 2.2 build transcriptome index using bowtie

##### 2.2.1 Get the longest transcript for genes (required)

Get the required annotation table files:

* ```python ./get_anno/gtf2anno.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl``` 

* ```awk '$3!~/_/&&$3!="na"' GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl | sed '/unknown_transcript/d'  > GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2```
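
For readers less familiar with awk and sed, the filter above keeps a row only if its third whitespace-delimited field contains no underscore and is not "na", and then drops any line containing "unknown_transcript". A rough Python equivalent is sketched below. The README does not state what the third field holds (presumably the chromosome name, so that alternative and unplaced contigs are excluded), and the function name is illustrative.

```python
def keep_row(line, col=3):
    """Mirror awk '$col !~ /_/ && $col != "na"' followed by sed '/unknown_transcript/d'."""
    if "unknown_transcript" in line:
        return False
    fields = line.split()  # awk's default whitespace splitting
    value = fields[col - 1] if len(fields) >= col else ""
    return "_" not in value and value != "na"
```

The same function with `col=6` corresponds to the filter applied to the gene list in Section 3.2.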

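Get the longest transcript (the command assumes the downloaded `GCF_000001405.39_GRCh38.p13_rna.fna.gz` has been decompressed and saved as `GCF_000001405.39_GRCh38.p13_rna.fa`):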

* ``` python ./get_anno/selected_longest_transcrpts_fa.py -anno GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 -fafile GCF_000001405.39_GRCh38.p13_rna.fa --outname_prx GCF_000001405.39_GRCh38.p13_rna2.fa```
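
The selection step can be pictured as keeping, for each gene, the transcript with the greatest length. The sketch below shows that idea on plain tuples. The input layout, the function name and the tie-breaking rule (first seen wins) are assumptions, and the actual script may differ.

```python
def longest_transcripts(records):
    """Pick the longest transcript per gene.

    records: iterable of (gene, transcript_id, length) tuples.
    Returns {gene: (transcript_id, length)}; ties keep the first one seen.
    """
    best = {}
    for gene, transcript_id, length in records:
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript_id, length)
    return best
```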

##### 2.2.2 Build reference with bowtie

* ```python ./pipelines/build_transcriptome_index.py -f GCF_000001405.39_GRCh38.p13_rna2.fa -pre GCF_000001405.39_GRCh38.p13_rna2.fa```

You will get:
* GCF_000001405.39_GRCh38.p13_rna2.fa.AG_conversion.fa
* the corresponding index from bowtie

### 3. get_base annotation (optional)

#### 3.1 get annotation at single-base resolution

* ```python ./get_anno/anno_to_base.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 -threads 20 -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.baseanno```

#### 3.2 get required annotation file for further removal of duplicated loci

* ``` python ./get_anno/gtf2genelist.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens -f GCF_000001405.39_GRCh38.p13_rna.fa -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist > output2```

* ```awk '$6!~/_/&&$6!="na"' GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist > GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist2```

#### 3.3 Removal of duplicated loci in the annotation file

* ```python ./get_anno/anno_to_base_remove_redundance_v1.0.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.baseanno -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base -g GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist2```

Finally, you will get annotation files: 
* GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base

### 4. alignment and call sites (required)

GLORI-tools takes cleaned reads as input and reports the A-to-G conversion rate for each gene, together with m6A sites at single-base resolution, where the A rate at each site represents its modification level.

#### 4.1 Example shell scripts

| Used files |
| :--- |
| Thread=1 |
| genomdir=your_dir |
| genome=${genomdir}/hg38.AG_conversion.fa |
| genome2=${genomdir}/hg38.fa |
| rvsgenome=${genomdir}/hg38.rvsCom.fa |
| TfGenome=${genomdir}/GCF_000001405.39_GRCh38.p13_rna2.fa.AG_conversion.fa |
| annodir=your_dir |
| baseanno=${annodir}/GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base |
| anno=${annodir}/GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 |
| outputdir=your_dir |
| tooldir=/tool_dir/GLORI-tools |
| prx=your_prefix |
| file=your_cleaned_reads | 

#### 4.2 Call m6A sites with annotated genes

* ``` python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -b $baseanno -pre ${prx} -o $outputdir --combine --rvs_fac ```

#### 4.3 Call m6A sites without annotated genes.

* ``` python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --rvs_fac ```

* In this situation, the background for each m6A site is the overall conversion rate.

* The site lists obtained by the two methods above are largely the same; only a few sites may differ between them.
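
The commands in Sections 4.2 and 4.3 differ only in whether `-b $baseanno` is passed. A minimal sketch of assembling either variant from the variables in Section 4.1 follows; the function and parameter names are hypothetical, while the flags are the ones shown above.

```python
def build_glori_command(tooldir, reads, threads, genome, genome2, rvsgenome,
                        tf_genome, anno, prefix, outputdir, baseanno=None):
    """Return the run_GLORI.py call as an argument list; nothing is executed."""
    cmd = ["python", f"{tooldir}/run_GLORI.py", "-i", tooldir, "-q", reads,
           "-T", str(threads), "-f", genome, "-f2", genome2,
           "-rvs", rvsgenome, "-Tf", tf_genome, "-a", anno]
    if baseanno is not None:
        cmd += ["-b", baseanno]  # Section 4.2: use the single-base annotation
    cmd += ["-pre", prefix, "-o", outputdir, "--combine", "--rvs_fac"]
    return cmd
```

Passing `baseanno=None` gives the Section 4.3 form, in which the overall conversion rate serves as the background.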

#### 4.4 mapping with samples without GLORI treatment

* ``` python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --untreated ```

#### 4.5 call all the annotated A sites

1) with annotation:
* ``` python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -b $baseanno -pre ${prx} -o $outputdir --combine --rvs_fac -c 1 -C 0 -r 0 -p 1.1 -adp 1.1 -s 0 ```
  
2) or without annotation
* ``` python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --rvs_fac -c 1 -C 0 -r 0 -p 1.1 -adp 1.1 -s 0 ```

### 5. Results:
#### 5.1 Output files of GLORI

| Output files | Interpretation |
| :---: | :---: |
| ${your_prefix}_merged.sorted.bam | The overall mapping of reads in .bam format |
|  ${your_prefix}_referbase.mpi |  The text pileup output from .bam files |
|  ${your_prefix}.totalCR.txt | The text file containing the median value of the overall A-to-G conversion rate for each transcriptome and gene |
|  ${your_prefix}.totalm6A.FDR.csv | The final list of m6A sites obtained, with the A rate serving as the m6A level |
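
The final site list can be post-filtered with a few lines of Python. The sketch below assumes the file has a header row using the column names listed in Section 5.2. The delimiter is left as a parameter because it is not stated here, and the thresholds are placeholders rather than recommended values.

```python
import csv

def filter_sites(handle, min_ratio=0.1, max_padj=0.05, delimiter=","):
    """Return rows whose Ratio and P_adjust pass the given (hypothetical) cutoffs."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader
            if float(row["Ratio"]) >= min_ratio
            and float(row["P_adjust"]) <= max_padj]
```

A typical call would be `filter_sites(open("your_prefix.totalm6A.FDR.csv"))`, switching to `delimiter="\t"` if the file turns out to be tab-separated.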

#### 5.2 GLORI sites files

| Columns | Interpretation |
| :---: | :---: |
| Chr | chromosome |
| Sites | genomic loci |
| Strand | strand |
| Gene | annotated gene |
| CR | conversion rate for genes |
| AGcov | reads coverage with A and G |
| Acov | reads coverage with A |
| Genecov | mean coverage for the whole gene |
| Ratio | A rate (methylation level) for the site |
| Pvalue | test for A rate based on the background |
| P_adjust | FDR adjusted P value |
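
To make the relationship between these columns concrete, the sketch below computes Ratio as Acov / AGcov, tests it against the background with a one-sided binomial test, and applies a Benjamini-Hochberg correction. The choice of a binomial test and of Benjamini-Hochberg is an assumption made for illustration, since the exact tests used by GLORI-tools are not described here. The background is taken as 1 - CR, the chance that an unmethylated A is still read as A.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def site_stats(acov, agcov, conversion_rate):
    """Return (Ratio, one-sided p-value) for a single site."""
    ratio = acov / agcov
    background = 1.0 - conversion_rate  # unconverted fraction expected by chance
    return ratio, binom_sf(acov, agcov, background)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, a site with 5 A reads out of 10 and a gene conversion rate of 0.99 has a Ratio of 0.5 and a very small p-value, because 5 unconverted reads are unlikely when only 1% are expected by chance.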

#### 5.3 GLORI conversion rate (.totalCR) files
| Columns | Interpretation |
| :---: | :---: |
| SA | chromosomes or genes |
| A-to-G_ratio |  A-to-G conversion rate for each chromosome and gene |

## Maintainers and Contributing
* GLORI-tools is developed and maintained by Cong Liu (liucong-1112@pku.edu.cn).
* The development of GLORI-tools is inseparable from the open source software RNA-m5C (https://github.com/SYSU-zhanglab/RNA-m5C).

## License
* Released under MIT license

## Citation
Please cite:
Liu C., Sun H., Yi Y., Shen W., Li K., Xiao Y., et al. (2022). Absolute quantification of single-base m6A methylation in the mammalian transcriptome using GLORI. Nat. Biotechnol. (https://www.nature.com/articles/s41587-022-01487-9)