# Decode_Health_Reditools

**Repository Path**: yongdong323/Decode_Health_Reditools

## Basic Information

- **Project Name**: Decode_Health_Reditools
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-17
- **Last Updated**: 2023-12-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

RNA Editing: Running REDItools2.0 (parallel)

#### Reference : https://github.com/BioinfoUNIBA/REDItools
# RNA Editing
REDItools are Python scripts developed to study RNA editing at genomic scale using next-generation sequencing data. RNA editing is a post-transcriptional phenomenon involving the insertion, deletion, or substitution of specific bases at precise RNA locations.
## Prerequisite
REDItools require Python 2.7 (available at the official Python web site); Python 3 is not yet supported. In addition, REDItools need two external modules:

- pysam (mandatory), version >= 0.15
- fisher (optional), version 0.1.4, available at the Python web site

To perform Blat correction and alignment format conversion (SAM to BAM and vice versa), the following packages should be installed or already present in your path:

- Blat package, including the gfServer and gfClient executables, or pblat
- Samtools and tabix
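
Before running anything, it can help to confirm that the external tools listed above are reachable. The following is a minimal sketch, not part of REDItools or of this repository: the helper name `find_missing_tools` and the `REQUIRED_TOOLS` list are hypothetical, and the snippet uses `shutil.which`, so it is meant to be run with a Python 3 interpreter outside the Python 2.7 environment.

```python
import shutil

# Executables named in the prerequisites above; adjust to your installation
# (for example, pblat can stand in for the gfServer/gfClient pair).
REQUIRED_TOOLS = ["samtools", "tabix", "gfServer", "gfClient"]


def find_missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

The `which` argument is only there so the lookup can be replaced in tests; in normal use the default is enough.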

## Running REDItools2.0 
## Environment set-up
### Create a virtual environment 
After creating the virtual environment (named ENV here), activate it:
source ENV/bin/activate

1. Environment set-up
Make a virtual environment with Python 2.7 and ensure that REDItools is installed properly. The detailed installation steps are in the REDItools README tutorial:
https://github.com/BioinfoUNIBA/REDItools/blob/master/README_1.md

git clone https://github.com/BioinfoUNIBA/REDItools
cd REDItools
python setup.py install

2. Running REDItools 2.0
First, clone the GitHub repository for REDItools.
git clone https://github.com/BioinfoUNIBA/REDItools
Navigate to this repo and install it.
cd REDItools
python setup.py install

You can also run a single sample first to test the script before running all samples in parallel.
python2.7 src/cineca/reditools.py -f MS1-11001-P-NN_seq.sorted.STAR.Ev104.cufflinks.bam -r /datadrive/sprintprep/GRCh38.primary_assembly.genome.fa -o reditools2.0/MS1_outputs/MS1-11001-P-NN_seq-table.txt

Choose which samples to run, edit the script parallel_REDItools2.0.sh accordingly, and run it from the REDItools GitHub repo folder.
This generates a table of possible RNA editing sites for each sample, saved as parallel_table.txt.gz in that sample's own directory.
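
The contents of parallel_REDItools2.0.sh are not shown here, so the following is only a sketch of the same idea in Python: build the single-sample command from above for each BAM file and run a few of them at a time. The function names, the `workers` setting, and the rule for deriving a sample name from the BAM file name are all assumptions made for illustration. Only the `-f`, `-r`, and `-o` flags are taken from the command above.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(bam, reference, out_table, script="src/cineca/reditools.py"):
    """Build the single-sample command shown above as an argument list."""
    return ["python2.7", script, "-f", bam, "-r", reference, "-o", out_table]


def run_samples(bams, reference, out_root, workers=2, runner=subprocess.call):
    """Run one REDItools process per BAM file, `workers` at a time.

    Returns a dict mapping each sample name to the runner's return value
    (the exit code when the default subprocess.call is used).
    """
    jobs = {}
    for bam in bams:
        # Hypothetical naming: sample name = BAM file name up to the first dot.
        sample = os.path.basename(bam).split(".")[0]
        # Hypothetical layout: one output directory per sample.
        out_table = os.path.join(out_root, sample, "parallel_table.txt")
        jobs[sample] = build_command(bam, reference, out_table)
    # Threads are sufficient here because the heavy work happens in the
    # child processes, not in this interpreter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, jobs.values()))
    return dict(zip(jobs.keys(), codes))
```

This sketch does not create the per-sample output directories or compress the tables; both would need to be handled separately.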

### Filtering and processing data
 
1. The next step is to filter the data to include only edited sites by dropping invariant sites. Run the script:
 python2.7 RES_main.py

It is also better to rename chromosomal regions from "1" to "chr1". Once you run this step, you will have a parallel_table_filtered.txt file in each sample folder.
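
The filtering idea can be sketched as follows. This is not the code of RES_main.py, which is not shown here. It assumes a tab-separated table with a header row, that invariant sites carry "-" in the AllSubs column, and that AllSubs is the eighth column (index 7) with the region in the first; check these assumptions against your own tables before relying on them. The function name is hypothetical.

```python
def filter_edited_sites(lines, allsubs_col=7, region_col=0):
    """Yield the header plus every row that has at least one substitution.

    `lines` is an iterable of tab-separated rows whose first row is a header.
    Assumption: invariant sites carry "-" in the AllSubs column.
    """
    rows = iter(lines)
    header = next(rows, None)
    if header is None:
        return  # empty input: nothing to yield
    yield header.rstrip("\n")
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= allsubs_col or fields[allsubs_col] == "-":
            continue  # skip malformed rows and invariant sites
        if not fields[region_col].startswith("chr"):
            fields[region_col] = "chr" + fields[region_col]  # "1" -> "chr1"
        yield "\t".join(fields)
```

Because the function works on any iterable of lines, it can be fed an open file handle and its output written line by line to parallel_table_filtered.txt.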

2. Next, gtf files (such as RepeatMasker and RefSeq gene annotations) are used to annotate whether each RNA editing site (RES) is in an Alu region, and whether it is in an exon of a particular gene. Run the script:
 filtering_RES.sh
Note: the script needs to be edited based on the samples you use. I used 5 HC and 5 MS samples in this script; if you plan to add more samples, edit the script accordingly. The output file for each sample is candidates.txt.
You can set different filtering cutoff values. Here are the cutoffs I used:

- -c 2: RNA-seq coverage
- -v 2: bases supporting RNA-seq variation
- -f 0.1: frequency of variation in RNA-seq
- -e: exclude multiple substitutions in RNA-seq

Annotation files:

- candidates.rmsk.txt: annotates Alu and non-Alu sites using RepeatMasker
- candidates.rmsk.alu.txt: subsets to just Alu sites (only keeps positions annotated in SINE regions)
- candidates.rmsk.alu.ann.txt: adds RefSeq gene annotations
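
To make the four cutoffs concrete, here is a sketch of how they could be applied to one row of a table. It is not the code behind filtering_RES.sh. It assumes that BaseCount is a string such as "[8, 0, 2, 0]" in A, C, G, T order, that AllSubs holds space-separated pairs such as "AG", and that each cutoff is an inclusive minimum. The function name and parameter names are hypothetical.

```python
import ast

BASES = "ACGT"  # assumed order of the BaseCount column


def passes_cutoffs(coverage, base_count, all_subs, frequency,
                   min_cov=2, min_var=2, min_freq=0.1, exclude_multi=True):
    """Return True if a site satisfies the -c, -v, -f and -e style cutoffs."""
    subs = [s for s in all_subs.split() if s != "-"]
    if not subs:
        return False  # invariant site, nothing to keep
    if exclude_multi and len(subs) > 1:
        return False  # -e: more than one substitution type at this position
    counts = dict(zip(BASES, ast.literal_eval(base_count)))
    # Reads supporting the variation: count of the substituted base,
    # taking the best-supported one if several substitutions are allowed.
    supporting = max(counts[s[1]] for s in subs)
    return (int(coverage) >= min_cov          # -c
            and supporting >= min_var         # -v
            and float(frequency) >= min_freq)  # -f
```

For example, a site with coverage 10, base counts "[8, 0, 2, 0]", substitution "AG", and frequency 0.20 passes under the default values, while the same site with only one supporting G read does not.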
 
3. In this step you can compute statistics on Alu sites, based on the distribution of RNA variants. Alu elements are primate-specific repeats that comprise 11% of the human genome and have wide-ranging influences on gene expression. Run the script:
 python2.7 get_statistics.py
This generates a file called editing_statistics.txt inside each sample folder.
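
One simple statistic of this kind is the distribution of substitution types across the candidate sites. The sketch below shows one way to tabulate it; it is not the code of get_statistics.py, and the function name and the input format (one AllSubs value per site, with "-" meaning no substitution) are assumptions.

```python
from collections import Counter


def substitution_distribution(all_subs_values):
    """Count each substitution type and its share of all substitutions.

    `all_subs_values` is an iterable of AllSubs strings such as "AG" or
    "AC AG"; "-" entries are ignored.
    Returns a dict mapping substitution -> (count, fraction).
    """
    counts = Counter()
    for value in all_subs_values:
        for sub in value.split():
            if sub != "-":
                counts[sub] += 1
    total = sum(counts.values())
    # With no substitutions at all, counts is empty and no division happens.
    return {sub: (n, float(n) / total) for sub, n in counts.items()}
```

A-to-I editing is read as A-to-G in sequencing data (or T-to-C on the opposite strand), so the share of AG and TC entries is usually the first number to look at.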
 
4. Finally, all samples are combined for downstream analysis. This merges the candidates.txt files from each sample and generates all_candidates.txt in the main folder /mnt/iquityazurefileshare1/Test/MiniTestData_HCvMN/redioutput/. Run the script:

python2.7 combine_samples.py
This will also generate a file called candidates_alu_annotated.txt inside each sample folder.
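
The merge itself can be sketched in a few lines. This is not the code of combine_samples.py. It assumes that every candidates.txt starts with the same header row and that adding a leading "Sample" column is an acceptable way to keep track of where each row came from; the function name is hypothetical.

```python
def combine_candidates(tables):
    """Merge per-sample tables into one list of tab-separated rows.

    `tables` maps a sample name to an iterable of lines, header first.
    The header is kept once and every data row is prefixed with its sample.
    """
    combined = []
    header = None
    for sample, lines in tables.items():
        rows = [line.rstrip("\n") for line in lines if line.strip()]
        if not rows:
            continue  # sample with an empty table
        if header is None:
            header = "Sample\t" + rows[0]
            combined.append(header)
        combined.extend(sample + "\t" + row for row in rows[1:])
    return combined
```

In practice each value in `tables` could be the list of lines read from one sample's candidates.txt, and the returned rows could be joined with newlines and written to all_candidates.txt.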