# Decode_Health_Reditools

**Repository Path**: yongdong323/Decode_Health_Reditools

- **Default Branch**: main
- **Created**: 2023-12-17
- **Last Updated**: 2023-12-17

#### Reference : https://github.com/BioinfoUNIBA/REDItools

# RNA Editing

REDItools are Python scripts developed to study RNA editing at genomic scale using next-generation sequencing data. RNA editing is a post-transcriptional phenomenon involving the insertion, deletion, or substitution of specific bases at precise RNA locations.

## Prerequisite

REDItools require Python 2.7 (available from the official Python web site); Python 3 is not yet supported. In addition, REDItools need two external modules:

- pysam (mandatory), version >= 0.15
- fisher (optional), version 0.1.4, available from the Python web site

To perform Blat correction and alignment format conversion (SAM to BAM and vice versa), the following packages should be installed or already present in your path:

- Blat package, including the gfServer and gfClient executables, or pblat
- Samtools and tabix

## Running REDItools2.0

## Environment set-up

### Create a virtual environment

Once the virtual environment exists (here it is named ENV), activate it:

    source ENV/bin/activate

1. Environment set-up

Make a virtual environment with Python 2.7 and make sure that REDItools is installed properly, as shown below. The detailed installation steps are in the REDItools README tutorial: https://github.com/BioinfoUNIBA/REDItools/blob/master/README_1.md

    git clone https://github.com/BioinfoUNIBA/REDItools
    cd REDItools
    python setup.py install

2. Running REDItools 2.0

First, clone the GitHub repository for REDItools:
`git clone https://github.com/BioinfoUNIBA/REDItools`

Then navigate into the repository and install it:

    cd REDItools
    python setup.py install

You can also run a single sample first, to test the script before running all samples in parallel:

    python2.7 src/cineca/reditools.py -f MS1-11001-P-NN_seq.sorted.STAR.Ev104.cufflinks.bam -r /datadrive/sprintprep/GRCh38.primary_assembly.genome.fa -o reditools2.0/MS1_outputs/MS1-11001-P-NN_seq-table.txt

Choose which samples to run and edit the script parallel_REDItools2.0.sh accordingly. Run it from the REDItools repository folder. For each sample, this generates a table of possible RNA editing sites, stored as parallel_table.txt.gz in that sample's own directory.

### Filtering and processing data

1. The next step is to filter the data so that only edited sites remain, dropping invariant sites, by running `python2.7 RES_main.py`. It is also better to rename the chromosomal regions from "1" to "chr1". After this step, each sample folder contains a parallel_table_filtered.txt file.

2. Next, GTF files (such as the RepeatMasker and RefSeq gene annotations) are used to annotate whether each RNA editing site (RES) lies in an Alu region, and whether it lies in an exon of a particular gene, using the script `filtering_RES.sh`.

Note: the scripts need to be edited to match the samples you use. I used 5 HC and 5 MS samples in this script; if you plan to add more samples, edit the script accordingly.

The output file for each sample is candidates.txt. You can change the filtering cutoff values if you want.
Here are the cutoffs I used:

- `-c 2`: RNA-seq coverage
- `-v 2`: bases supporting RNA-seq variation
- `-f 0.1`: frequency of variation in RNA-seq
- `-e`: exclude multiple substitutions in RNA-seq

The annotation step produces the following files:

- candidates.rmsk.txt: annotated with RepeatMasker, marking Alu and non-Alu sites
- candidates.rmsk.alu.txt: subset to just Alu sites (keeps only positions annotated in SINE regions)
- candidates.rmsk.alu.ann.txt: adds RefSeq gene annotations

3. In this step you can compute statistics on Alu sites, based on the distribution of RNA variants, by running `python2.7 get_statistics.py`. Alu elements are primate-specific repeats that comprise 11% of the human genome and have wide-ranging influences on gene expression. This generates a file called editing_statistics.txt inside each sample folder.

4. Finally, all samples are combined for downstream analysis by running `python2.7 combine_samples.py`. This merges the candidates.txt files from each sample and generates all_candidates.txt in the main folder /mnt/iquityazurefileshare1/Test/MiniTestData_HCvMN/redioutput/. It also generates a file called candidates_alu_annotated.txt inside each sample folder.
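
The filtering logic from steps 1 and 2 (drop invariant sites, rename "1" to "chr1", apply the `-c`, `-v`, `-f` and `-e` cutoffs) can be sketched in plain Python. This is only an illustration and is not the code in RES_main.py or filtering_RES.sh, which are not shown here. It assumes a tab-separated table with the columns Region, Position, Reference, Strand, Coverage-q30, MeanQ, BaseCount[A,C,G,T], AllSubs, Frequency, and that invariant sites carry `-` in AllSubs. It also treats each cutoff as a minimum, which the list above does not state. The function names and example rows are hypothetical.

```python
import ast

BASES = "ACGT"  # assumed order of the BaseCount column


def parse_row(line):
    """Pick out the fields used for filtering from one tab-separated row."""
    f = line.rstrip("\n").split("\t")
    return {
        "region": f[0],
        "position": int(f[1]),
        "coverage": int(f[4]),
        "base_count": ast.literal_eval(f[6]),  # e.g. "[15, 0, 5, 0]"
        "all_subs": f[7],                      # e.g. "AG", "AC AG" or "-"
        "frequency": float(f[8]),
    }


def normalize_region(region):
    """Rename chromosome "1" to "chr1"; leave names that already have the prefix."""
    return region if region.startswith("chr") else "chr" + region


def keep_site(site, min_cov=2, min_var=2, min_freq=0.1, exclude_multi=True):
    """Apply cutoffs mirroring -c 2, -v 2, -f 0.1 and -e (assumed to be minimums)."""
    if site["all_subs"] == "-":          # invariant site, no substitution seen
        return False
    subs = site["all_subs"].split()
    if exclude_multi and len(subs) > 1:  # more than one substitution type
        return False
    # Reads supporting the variant base, e.g. the G count for an "AG" substitution.
    var_reads = max(site["base_count"][BASES.index(s[1])] for s in subs)
    return (site["coverage"] >= min_cov
            and var_reads >= min_var
            and site["frequency"] >= min_freq)


# Hypothetical rows, made up for this example.
rows = [
    "1\t1000\tA\t1\t20\t37.50\t[15, 0, 5, 0]\tAG\t0.25",
    "1\t1001\tC\t1\t30\t38.00\t[0, 30, 0, 0]\t-\t0.00",
    "chr2\t500\tA\t1\t40\t36.00\t[36, 2, 2, 0]\tAC AG\t0.05",
    "2\t600\tT\t0\t50\t36.00\t[0, 1, 0, 49]\tTC\t0.02",
]

sites = [parse_row(r) for r in rows]
filtered = [
    (normalize_region(s["region"]), s["position"], s["all_subs"])
    for s in sites
    if keep_site(s)
]
print(filtered)
```

Only the first row passes all the cutoffs. The second is invariant, the third has two substitution types, and the fourth has too few supporting reads and too low a frequency. Passing the cutoffs as keyword arguments makes it easy to try different values without editing the function.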