# RNA-seq_variant-calling

This workflow performs germline variant calling on RNA-seq data. It follows the [GATK RNAseq short variant discovery workflows](https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-) and uses [VAtools](https://vatools.readthedocs.io/en/latest/) for data cleaning.

## Pipeline workflow

![image](https://github.com/Tina04021997/RNA-seq_variant-calling/blob/main/RNA-seq%20variant%20calling%20workflow.jpg)

## Environments

- STAR v2.7.8a
- Picard v2.23.4
- GATK v4.1.9.0
- VEP v101
- VAtools v4.1.0

## Input data

- Paired-end FASTQ files

## Output data

- TSV files converted from the annotated VCF files

## Reference data

**mm10 resource bundle**: see [Create GATK mm10 resource bundle](https://github.com/igordot/genomics/blob/master/workflows/gatk-mouse-mm10.md).
- If you encounter problems while concatenating the dbSNP VCF files, try this:

  ```
  # vcf_chr_number is a placeholder: compress and index each per-chromosome file,
  # then pass all of the compressed files to bcftools concat
  bgzip -c vcf_chr_number.vcf > vcf_chr_number.vcf.gz
  tabix vcf_chr_number.vcf.gz
  bcftools concat vcf_chr_number.vcf.gz vcf_chr_number.vcf.gz -Oz -o dbSNP.vcf.gz
  ```

- If you encounter problems while sorting the MGP indels VCF file, try this:

  ```
  # Copy the header lines first, then append the records sorted by chromosome and position
  grep "^#" mgp.v5.indels.pass.chr.vcf > indels.vcf && \
  grep -v "^#" mgp.v5.indels.pass.chr.vcf | sort -V -k1,1 -k2,2n >> indels.vcf
  ```

**GENCODE mm10 FASTA (PRI) and GTF files**

```
wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/GRCm38.primary_assembly.genome.fa.gz
gunzip GRCm38.primary_assembly.genome.fa.gz
wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/gencode.vM22.annotation.gtf.gz
gunzip gencode.vM22.annotation.gtf.gz
```

## Notes

1. Before running SplitNCigarReads.sh, create the sequence dictionary and the FASTA index for your reference:
   - `gatk CreateSequenceDictionary -R ref.fasta`
   - `samtools faidx ref.fasta`
2. For variant annotation, download the VEP cache into a new directory before running Annotation.sh. The cache version must match your VEP version.
   - `mkdir .vep`
   - `cd .vep`
   - `curl -O http://ftp.ensembl.org/pub/release-101/variation/indexed_vep_cache/mus_musculus_vep_101_GRCm38.tar.gz`
   - `tar xzf mus_musculus_vep_101_GRCm38.tar.gz`
3. To make the annotated VCF file human-readable, convert it to TSV with VAtools:
   - install VAtools with `pip install VAtools`
   - run `vep-annotation-reporter input.vcf vep_fields -o output.tsv`

## References

- https://groups.google.com/g/rna-star/c/Cpsf-_rLK9I
- https://yulijia.net/en/howto/bioinformatics/2020/04/09/working-with-VCF-files.html
- https://www.biostars.org/p/237829/
- https://asia.ensembl.org/info/docs/tools/vep/vep_formats.html#output
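
The header-preserving sort suggested for the MGP indels VCF under Reference data can be tried on a toy file before running it on real data. The sketch below is only an illustration: the temporary directory, the file names (`toy.unsorted.vcf`, `toy.sorted.vcf`) and the four records are made up, and it assumes a `sort` that supports `-V` (version sort), such as GNU coreutils.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C  # fixed locale so the sort order is reproducible

# Hypothetical toy input; these paths and records are illustrative only.
workdir=$(mktemp -d)
in_vcf="$workdir/toy.unsorted.vcf"
out_vcf="$workdir/toy.sorted.vcf"

{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr10\t500\t.\tA\tAT\t.\tPASS\t.\n'
  printf 'chr2\t3000\t.\tG\tGA\t.\tPASS\t.\n'
  printf 'chr2\t200\t.\tC\tCT\t.\tPASS\t.\n'
  printf 'chr1\t1500\t.\tT\tTA\t.\tPASS\t.\n'
} > "$in_vcf"

# Step 1: copy the header lines (those starting with '#') unchanged.
grep "^#" "$in_vcf" > "$out_vcf"

# Step 2: append the records, version-sorted by chromosome name
# (so chr2 comes before chr10) and then numerically by position.
grep -v "^#" "$in_vcf" | sort -V -k1,1 -k2,2n >> "$out_vcf"

# Show chromosome and position of the sorted records.
grep -v "^#" "$out_vcf" | cut -f1,2
# prints:
# chr1	1500
# chr2	200
# chr2	3000
# chr10	500
```

Splitting the header from the records matters because sorting the whole file at once would move the `#` lines out of the first rows and produce an invalid VCF.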