# test-datasets-rnaseq

**Repository Path**: bioinfoFungi/test-datasets-rnaseq

## Basic Information

- **Project Name**: test-datasets-rnaseq
- **Description**: Test data for RNA-seq. `git clone https://github.com/nf-core/test-datasets.git --single-branch --branch rnaseq`
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-03-01
- **Last Updated**: 2023-03-01

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# test-datasets: `rnaseq`

This branch contains test data to be used for automated testing with the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline.

## Content of this repository

- `reference/`: Sub-sampled genome reference files (iGenomes **S. cerevisiae** R64-1-1 Ensembl release)
- `testdata/*.fastq.gz`: Historical single-end test data for the pipeline, sub-sampled to ~2000 reads
- `testdata/GSE110004/*.fastq.gz`: Paired-end test data for the pipeline, sub-sampled to 50000 reads
- `samplesheet/samplesheet.csv`: Experiment design file for the minimal test dataset
- `samplesheet/samplesheet_full.csv`: Experiment design file for the full test dataset

## Minimal test dataset origin

The *S. cerevisiae* 101 bp paired-end strand-specific RNA-seq dataset was obtained from:

> Andrew C K Wu, Harshil Patel, Minghao Chia, Fabien Moretto, David Frith, Ambrosius P Snijders, Folkert J van Werven. Repression of Divergent Noncoding Transcription by a Sequence-Specific Transcription Factor. Mol Cell. 2018 Dec 20;72(6):942-954.e7. doi: 10.1016/j.molcel.2018.10.018.

[Pubmed](https://pubmed.ncbi.nlm.nih.gov/30576656/) [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110004)

### Sampling information

| run_accession | experiment_alias | read_count | sample_title                                                              |
|---------------|------------------|------------|---------------------------------------------------------------------------|
| SRR6357070    | GSM2879618       | 47629288   | Wild-type total RNA-Seq biological replicate 1                            |
| SRR6357071    | GSM2879619       | 68628914   | Wild-type total RNA-Seq biological replicate 2                            |
| SRR6357072    | GSM2879620       | 54771596   | Wild-type total RNA-Seq biological replicate 3                            |
| SRR6357073    | GSM2879621       | 56006930   | Rap1-AID degron no induction total RNA-Seq biological replicate 1         |
| SRR6357074    | GSM2879622       | 56259979   | Rap1-AID degron no induction total RNA-Seq biological replicate 2         |
| SRR6357075    | GSM2879623       | 51876040   | Rap1-AID degron no induction total RNA-Seq biological replicate 3         |
| SRR6357076    | GSM2879624       | 54935434   | Rap1-AID degron induction 30 minutes total RNA-Seq biological replicate 1 |
| SRR6357077    | GSM2879625       | 57770345   | Rap1-AID degron induction 30 minutes total RNA-Seq biological replicate 2 |
| SRR6357078    | GSM2879626       | 47537967   | Rap1-AID degron induction 30 minutes total RNA-Seq biological replicate 3 |
| SRR6357079    | GSM2879627       | 56870378   | Rap1-AID degron induction 2 hours total RNA-Seq biological replicate 1    |
| SRR6357080    | GSM2879628       | 59113530   | Rap1-AID degron induction 2 hours total RNA-Seq biological replicate 2    |
| SRR6357081    | GSM2879629       | 48202638   | Rap1-AID degron induction 2 hours total RNA-Seq biological replicate 3    |

### Sampling procedure

1. Given a file called `chrI.fa` containing a single chromosome from _S. cerevisiae_, edit the FASTA entry header to include the taxonomy information as suggested in the Kraken2 manual (see [docs](https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases)), e.g. rename the entry header from `>I` to `>I|kraken:taxid|4932`.
    > NB: This step may not be necessary, but I did it anyway.

2. Build a Kraken2 database for the custom genome:

    ```console
    DBNAME='yeast_chrI'
    kraken2-build --download-taxonomy --db $DBNAME
    kraken2-build --add-to-library chrI.fa --db $DBNAME
    kraken2-build --build --db $DBNAME
    ```

3. (OPTIONAL) Download publicly available FastQ files with the [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline (see [docs](https://nf-co.re/rnaseq/3.0/usage#direct-download-of-public-repository-data)). This also auto-generates a samplesheet that can easily be re-formatted to work as input for nf-core/viralrecon in the next step:

    ```console
    nextflow run nf-core/rnaseq \
        --public_data_ids ids.txt \
        -profile singularity
    ```

4. Run only the Kraken2 process from the [nf-core/viralrecon](https://github.com/nf-core/viralrecon) pipeline to get filtered FastQ files:

    ```console
    nextflow run nf-core/viralrecon \
        --input samplesheet.csv \
        --kraken2_db yeast_chrI/ \
        --fasta chrI.fa \
        --platform illumina \
        --protocol metagenomic \
        --skip_fastqc \
        --skip_fastp \
        --skip_multiqc \
        --skip_assembly \
        --skip_variants \
        -profile singularity \
        -c custom.config
    ```

    The contents of `custom.config` are shown below. They tell the pipeline to also publish the `fastq.gz` files, because this isn't done by default:

    ```nextflow
    params {
        modules {
            'illumina_kraken2_run' {
                publish_files = ['txt':'', 'fastq.gz':'']
            }
        }
    }
    ```

5. The example commands below were used to sub-sample the raw paired-end FastQ files to 50,000 reads (see [seqtk](https://github.com/lh3/seqtk)). Using the same seed (`-s100`) for both files keeps the read pairs in sync:

    ```console
    seqtk sample -s100 SRR6357070.classified_1.fastq.gz 50000 | gzip > SRR6357070_1.fastq.gz
    seqtk sample -s100 SRR6357070.classified_2.fastq.gz 50000 | gzip > SRR6357070_2.fastq.gz
    ```

## Full test dataset origin

The *H. sapiens* paired-end strand-specific RNA-seq dataset was obtained from:

> ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012 Sep 6;489(7414):57-74.
[Pubmed](https://pubmed.ncbi.nlm.nih.gov/22955616/)

The GM12878 and K562 ENCODE data were also used to benchmark RNA-seq quantification pipelines in the paper below:

> Mingxiang Teng, Michael I. Love, Carrie A. Davis, Sarah Djebali, Alexander Dobin, Brenton R. Graveley, Sheng Li, Christopher E. Mason, Sara Olson, Dmitri Pervouchine, Cricket A. Sloan, Xintao Wei, Lijun Zhan, and Rafael A. Irizarry. A benchmark for RNA-seq quantification pipelines. Genome Biol. 2016; 17: 74. Published online 2016 Apr 23. doi: 10.1186/s13059-016-0940-1.

[Pubmed](https://pubmed.ncbi.nlm.nih.gov/27107712/)

| study_alias | run_accession | experiment_alias | encode_library_id | sample_description | instrument_model | library_layout | read_count | sex | fastq_ftp | fastq_md5 |
|-------------|---------------|------------------|-------------------|--------------------|------------------|----------------|------------|-----|-----------|-----------|
| [GSE78551](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78551) | SRR3192657 | GSM2072350 | ENCLB038ZZZ | Homo sapiens GM12878 immortalized cell line | Illumina HiSeq 2000 | PAIRED | 93555584 | female | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/007/SRR3192657/SRR3192657_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/007/SRR3192657/SRR3192657_2.fastq.gz) | f3a3aee0e1f0f54dc9afd8f7c0442aba;6bff7e7d944736251cfbc36e35c3f431 |
| [GSE78551](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78551) | SRR3192658 | GSM2072351 | ENCLB037ZZZ | Homo sapiens GM12878 immortalized cell line | Illumina HiSeq 2000 | PAIRED | 97548052 | female | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/008/SRR3192658/SRR3192658_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/008/SRR3192658/SRR3192658_2.fastq.gz) | f6fdb08100033d98bfcba0801a838bf9;b369f63c5d37e515b4e102fa8c8d75e7 |
| [GSE78557](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78557) | SRR3192408 | GSM2072362 | ENCLB055ZZZ | Homo sapiens K562 immortalized cell line | Illumina HiSeq 2000 | PAIRED | 92172367 | female | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/008/SRR3192408/SRR3192408_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/008/SRR3192408/SRR3192408_2.fastq.gz) | 53815dcaeeb331459ab72bffe0a9432f;e73d0e7b764d96f08cf2caf4a7e880ff |
| [GSE78557](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78557) | SRR3192409 | GSM2072363 | ENCLB056ZZZ | Homo sapiens K562 immortalized cell line | Illumina HiSeq 2000 | PAIRED | 113327735 | female | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/009/SRR3192409/SRR3192409_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/009/SRR3192409/SRR3192409_2.fastq.gz) | 5904c8781f4fd6771a5e9a32696cd49b;b23e23639258c93944ff9a64b08b9f67 |
| [GSE90237](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90237) | SRR5048099 | GSM2400174 | ENCLB555AQN | Homo sapiens MCF-7 immortalized cell line | Illumina Genome Analyzer IIx | PAIRED | 128178110 | female | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/009/SRR5048099/SRR5048099_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/009/SRR5048099/SRR5048099_2.fastq.gz) | c23adfcad78e9162a83e18fc76e7ebfd;fd0c3baabd67659aecf6c88feef30259 |
| [GSE90237](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90237) | SRR5048100 | GSM2400175 | ENCLB555AQO | Homo sapiens MCF-7 immortalized cell line | Illumina Genome Analyzer IIx | PAIRED | 131814222 | female | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/000/SRR5048100/SRR5048100_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/000/SRR5048100/SRR5048100_2.fastq.gz) | f7e732c768e4080311a49e6048c4d515;5619f168e72c5ca27b1b805a91de4444 |
| [GSE90225](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90225) | SRR5048077 | GSM2400152 | ENCLB555AMA | Homo sapiens H1-hESC stem cell male embryo | Illumina Genome Analyzer IIx | PAIRED | 125395196 | male | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/007/SRR5048077/SRR5048077_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/007/SRR5048077/SRR5048077_2.fastq.gz) | 6beb20b2cd99542433986b8fe844ef09;4f63ef9e16dc9f0f8be159b02d40f0c6 |
| [GSE90225](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90225) | SRR5048078 | GSM2400153 | ENCLB555AMB | Homo sapiens H1-hESC stem cell male embryo | Illumina Genome Analyzer IIx | PAIRED | 107101340 | male | [fastq_1](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/008/SRR5048078/SRR5048078_1.fastq.gz) [fastq_2](ftp.sra.ebi.ac.uk/vol1/fastq/SRR504/008/SRR5048078/SRR5048078_2.fastq.gz) | 9c60d407bae58019889b13acb1032116;fc5df7d28daf6df1b212aaac914f1324 |

## Create gff from gtf

If the GTF gene annotation file is updated, the GFF file needs to be regenerated as well. One can use [gffread](https://bioconda.github.io/recipes/gffread/README.html) to perform the conversion:

```console
gffread -F --keep-exon-attrs genes.gtf > genes.gff
```

Explanation of flags:

- `-F` preserves attributes for genes and transcripts, but not for exon features
- `--keep-exon-attrs` is needed because [featureCounts](http://subread.sourceforge.net/) in the [nf-core/rnaseq](https://github.com/nf-core/rnaseq/) pipeline uses the gene type/biotype (e.g. `protein_coding`, `lncRNA`) of the exons to count the number of reads per biotype

## Create the gzipped references

If the reference genomes or gene annotations are updated, the gzipped references need to be updated, too. To make the gzipped references, run the following snippet in the `reference` folder:

```console
for F in $(ls -1 | grep -vE '\.gz$'); do echo "$F" ; gzip -c "$F" > "$F.gz" ; done
```

This looks for files that don't end in `.gz` and writes a compressed copy of each one next to the original, which is left in place.
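
As a sketch only, the same step can be wrapped in a small shell function that avoids parsing `ls` output, copes with file names containing spaces, and checks each new archive with `gzip -t`. The function name `gzip_references` is hypothetical and not part of this repository; the sketch assumes `bash` and `gzip` are available.

```shell
#!/usr/bin/env bash
# Sketch: gzip every regular file in a directory that is not already gzipped.
# `gzip_references` is a hypothetical helper name, not part of the repository.
gzip_references() {
    local dir="${1:-.}"   # directory to process, defaults to the current one
    local f
    for f in "$dir"/*; do
        # Skip directories and anything else that is not a regular file
        [ -f "$f" ] || continue
        # Skip files that already carry the .gz suffix
        case "$f" in
            *.gz) continue ;;
        esac
        # -c writes to stdout, so the uncompressed original is kept
        gzip -c "$f" > "$f.gz"
        # -t tests the integrity of the archive that was just written
        gzip -t "$f.gz" && echo "compressed: $f"
    done
}
```

Once the function is defined, `gzip_references reference` run from the repository root would process the `reference` folder; with no argument it works on the current directory.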