# analysis-TAGADA

**Repository Path**: zhangtong-swu/analysis-TAGADA

## Basic Information

- **Project Name**: analysis-TAGADA
- **Description**: TAGADA, an RNA-seq pipeline for transcript and gene assembly, deconvolution, and analysis. Given a genome sequence, a reference annotation, and RNA-seq reads, TAGADA enhances existing gene models by producing an improved annotation. It can also compute expression values for the reference and novel annotations, identify long non-coding transcripts (lncRNA), and provide a comprehensive quality-control report. TAGADA is developed with Nextflow DSL2, offers user-friendly features, and ensures reproducibility across computing platforms through its containerized environments.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-12-15
- **Last Updated**: 2023-12-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# TAGADA: Transcript And Gene Assembly, Deconvolution, Analysis

TAGADA is a Nextflow pipeline that processes RNA-Seq data. It parallelizes multiple tasks to control reads quality, align reads to a reference genome, assemble new transcripts to create a novel annotation, and quantify genes and transcripts.

## Table of contents

- [Dependencies](#dependencies)
- [Usage](#usage)
  - [Nextflow options](#nextflow-options)
  - [Input and output options](#input-and-output-options)
  - [Merge options](#merge-options)
  - [Assembly options](#assembly-options)
  - [Skip options](#skip-options)
  - [Resources options](#resources-options)
- [Custom resources](#custom-resources)
  - [Example configuration](#example-configuration)
- [Metadata](#metadata)
  - [Example metadata](#example-metadata)
- [Merging inputs](#merging-inputs)
  - [Merging inputs by a single factor](#merging-inputs-by-a-single-factor)
  - [Merging inputs by an intersection of factors](#merging-inputs-by-an-intersection-of-factors)
- [Workflow and results](#workflow-and-results)
- [Novel annotation](#novel-annotation)
- [Funding](#funding)

## Dependencies

To use this pipeline you will need:

- [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html) >= 21.04.1
- [Docker](https://docs.docker.com/engine/install/) >= 19.03.2 or
  [Singularity](https://sylabs.io/guides/3.5/user-guide/quick_start.html) >= 3.7.3

## Usage

A small dataset is provided to test this pipeline. To try it out, use this command:

```
nextflow run FAANG/analysis-TAGADA -profile test,docker -revision 2.1.2 --output directory
```

### Nextflow options

The pipeline is written in Nextflow, which provides the following default options:
| Option | Example | Description | Required |
| --- | --- | --- | --- |
| `-profile` | `profile1,profile2,etc.` | Profile(s) to use when running the pipeline. Specify the profiles that fit your infrastructure among `singularity`, `docker`, `kubernetes`, `slurm`. | Required |
| `-config` | `custom.config` | Configuration file tailored to your infrastructure and dataset.<br><br>To find a configuration file for your infrastructure, browse nf-core configs.<br><br>Some large datasets require more computing resources than the pipeline defaults. To specify custom resources for specific processes, see the [custom resources](#custom-resources) section. | Optional |
| `-revision` | `version` | Version of the pipeline to launch. | Optional |
| `-work-dir` | `directory` | Work directory where all temporary files are written. | Optional |
| `-resume` | | Resume the pipeline from the last completed process. | Optional |
For more Nextflow options, see [Nextflow's documentation](https://www.nextflow.io/docs/latest/cli.html#run).

### Input and output options
| Option | Example | Description | Required |
| --- | --- | --- | --- |
| `--output` | `directory` | Output directory where all results are written. | Required |
| `--reads` | `'path/to/reads/*'` | Input fastq file(s) and/or bam file(s).<br><br>For single-end reads, your files must end with: `.fq[.gz]`<br><br>For paired-end reads, your files must end with: `_[R]{1,2}.fq[.gz]`<br><br>For aligned reads, your files must end with: `.bam`<br><br>If the provided path includes a wildcard character like `*`, you must enclose it with quotes to prevent Bash glob expansion, as per Nextflow's requirements.<br><br>If the files are numerous, you may provide a `.txt` sheet with one path or url per line. | Required |
| `--annotation` | `annotation.gtf` | Input reference annotation file or url. | Required |
| `--genome` | `genome.fa` | Input genome sequence file or url. | Required |
| `--index` | `directory` | Input genome index directory or url. | Optional, to skip genome indexing |
| `--metadata` | `metadata.tsv` | Input tabulated metadata file or url. | Required if `--assemble-by` or `--quantify-by` are provided |
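The read-file naming rules above can be sketched in a few lines of Python. This is an illustrative snippet, not the pipeline's actual matching code, and the helper name is hypothetical; it assumes `_[R]{1,2}` denotes an `_1`/`_2` or `_R1`/`_R2` mate suffix:

```python
import re

def classify_read_file(path):
    """Classify an input file by the suffix rules described above.

    Returns ('bam', None), ('paired', mate_number), or ('single', None).
    Illustrative only; TAGADA performs its own matching internally.
    """
    if path.endswith('.bam'):
        return ('bam', None)
    # Paired-end: name ends with _1/_2 or _R1/_R2 before .fq[.gz]
    paired = re.search(r'_R?([12])\.fq(\.gz)?$', path)
    if paired:
        return ('paired', int(paired.group(1)))
    # Single-end: plain .fq or .fq.gz with no mate suffix
    if re.search(r'\.fq(\.gz)?$', path):
        return ('single', None)
    raise ValueError(f'Unrecognized input file: {path}')
```

For example, `classify_read_file('path/to/A_R1.fq')` returns `('paired', 1)`, while `classify_read_file('path/to/B.fq.gz')` returns `('single', None)`.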
### Merge options
| Option | Example | Description | Required |
| --- | --- | --- | --- |
| `--assemble-by` | `factor1,factor2,etc.` | Factor(s) defining groups in which transcripts are assembled. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the [merging inputs](#merging-inputs) section for details. | Optional |
| `--quantify-by` | `factor1,factor2,etc.` | Factor(s) defining groups in which transcripts are quantified. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the [merging inputs](#merging-inputs) section for details. | Optional |
### Assembly options
| Option | Example | Description | Required |
| --- | --- | --- | --- |
| `--min-transcript-occurrence` | `2` | After transcripts assembly, rare novel transcripts that appear in few assembly groups are removed from the final novel annotation. By default, if a transcript occurs in less than 2 assembly groups, it is removed. If there is only one assembly group, this option defaults to 1. | Optional |
| `--min-monoexonic-occurrence` | `2` | If specified, rare novel monoexonic transcripts are filtered according to the provided threshold. Otherwise, this option takes the value of `--min-transcript-occurrence`. | Optional |
| `--min-transcript-tpm` | `0.1` | After transcripts assembly, novel transcripts with low TPM values in every assembly group are removed from the final novel annotation. By default, if a transcript's TPM value is lower than 0.1 in every assembly group, it is removed. | Optional |
| `--min-monoexonic-tpm` | `1` | If specified, novel monoexonic transcripts with low TPM values are filtered according to the provided threshold. Otherwise, this option takes the value of `--min-transcript-tpm` * 10. | Optional |
| `--coalesce-transcripts-with` | `tmerge` | Tool used to coalesce transcripts assemblies into a non-redundant set of transcripts for the novel annotation. Can be `tmerge` or `stringtie`. Defaults to `tmerge`. | Optional |
| `--tmerge-args` | `'--endFuzz 10000'` | Custom arguments to pass to tmerge when coalescing transcripts. | Optional |
| `--feelnc-filter-args` | `'--size 200'` | Custom arguments to pass to FEELnc's filter script when detecting long non-coding transcripts. | Optional |
| `--feelnc-codpot-args` | `'--mode shuffle'` | Custom arguments to pass to FEELnc's coding potential script when detecting long non-coding transcripts. | Optional |
| `--feelnc-classifier-args` | `'--window 10000'` | Custom arguments to pass to FEELnc's classifier script when detecting long non-coding transcripts. | Optional |
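The occurrence and TPM thresholds above fall back on one another in a specific order. As a minimal Python sketch of that default logic (the function name and return shape are hypothetical, not the pipeline's actual code):

```python
def resolve_filter_thresholds(
    n_assembly_groups,
    min_transcript_occurrence=None,
    min_monoexonic_occurrence=None,
    min_transcript_tpm=0.1,
    min_monoexonic_tpm=None,
):
    """Resolve filter defaults as described in the assembly options.

    Illustrative only: the occurrence threshold defaults to 2, or 1
    when there is a single assembly group; monoexonic thresholds fall
    back to the general ones (TPM falls back to 10x the general TPM).
    """
    if min_transcript_occurrence is None:
        min_transcript_occurrence = 1 if n_assembly_groups == 1 else 2
    if min_monoexonic_occurrence is None:
        min_monoexonic_occurrence = min_transcript_occurrence
    if min_monoexonic_tpm is None:
        min_monoexonic_tpm = min_transcript_tpm * 10
    return {
        'occurrence': min_transcript_occurrence,
        'monoexonic_occurrence': min_monoexonic_occurrence,
        'tpm': min_transcript_tpm,
        'monoexonic_tpm': min_monoexonic_tpm,
    }
```

With four assembly groups and no options given, this yields an occurrence threshold of 2 and a monoexonic TPM threshold of 1.0; with a single assembly group, the occurrence threshold drops to 1.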
### Skip options
| Option | Example | Description | Required |
| --- | --- | --- | --- |
| `--skip-assembly` | | Skip transcripts assembly with StringTie and skip all subsequent processes working with a novel annotation. | Optional |
| `--skip-lnc-detection` | | Skip detection of long non-coding transcripts in the novel annotation with FEELnc. | Optional |
### Resources options
| Option | Example | Description | Required |
| --- | --- | --- | --- |
| `--max-cpus` | `16` | Maximum number of CPU cores that can be used for each process. This is a limit, not the actual number of requested CPU cores. | Optional |
| `--max-memory` | `64GB` | Maximum memory that can be used for each process. This is a limit, not the actual amount of allotted memory. | Optional |
| `--max-time` | `24h` | Maximum time that can be spent on each process. This is a limit and has no effect on the duration of each process. | Optional |
## Custom resources

With large datasets, some [workflow processes](#workflow-and-results) may require more computing resources than the pipeline defaults. To customize the amount of resources allotted to specific processes, add a [process scope](https://www.nextflow.io/docs/edge/config.html#scope-process) to your configuration file. Resources provided in the configuration file override the [resources options](#resources-options).

### Example configuration

```
-config custom.config
```

`custom.config`

```
process {
  withName: TRIMGALORE_trim_adapters {
    cpus = 8
    memory = 18.GB
    time = 36.h
  }
  withName: STAR_align_reads {
    cpus = 16
    memory = 64.GB
    time = 2.d
  }
}
```

## Metadata

Using `--metadata`, you may provide a file describing your inputs with tab-separated factors. The first column must contain file names without file type extensions or paired-end suffixes. There are no constraints on column names or number of columns.

### Example metadata

```
--reads reads.txt --metadata metadata.tsv
```

`reads.txt`

```
path/to/A_R1.fq
path/to/A_R2.fq
path/to/B.fq.gz
path/to/C.bam
path/to/D.fq
```

`metadata.tsv`

```
input  tissue  stage
A      liver   30 days
B      liver   30 days
C      liver   60 days
D      muscle  60 days
```

## Merging inputs

When using `--assemble-by` and/or `--quantify-by`, your inputs are merged into experiment groups that share common factors. With `--assemble-by`, transcripts assembly is done individually for each assembly group, and consensus transcripts are kept to generate a novel annotation. With `--quantify-by`, quantification values are given individually for each quantification group.

### Merging inputs by a single factor

```
--assemble-by tissue --quantify-by stage
```
| Metadata (input, tissue, stage) | Transcripts assembly by tissue | Annotation | Quantification by stage |
| --- | --- | --- | --- |
| A: liver, 30 days<br>B: liver, 30 days<br>C: liver, 60 days<br>D: muscle, 60 days | A, B, C → liver<br>D → muscle | liver, muscle → novel annotation | A, B → 30 days<br>C, D → 60 days |
### Merging inputs by an intersection of factors

```
--assemble-by tissue,stage
```

| Metadata (input, tissue, stage) | Transcripts assembly by tissue and stage | Annotation | Quantification by input |
| --- | --- | --- | --- |
| A: liver, 30 days<br>B: liver, 30 days<br>C: liver, 60 days<br>D: muscle, 60 days | A, B → liver at 30 days<br>C → liver at 60 days<br>D → muscle at 60 days | liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation | A → A<br>B → B<br>C → C<br>D → D |
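Both merging modes boil down to grouping inputs by a tuple of factor values taken from the metadata. A minimal Python sketch using the example metadata above (illustrative only; the function name is hypothetical, not the pipeline's implementation):

```python
from collections import defaultdict

def merge_groups(metadata, factors):
    """Group input names by the combination of the given factor values.

    metadata: list of dicts, one per input, as parsed from metadata.tsv.
    factors:  column names, e.g. ['tissue'] or ['tissue', 'stage'].
    """
    groups = defaultdict(list)
    for row in metadata:
        key = tuple(row[f] for f in factors)
        groups[key].append(row['input'])
    return dict(groups)

metadata = [
    {'input': 'A', 'tissue': 'liver',  'stage': '30 days'},
    {'input': 'B', 'tissue': 'liver',  'stage': '30 days'},
    {'input': 'C', 'tissue': 'liver',  'stage': '60 days'},
    {'input': 'D', 'tissue': 'muscle', 'stage': '60 days'},
]

# --assemble-by tissue: two assembly groups (liver and muscle)
assembly = merge_groups(metadata, ['tissue'])
# --assemble-by tissue,stage: three assembly groups
intersection = merge_groups(metadata, ['tissue', 'stage'])
```

Here `assembly` maps `('liver',)` to `['A', 'B', 'C']` and `('muscle',)` to `['D']`, matching the single-factor table above.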
## Workflow and results

The pipeline executes the following processes:

1. `FASTQC_control_reads` Control reads quality with [FastQC](https://github.com/s-andrews/FastQC).
2. `TRIMGALORE_trim_adapters` Trim adapters with [Trim Galore](https://github.com/FelixKrueger/TrimGalore).
3. `STAR_index_genome` Index genome with [STAR](https://github.com/alexdobin/STAR). The indexed genome is saved to `output/index`.
4. `STAR_align_reads` Align reads to the indexed genome with [STAR](https://github.com/alexdobin/STAR). Aligned reads are saved to `output/alignment` in `.bam` files.
5. `BEDTOOLS_compute_coverage` Compute genome coverage with [Bedtools](https://github.com/arq5x/bedtools2). Coverage information is saved to `output/coverage` in `.bed` files.
6. `SAMTOOLS_merge_reads` Merge aligned reads by factors with [Samtools](https://github.com/samtools/samtools). See the [merging inputs](#merging-inputs) section for details.
7. `STRINGTIE_assemble_transcripts` Assemble transcripts in each individual assembly group with [StringTie](https://github.com/gpertea/stringtie).
8. `TAGADA_filter_transcripts` Filter rare transcripts that appear in few assembly groups and poorly-expressed transcripts with low TPM values.
9. `STRINGTIE_coalesce_transcripts` or `TMERGE_coalesce_transcripts` Create a novel annotation with [StringTie](https://github.com/gpertea/stringtie) or [Tmerge](https://github.com/julienlag/tmerge). The novel annotation is saved to `output/annotation` in a `.gtf` file.
10. `FEELNC_classify_transcripts` Detect long non-coding transcripts with [FEELnc](https://github.com/tderrien/FEELnc). The annotation saved to `output/annotation` is updated with the results.
11. `STRINGTIE_quantify_expression` Quantify genes and transcripts with [StringTie](https://github.com/gpertea/stringtie). Counts and TPM matrices are saved to `output/quantification` in `.tsv` files.
12. `MULTIQC_generate_report` Aggregate quality controls into a report with [MultiQC](https://github.com/ewels/MultiQC). The report is saved to `output/control` in a `.html` file.

## Novel annotation

The novel annotation contains information from [StringTie](https://github.com/gpertea/stringtie), [Tmerge](https://github.com/julienlag/tmerge), and [FEELnc](https://github.com/tderrien/FEELnc). It is provided in gtf format with exon, transcript and gene rows. Row attributes vary depending on which tool was used to coalesce transcripts.
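To inspect these attributes, the ninth column of a GTF row can be parsed into a dict. A minimal Python sketch with a made-up example line (the attribute values follow the `LOC`/`TM` naming conventions described in this section, but the line itself is hypothetical):

```python
def parse_gtf_attributes(field):
    """Parse the 9th GTF column ('key "value"; key "value"; ...') into a dict."""
    attributes = {}
    for part in field.strip().split(';'):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(' ')
        attributes[key] = value.strip('"')
    return attributes

# Hypothetical transcript-row attributes from a tmerge-based annotation
line = 'gene_id "LOC_000001"; transcript_id "TM_000001"; ref_gene_id "."'
attrs = parse_gtf_attributes(line)
```

For this line, `attrs['gene_id']` is `'LOC_000001'` and `attrs['ref_gene_id']` is `'.'`, i.e. the transcript contains no reference transcript.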
`--coalesce-transcripts-with tmerge`

- `gene_id` All rows. The Tmerge `gene_id` starting with LOC.
- `ref_gene_id` All rows. A comma-separated list of reference annotation `gene_id` when a Tmerge transcript is made of at least one reference transcript, otherwise a dot.
- `transcript_id` Exon and transcript rows. The Tmerge `transcript_id` starting with TM, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation `transcript_id` is provided.
- `tmerge_tr_id` Exon and transcript rows. Optional. A comma-separated list of Tmerge `transcript_id` if the current `transcript_id` is from the reference annotation, to list which initial Tmerge transcripts it is made of.
- `transcript_biotype` Exon and transcript rows. Optional. The reference annotation `transcript_biotype` of the `transcript_id`.
- `feelnc_biotype` Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.
- `contains`, `contains_count`, `3p_dists_to_3p`, `5p_dists_to_5p`, `flrpm`, `longest`, `longest_FL_supporters`, `longest_FL_supporters_count`, `mature_RNA_length`, `meta_3p_dists_to_5p`, `meta_5p_dists_to_5p`, `rpm`, `spliced` Transcript rows. Attributes provided by Tmerge.
`--coalesce-transcripts-with stringtie`

- `gene_id` All rows. The StringTie `gene_id` starting with MSTRG.
- `ref_gene_id` All rows. Optional. The reference annotation `gene_id`.
- `ref_gene_name` All rows. Optional. The reference annotation `gene_name`.
- `transcript_id` Exon and transcript rows. The StringTie `transcript_id` starting with MSTRG, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation `transcript_id` is provided.
- `transcript_biotype` Exon and transcript rows. Optional. The reference annotation `transcript_biotype` of the `transcript_id`.
- `feelnc_biotype` Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.
- `exon_number` Exon rows. The StringTie `exon_number` starting from 1 within a given transcript.

## Funding

The GENE-SWitCH project has received funding from the European Union's [Horizon 2020](https://ec.europa.eu/programmes/horizon2020/) research and innovation program under Grant Agreement No 817998. This repository reflects only the listed contributors' views. Neither the European Commission nor its Agency REA are responsible for any use that may be made of the information it contains.