# fair-crcc-get-data **Repository Path**: mirrors_crs4/fair-crcc-get-data ## Basic Information - **Project Name**: fair-crcc-get-data - **Description**: No description available - **Primary Language**: Unknown - **License**: GPL-3.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-10-28 - **Last Updated**: 2026-03-28 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Snakemake workflow: FAIR CRCC - get data [![Snakemake](https://img.shields.io/badge/snakemake-≥6.3.0-brightgreen.svg)](https://snakemake.github.io) [![GitHub actions status](https://github.com/crs4/fair-crcc-get-data/workflows/Tests/badge.svg?branch=main)](https://github.com/crs4/fair-crcc-get-data/actions?query=branch%3Amain+workflow%3ATests) A Snakemake workflow for downloading and decrypting Crypt4GH-encrypted sensitive data from the [CRC Cohort](https://www.bbmri-eric.eu/scientific-collaboration/colorectal-cancer-cohort/) [access request](https://www.bbmri-eric.eu/services/access-policies/). This is a companion to the [fair-crcc-send-data](https://github.com/crs4/fair-crcc-send-data/) workflow. ## What's the CRC Cohort? The CRC Cohort is a collection of clinical data and digital high-resolution digital pathology images pertaining to tumor cases. The collection has been assembled from a number of participating biobanks and other partners through the [ADOPT BBMRI-ERIC](https://www.bbmri-eric.eu/scientific-collaboration/adopt-bbmri-eric/) project. Researchers interested in using the data for science can file an application for access. If approved, the part of the dataset required for the planned and approved work can be copied to the requester's selected secure storage location (using this workflow). ## Usage The workflow needs **three main inputs**, which are all provided via the **configuration file**: 1. The source of the data; 2. The destination for the data; 3. The recipient's private encryption key, which will be used to decrypt the data as it's downloaded. Optionally, you can specify filters to only fetch part of the dataset. For source and destination, all the storage types supported by Snakemake can be used -- from locally mounted storage to remote S3, FTP, etc. ### Configuration file An example configuration is available under [config/example_config.yml](config/example_config.yml), while the full, documented schema is available in the file . ### Filters A list of filter rules can be specified to download only parts of the full collection. The filters follow a logic similar to rsync (for those who are familiar with that tool). * Filter rules specify an `action` (*include* or *exclude*) and a **glob expression** `pattern` for matching files. * The pattern is applied to the *original file names* (the ones with which the files would be saved in the destination storage). This is the name you'll find the in the index.tsv file without the .c4gh extension. * The filter patterns are tried in the order they are specified: * the first one that matches is applied; * if no filters match a file name, the file is included. #### Examples Download everything except "test-file_2.txt": ```yaml filters: - action: exclude pattern: "test-file_2.txt" ``` Only download files matching the specific pattern `sample_123*.tiff` ```yaml filters: - action: include pattern: "sample_123*.tiff" - action: exclude pattern: "*" ``` ### The index file In addition to the data files, the workflow will generate a file called `index.tsv`. This is an index to the dataset at the source location. Format: tab-separated fields. The fields are: 1. file name at the source (this name is a random UUID); 2. original encrypted file name; 3. checksum of the encrypted file. The checksum is used to validated the integrity of the downloaded file. Note that depending on the filters you apply, some or all of the files in the index may not be fetched. **Remember:** the filter patterns are applied to the index file names with the `.c4gh` extension removed. ## Authors If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (TBD). * Luca Pireddu