# STRIDE

**Repository Path**: matrix_evolution/STRIDE

## Basic Information

- **Project Name**: STRIDE
- **Description**: STRIDE is a sequencing-depth-insensitive metric for robust comparison between sparse chromosome conformation capture data
- **Primary Language**: Python
- **License**: GPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-19
- **Last Updated**: 2026-03-19

## README

# STRIDE

## Introduction

STRIDE is a robust dissimilarity measure for chromatin conformation capture data, based on a representation that is insensitive to sequencing depth.

STRIDE can be applied to any technique that primarily outputs a contact frequency matrix, including but not limited to Hi-C and its variants, as well as HiChIP/PLAC-seq. Its robustness comes from bringing the concept of mean first passage time (MFPT) from Markov processes into Hi-C data analysis: the contact matrix is transformed into an MFPT representation that is less sensitive to sequencing depth and experimental noise while preserving biologically meaningful features.

![logo](https://gitee.com/matrix_evolution/STRIDE/raw/master/images/logo.jpg)

As shown in the figure above, as sequencing depth decreases, a large number of contact frequencies in the contact map decay to 0, resulting in missing distance information between loci. Additionally, noise levels caused by stochastic fluctuations gradually increase. Both of these issues are significantly mitigated in the MFPT representation.
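
To make the MFPT idea concrete, the sketch below computes mean first passage times for the random walk defined by a small symmetric contact matrix, using the textbook fundamental-matrix formula for ergodic Markov chains. It is an illustration only: the function name `mfpt_matrix` is hypothetical, the exact calculation and the approximation used inside STRIDE are not shown here, and the code assumes every bin has at least one contact and that all bins are connected (otherwise the matrix inverse does not exist).

```python
import numpy as np


def mfpt_matrix(contacts):
    """Mean first passage times of the random walk on a contact map.

    Textbook construction (hypothetical helper, not STRIDE's own code).
    """
    c = np.asarray(contacts, dtype=float)
    if c.ndim != 2 or c.shape[0] != c.shape[1] or not np.allclose(c, c.T):
        raise ValueError("expected a square, symmetric contact matrix")
    degree = c.sum(axis=1)
    if np.any(degree <= 0):
        raise ValueError("every bin needs at least one contact")

    n = c.shape[0]
    # Row-normalize contacts into transition probabilities P[i, j].
    p = c / degree[:, None]
    # For a symmetric contact matrix, the stationary distribution is
    # proportional to the row sums.
    pi = degree / degree.sum()
    # Fundamental matrix Z = (I - P + 1 pi^T)^-1; every row of the tiled
    # matrix equals pi.
    z = np.linalg.inv(np.eye(n) - p + np.tile(pi, (n, 1)))
    # m[i, j] = (Z[j, j] - Z[i, j]) / pi[j]; the diagonal is zero.
    return (np.diag(z)[None, :] - z) / pi[None, :]


# Toy 3-bin contact map (made-up numbers).
toy = np.array([[0.0, 4.0, 1.0], [4.0, 0.0, 2.0], [1.0, 2.0, 0.0]])
print(mfpt_matrix(toy))
```

Entry `m[i, j]` is the expected number of steps a walker starting in bin `i` needs to first reach bin `j`, so it stays finite even when the direct contact count between the two bins is 0, as long as an indirect path exists.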

## Installation

### System Requirements

The STRIDE package requires only a standard computer running a typical Linux distribution with enough RAM to support the in-memory operations. A GPU device with CUDA support is required only if GPU-accelerated computation is needed.

### Dependencies

The STRIDE software package is developed and tested on Python version 3.11.7 and depends on the following packages:

```
numpy>=1.23.4
scipy>=1.12.0
pandas>=2.2.1
joblib>=1.3.2
h5py>=3.10.0
torch>=2.10.0
```

In addition, the following optional dependencies are required for certain functions:

*   The package hic-straw (>=1.3.1) is required to read .hic format input files.
*   The pytorch package (>=2.2.1) is required for accelerated computation. Without it, the MFPT is calculated approximately, which takes longer but uses less memory.
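
Before running STRIDE, it can be useful to check which optional packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper names `optional_backends` and `cuda_ready` are hypothetical and are not part of the STRIDE package. It relies on hic-straw being importable as `hicstraw` and PyTorch as `torch`.

```python
from importlib.util import find_spec


def optional_backends():
    """Report which optional dependencies can be imported."""
    return {
        # hic-straw installs the importable module `hicstraw`.
        "hicstraw": find_spec("hicstraw") is not None,
        # PyTorch installs the importable module `torch`.
        "torch": find_spec("torch") is not None,
    }


def cuda_ready():
    """Return True only if PyTorch is installed and sees a CUDA device."""
    if find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print(optional_backends())
print("CUDA usable:", cuda_ready())
```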

### Install

The STRIDE package can be installed either through this git repository or through pip. 

* Through this git repository:

  ```shell
  conda create -n stride python=3.11
  conda activate stride
  git clone https://gitee.com/matrix_evolution/STRIDE.git
  cd STRIDE
  python setup.py install
  ```

* Through pip:

  ```shell
  pip install hicstride
  ```

Under normal network conditions, installing the STRIDE software package takes no more than five minutes on a typical desktop workstation.

## Usage

There are four major subprograms in the STRIDE package:

* **mfpt**: calculate the MFPT representation of a contact map.
* **stride**: calculate the STRIDE distance between two contact maps.
* **batch**: calculate the pairwise STRIDE distances for a batch of contact maps.
* **dcc**: perform the differential chromatin conformation (DCC) analysis (testing).

### Common command line arguments

| Short form | Long form      | Meaning                                                      | Default |
| ---------- | -------------- | ------------------------------------------------------------ | ------- |
| -d         | --device       | The device on which the calculation is executed when pytorch is available; ignored otherwise. | cpu     |
| -c         | --min-coverage | Proportion of low-coverage bins to be filtered out in the KR normalization. | 0.02    |
| -k         | --KR-tolerance | The precision of the KR normalization. The normalization stops when the standard deviation of the row sums falls below this value. | 1e-12   |
| -t         | --file-type    | The format of the input file(s) ({hic, txt}).                | hic     |
|            | --ch           | The chromosomes to process when the input file(s) are in .hic format. Multiple chromosomes can be provided as a comma-separated list. Ignored if the input format is txt. |         |
| -l         | --chr-length   | The length of the chromosome. Used only if the input format is txt; ignored otherwise. | 0       |
| -r         | --resolution   | The resolution used when the input file(s) are in .hic format. Ignored if the input format is txt. | 50000   |
| -o         | --output-dir   | The output directory. It is created automatically if it does not exist. | .       |
| -n         | --name         | The name of the project. It is used to name the output files. | STRIDE  |
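
The roles of `--min-coverage` and `--KR-tolerance` can be illustrated with a small sketch. It is not the Knight-Ruiz algorithm and not STRIDE's code: it uses a simpler damped Sinkhorn-style iteration as a stand-in, and the helper names, the way the proportion is rounded, and the iteration cap are all assumptions. What it shares with the description above is the stopping rule (standard deviation of the row sums below a tolerance) and the idea of discarding a fixed proportion of the lowest-coverage bins first.

```python
import numpy as np


def drop_low_coverage(contacts, proportion=0.02):
    """Remove the given proportion of bins with the lowest coverage."""
    c = np.asarray(contacts, dtype=float)
    coverage = c.sum(axis=1)
    n_drop = int(np.floor(proportion * len(coverage)))
    # Indices of the bins that are kept, in their original order.
    keep = np.sort(np.argsort(coverage, kind="stable")[n_drop:])
    return c[np.ix_(keep, keep)], keep


def balance(contacts, tolerance=1e-12, max_iter=10_000):
    """Rescale a symmetric matrix until its row sums agree.

    Damped Sinkhorn-style iteration, NOT the Knight-Ruiz Newton scheme.
    Requires strictly positive row sums.
    """
    c = np.asarray(contacts, dtype=float)
    scale = np.ones(c.shape[0])
    for _ in range(max_iter):
        row_sums = (c * np.outer(scale, scale)).sum(axis=1)
        # Stop once the spread of the row sums is below the tolerance.
        if row_sums.std() < tolerance:
            break
        scale /= np.sqrt(row_sums)
    return c * np.outer(scale, scale)


# Toy example: bin 3 has very low coverage and is filtered out first.
toy = np.array(
    [[4.0, 1.0, 2.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [2.0, 1.0, 5.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
)
filtered, kept = drop_low_coverage(toy, proportion=0.25)
print(kept)
print(balance(filtered, tolerance=1e-10).sum(axis=1))
```

A looser tolerance stops the iteration earlier at the cost of less evenly balanced rows, which is the trade-off the `-k` option exposes.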

### Subcommand-specific arguments

#### mfpt

| Name  | Meaning                              | Default         |
| ----- | ------------------------------------ | --------------- |
| input | The path to the input file / folder. | None (required) |

#### stride

| Name   | Meaning                                                      | Default         |
| ------ | ------------------------------------------------------------ | --------------- |
| --norm | The matrix norm used in the calculation. It must be supported by scipy.linalg.norm. | 2               |
| input1 | The path to the first input file.                            | None (required) |
| input2 | The path to the second input file.                           | None (required) |
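
The values accepted by `--norm` follow `scipy.linalg.norm`. The sketch below shows how different `ord` choices behave on the difference of two small matrices. It only illustrates the norm options: the helper `matrix_distance` is hypothetical, and whether STRIDE applies the norm to the raw difference of two MFPT matrices, or rescales them first, is an assumption not confirmed here.

```python
import numpy as np
from scipy.linalg import norm


def matrix_distance(a, b, order=2):
    """Norm of the difference between two equally shaped matrices."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")
    return float(norm(a - b, ord=order))


first = [[1.0, 2.0], [3.0, 4.0]]
second = [[1.0, 1.0], [2.0, 4.0]]
# 2: spectral norm, "fro": Frobenius, 1: max column sum, "nuc": nuclear.
for order in (2, "fro", 1, "nuc"):
    print(order, matrix_distance(first, second, order))
```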

#### batch

| Name             | Meaning                                                 | Default         |
| ---------------- | ------------------------------------------------------- | --------------- |
| -p / --n-threads | The number of threads used in loading the contact maps. | 1               |
| input            | The path to the input file / folder.                    | None (required) |

#### dcc

| Name           | Meaning                                                      | Default |
| -------------- | ------------------------------------------------------------ | ------- |
| --mfpt-file    | The path to the hdf5 file storing the MFPT transform temporarily. | /       |
| --input-hic-1  | The input .hic file of the first condition.                  | /       |
| --condition-1  | The name of the first condition.                             | /       |
| --input-hic-2  | The input .hic file of the second condition.                 | /       |
| --condition-2  | The name of the second condition.                            | /       |
| --loop-p-value | The p-value that determines whether a bin pair is considered to have significant signal enrichment relative to its neighborhood under a certain condition. | 0.01    |
| --diff-p-value | The p-value that determines whether the difference in signal between two conditions is significant at a given bin pair. | 0.01    |
| --min-dist     | The lower bound of the genomic search distance, in units of bins. | 7       |
| --max-dist     | The upper bound of the genomic search distance, in units of bins. | 500     |
| --batch-size   | Batch size for processing. It has no impact on the final output, but lowering it can speed up execution and decrease the memory footprint. | 3000    |


### Detailed command line usage

#### mfpt

```shell
usage: stride mfpt [-h] [-d DEVICE] [-c MIN_COVERAGE] [-k KR_TOLERANCE] [-t {hic,txt}] [--ch CH] [-l CHRLEN] [-r RESOLUTION] [-o OUTPUT_DIR] [-n NAME] input

Calculate the MFPT representation for a contact map.

positional arguments:
  input                 The path to the input file.

options:
  -h, --help            show this help message and exit
  -d DEVICE, --device DEVICE
                        The device on which the calculation will be executed if pytorch is available, or it will be omitted.
  -c MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        Proportion of bins with low coverage to be filtered in the KR normalization.
  -k KR_TOLERANCE, --KR-tolerance KR_TOLERANCE
                        The precision of the KR normalization. When standard deviations of row sums go below it, the normalization will be stopped.
  -t {hic,txt}, --file-type {hic,txt}
                        The format of the input file(s).
  --ch CH               The chromosomes which should be processed when the format of the input file(s) are .hic. Multiple chromosomes can be provieded through a comma seperated list. It will be
                        omitted if the input format is txt.
  -l CHRLEN, --chr-length CHRLEN
                        The length of the chromosomes. The value will be taken if the input format is txt, or it will be omitted.
  -r RESOLUTION, --resolution RESOLUTION
                        The resolution used when the format of the input file(s) are .hic. It will be omitted if the input format is txt.
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory. It will be created automatically if not exist.
  -n NAME, --name NAME  The name of the project. It will be used as the names of the output files.
```

#### stride

```shell
usage: stride stride [-h] [-d DEVICE] [-c MIN_COVERAGE] [-k KR_TOLERANCE] [-t {hic,txt}] [--ch CH] [-l CHRLEN] [-r RESOLUTION] [-o OUTPUT_DIR] [-n NAME] [--norm NORM] input1 input2

Calculate the STRIDE distances for two given contact maps.

positional arguments:
  input1                The path to the first input file.
  input2                The path to the second input file.

options:
  -h, --help            show this help message and exit
  -d DEVICE, --device DEVICE
                        The device on which the calculation will be executed if pytorch is available, or it will be omitted.
  -c MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        Proportion of bins with low coverage to be filtered in the KR normalization.
  -k KR_TOLERANCE, --KR-tolerance KR_TOLERANCE
                        The precision of the KR normalization. When standard deviations of row sums go below it, the normalization will be stopped.
  -t {hic,txt}, --file-type {hic,txt}
                        The format of the input file(s).
  --ch CH               The chromosomes which should be processed when the format of the input file(s) are .hic. Multiple chromosomes can be provieded through a comma seperated list. It will be
                        omitted if the input format is txt.
  -l CHRLEN, --chr-length CHRLEN
                        The length of the chromosomes. The value will be taken if the input format is txt, or it will be omitted.
  -r RESOLUTION, --resolution RESOLUTION
                        The resolution used when the format of the input file(s) are .hic. It will be omitted if the input format is txt.
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory. It will be created automatically if not exist.
  -n NAME, --name NAME  The name of the project. It will be used as the names of the output files.
  --norm NORM           The matrix norm used in the calculation. It must be supported by scipy.linalg.norm.
```

#### batch

```shell
usage: stride batch [-h] [-d DEVICE] [-c MIN_COVERAGE] [-k KR_TOLERANCE] [-t {hic,txt}] [--ch CH] [-l CHRLEN] [-r RESOLUTION] [-o OUTPUT_DIR] [-n NAME] [-p THREADS] input

Calculate the pairwise STRIDE distances for a batch of contact maps.

positional arguments:
  input                 The path to the input folder.

options:
  -h, --help            show this help message and exit
  -d DEVICE, --device DEVICE
                        The device on which the calculation will be executed if pytorch is available, or it will be omitted.
  -c MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        Proportion of bins with low coverage to be filtered in the KR normalization.
  -k KR_TOLERANCE, --KR-tolerance KR_TOLERANCE
                        The precision of the KR normalization. When standard deviations of row sums go below it, the normalization will be stopped.
  -t {hic,txt}, --file-type {hic,txt}
                        The format of the input file(s).
  --ch CH               The chromosomes which should be processed when the format of the input file(s) are .hic. Multiple chromosomes can be provieded through a comma seperated list. It will be
                        omitted if the input format is txt.
  -l CHRLEN, --chr-length CHRLEN
                        The length of the chromosomes. The value will be taken if the input format is txt, or it will be omitted.
  -r RESOLUTION, --resolution RESOLUTION
                        The resolution used when the format of the input file(s) are .hic. It will be omitted if the input format is txt.
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory. It will be created automatically if not exist.
  -n NAME, --name NAME  The name of the project. It will be used as the names of the output files.
  -p THREADS, --n-threads THREADS
                        The number of threads used in loading the contact maps.
```

#### dcc

```shell
usage: stride dcc [-h] [-d DEVICE] [-c MIN_COVERAGE] [-k KR_TOLERANCE] [--ch CH] [-r RESOLUTION] [-o OUTPUT_DIR]
                  [-n NAME] --mfpt-file HDFFILE --input-hic-1 INPUTFILE1 --condition-1 COND1 --input-hic-2 INPUTFILE2
                  --condition-2 COND2 [--loop-p-value LOOP_PVALUE] [--diff-p-value DIFF_PVALUE] [--min-dist MIN_DIST]
                  [--max-dist MAX_DIST] [--batch-size BATCH_SIZE]

Performing the differential chromatin conformation (DCC) analysis based on MFPT from two Hi-C libraries (testing).

options:
  -h, --help            show this help message and exit
  -d DEVICE, --device DEVICE
                        The device on which the calculation will be executed if pytorch is available, or it will be
                        omitted.
  -c MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        Proportion of bins with low coverage to be filtered in the KR normalization.
  -k KR_TOLERANCE, --KR-tolerance KR_TOLERANCE
                        The precision of the KR normalization. When standard deviations of row sums go below it, the
                        normalization will be stopped.
  --ch CH               The chromosomes which should be processed when the format of the input file(s) are .hic.
                        Multiple chromosomes can be provieded through a comma seperated list. It will be omitted if
                        the input format is txt.
  -r RESOLUTION, --resolution RESOLUTION
                        The resolution used when the format of the input file(s) are .hic. It will be omitted if the
                        input format is txt.
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory. It will be created automatically if not exist.
  -n NAME, --name NAME  The name of the project. It will be used as the names of the output files.
  --mfpt-file HDFFILE   The path to the hdf5 file storing the MFPT transform temporarily.
  --input-hic-1 INPUTFILE1
                        The input .hic file of the first condition.
  --condition-1 COND1   The name of the first condition.
  --input-hic-2 INPUTFILE2
                        The input .hic file of the second condition.
  --condition-2 COND2   The name of the second condition.
  --loop-p-value LOOP_PVALUE
                        The p-value that determines whether a bin pair is considered to have significant signal
                        enrichment relative to its neighborhood under a certain condition.
  --diff-p-value DIFF_PVALUE
                        The p-value that determines whether the difference in signal between two conditions is
                        significant at a given bin pair.
  --min-dist MIN_DIST   The lower bound of the genomic search distance, in units of bins.
  --max-dist MAX_DIST   The upper bound of the genomic search distance, in units of bins.
  --batch-size BATCH_SIZE
                        Batch size for processing. It has no impact on the final output, but lowering it can speed up
                        execution and decrease the memory footprint.
```

### Input file formats

There are two input file formats supported in STRIDE.

* The .hic format generated by Juicer Tools, which contains contact maps of multiple chromosomes and resolutions.
* The three-column text (txt) format, which records only the bin pairs with a positive number of contacts. Each row contains the start position of the first bin, the start position of the second bin, and the number of contacts between them, separated by any kind of whitespace. The position of the first bin is expected to be smaller than that of the second bin.

The batch command only accepts input in "txt" format. All contact maps to be processed must be placed in the input folder as individual files with a .txt extension. The remaining part of each filename is used as the name of the corresponding contact map.
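
As an illustration of the three-column text layout, the sketch below turns such a file into a dense symmetric matrix. The function `load_three_column_txt` is hypothetical and is not STRIDE's loader. In particular, passing the bin size explicitly as `resolution` and deriving the number of bins from `chr_length` are assumptions made for the example.

```python
import io
import math

import numpy as np


def load_three_column_txt(source, resolution, chr_length):
    """Build a dense symmetric contact matrix from three-column text.

    `source` is a path or an open text file; `resolution` and
    `chr_length` are given in base pairs.
    """
    rows = np.loadtxt(source, ndmin=2)  # splits on any whitespace
    n_bins = math.ceil(chr_length / resolution)
    first = (rows[:, 0] // resolution).astype(int)
    second = (rows[:, 1] // resolution).astype(int)
    counts = rows[:, 2]

    matrix = np.zeros((n_bins, n_bins))
    # Each row stores one entry with first bin <= second bin.
    np.add.at(matrix, (first, second), counts)
    # Mirror the off-diagonal entries to make the matrix symmetric.
    off_diag = first != second
    np.add.at(matrix, (second[off_diag], first[off_diag]), counts[off_diag])
    return matrix


# Made-up example: three 50 Kb bins on a 150 Kb chromosome.
demo = io.StringIO("0 50000 3\n50000\t100000 2\n")
print(load_three_column_txt(demo, resolution=50000, chr_length=150000))
```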

### Output format

* The output of the mfpt command is an HDF5 file named according to the value of the --name argument, stored in the specified output folder. If the input is in "txt" format, the output is saved under an object named "STRIDE" within the file. If the input is in ".hic" format, the outputs are sequentially saved under objects named after the chromosome names within the file.
* The output of the stride command is placed in the output folder as the *_score.txt file. 
* The output of the batch command is saved in the output folder as a dense distance matrix in a file named *_batch.txt. The row and column names correspond to the names of the respective contact maps. Contact maps that fail during MFPT representation calculation are automatically excluded.
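
The HDF5 output of the mfpt command can be inspected with h5py. The sketch below lists every dataset in a file together with its shape. The helper name `list_mfpt_objects` is hypothetical, and the exact file name and extension written by STRIDE are not stated above, so the path passed to it is left to the reader.

```python
import h5py


def list_mfpt_objects(path):
    """Return {object name: shape} for every dataset in an HDF5 file."""
    shapes = {}

    def record(name, obj):
        # Groups are skipped; only datasets carry matrix data.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    with h5py.File(path, "r") as handle:
        handle.visititems(record)
    return shapes
```

For a txt input, the returned dictionary would be expected to contain a single entry named "STRIDE". For a .hic input, it would contain one entry per processed chromosome. If the entry is a dataset, it can be read into a NumPy array with `handle[name][()]`.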

### Example

There are two .hic files in the "test" directory, derived from the two HindIII-digested hESC libraries in dataset GSE35156. We extracted only the segments with a resolution of 50 Kb. The stride command can be run on this test set using the following command line. 

```shell
stride stride -o demo -n STRIDE --ch chr1 -t hic --norm 2 -r 50000 GSM862723.hic GSM892306.hic
```

A "demo" directory will be created automatically and the STRIDE score of chromosome 1 calculated with all arguments set as defaults will be stored in the file "STRIDE_score.txt". Multiple chromosomes can be calculated at a time by providing a comma separated name list to the "--ch" argument. Other subcommands work in a similar way. 
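
When the same comparison is scripted from Python, the command line above can be assembled programmatically. The sketch below is a hypothetical wrapper: it only uses the options documented in this README, and the chromosome name "chr2" is an assumption about how chromosomes are named in the test files. The actual call is left commented out because it requires the `stride` executable and the input files.

```python
import shlex
import subprocess


def stride_command(input1, input2, chromosomes, output_dir="demo",
                   name="STRIDE", resolution=50000, norm="2"):
    """Assemble the argument list for `stride stride` on .hic inputs."""
    return [
        "stride", "stride",
        "-o", output_dir,
        "-n", name,
        # Several chromosomes are passed as one comma-separated value.
        "--ch", ",".join(chromosomes),
        "-t", "hic",
        "--norm", str(norm),
        "-r", str(resolution),
        input1, input2,
    ]


cmd = stride_command("GSM862723.hic", "GSM892306.hic", ["chr1", "chr2"])
print(shlex.join(cmd))
# Uncomment to run; requires `stride` on PATH and both input files.
# subprocess.run(cmd, check=True)
```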

On a workstation equipped with dual Intel Xeon Gold 6230 processors (40 cores) and 256 GiB RAM, such a calculation over all euchromosomes completes in about 5 minutes without any GPU acceleration.

## Notes

In our publication, we used a specific argument combination to calculate the STRIDE distances of the single-cell Hi-C data in batch mode. The "-c" parameter was set to 0 to disable the filtering of low-coverage bins. The "-k" parameter was set to 1e5 to accelerate the matrix balancing. The resolution ("-r") was set to 500 Kb.

## Citation

[TBC]