# Multimodal-cfDNA-AI

**Repository Path**: ByteDance/Multimodal-cfDNA-AI

## Basic Information

- **Project Name**: Multimodal-cfDNA-AI
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-27
- **Last Updated**: 2026-01-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Multimodal-cfDNA-cfAI

**Author:** Song Liyang — [SonglyPKU@163.com](mailto:songlyPKU@163.com)
**License:** Apache-2.0

---

## Overview

**cfAI** is a Transformer-based framework that represents each cfDNA (cell-free DNA) fragment as a tokenized, multimodal vector and uses a MultiTask Transformer to score ctDNA (circulating tumor DNA) likelihood at the molecule, gene, and sample levels. The model was developed to 1) profile multi-omic cross-talk at single-cfDNA-molecule resolution, 2) increase the signal-to-noise ratio (SNR) through ctDNA enrichment, and 3) improve early cancer screening and liquid biopsy performance. cfAI achieved ~10-fold enrichment of cancer-derived signal over background noise and reached strong multi-cancer discrimination performance.

Key highlights:

- **Single-molecule multi-omics & vectorization:** each cfDNA fragment is tokenized and vectorized across methylation, fragmentomics, end-motifs, histone-mark proxies, gene semantics, and 3D features.
- **Genome2Vec annotation:** prebuilt vectorized genomic and epigenomic annotations provide semantic context attached to reads.
- **Transformer-based modeling:** a MultiTask Transformer integrates cross-modal signals to produce molecule-, gene-, and sample-level scores.

This repository provides the code for processing NGS reads, performing cfDNA annotation, and running model pretraining and inference. It also includes the prebuilt vector annotation database and the trained model.

---

## Repository layout (key files)

```
Multimodal-cfDNA-cfAI/
├─ LICENSE
├─ README.md
├─ paper/
├─ src/
│  ├─ bam2vec.py
│  ├─ reads_anno.py
│  ├─ data_preparing.py
│  └─ ai_dev.py
├─ embeds/
├─ configs/
├─ models/
└─ examples/
   ├─ sample_info.tsv
   └─ example.bam
```

---

## 0. Prepare environment

cfAI runs on `Python 3.9`. Install dependencies including `pysam`, `numpy`, `pandas`, `bedtools`, `pybedtools`, `torch`, `umap-learn`, `scikit-learn`, etc. Package versions follow the provided YAML file. `torch` and `pytorch-cuda` are required; choose versions matching your GPU. Multi-GPU training is supported.

## 1. Convert cfDNA NGS reads to a multi-omics reads file

This step converts paired-end BAM files into a structured BED6+ file that contains the per-read multi-omic features used downstream by Genome2Vec and the model pipeline.

**Convert BAM -> BED6+**

```bash
python src/bam2vec.py -i data/sample.bam -r data/hg38.fa -o output/sample_bed/ --motif_len 4 --min_mapq 30 --unmeth_clip 34
```

`bam2vec.py` extracts: genomic coordinates, insert size, strand, base one-hot (A/T/C/G), mismatches/indels (E), CpG methylation (M/U from the XM tag), and 5'/3' end motifs. The output is a tabular BED-like file with the following 15 columns: `chr, start, end, read_name, insert_size, strand, A, T, C, G, E, M, U, motif_up, motif_down`. A minimal sketch for loading this table is shown after the requirements list.

Requirements:

- Coordinate-sorted and indexed BAM
- Bismark-compatible bisulfite XM tags when methylation features are expected
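For a quick sanity check of the per-read table, the snippet below loads the BED6+ output into pandas using the 15 documented column names and prints a few summary statistics. It is a minimal sketch, not part of the pipeline: it assumes a headerless, tab-separated file at `output/sample_bed/sample.bed` (matching the command above) and that `M`/`U` are per-read counts of methylated/unmethylated CpG calls; adjust the path and options if your output differs.

```python
# Minimal sketch: inspect the BED6+ table produced by bam2vec.py.
# Assumes a headerless, tab-separated file with the 15 documented columns.
import pandas as pd

BED_COLUMNS = [
    "chr", "start", "end", "read_name", "insert_size", "strand",
    "A", "T", "C", "G", "E", "M", "U", "motif_up", "motif_down",
]

reads = pd.read_csv(
    "output/sample_bed/sample.bed",  # path used in the example command above
    sep="\t",
    header=None,
    names=BED_COLUMNS,
)

# Basic summaries: fragment count, insert-size distribution, and the
# fraction of reads carrying at least one methylated CpG call.
print(f"fragments: {len(reads)}")
print(reads["insert_size"].describe())
print("reads with >=1 methylated CpG:", (reads["M"] > 0).mean())
```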
---

## 2. Annotate reads with Genome2Vec embeddings

`reads_anno.py` maps reads to the Genome2Vec annotation database stored in `embeds/` and appends contextual embeddings/values to each read. Genome2Vec is a collection of vectorized genomic and epigenomic annotations designed to provide compact semantic context for each cfDNA fragment.

**Genome2Vec contents (files stored in `embeds/`)**

| Feature | Filename | Resolution / Notes | Description |
| --- | --- | --- | --- |
| Gene name & coordinates | `gene_name.bed` | gene TSS / loci | Nearest gene name, strand and distance to TSS. Optionally attach scGPT 512-d gene embedding. |
| Chromatin state (UMAP) | `chromHMM_200bp_UMAPembed.bed` | 200 bp; 4-d UMAP | chromHMM emission matrix reduced via UMAP → 4-dim embedding per state. |
| Insulation score (INS) | `40k_is.sort.bed` | 40 kb | Local insulation metric. |
| Directionality index (DI) | `40k_di.sort.bed` | 40 kb | DI for TAD/interaction directionality. |
| FIRE (frequently interacting regions) | `40k_fire.sort.bed` | 40 kb | FIRE score per bin. |
| A/B compartment | `250k_hesc_ab.sort.bed` | 250 kb | A/B compartment call and score (hESC). |
| Hi-C 3D coordinates | `20k_hic.sort.bed` | 20 kb / diploid | Mat/pat 3D coordinates from Hi-C/Dip-C processed files. |

**Usage**

`reads_anno.py` appends annotation fields such as `near_gene_name, near_gene_strand, dist_TSS, chromHMM_name, chromHMM_UMAPemb_1..4, is_value, di_value, fi_value, ab_value, hic_mat{x,y,z}, hic_fat{x,y,z}` and, if enabled, scGPT gene embedding columns (e.g. `scGPT_emb_1..512`).

Run:

```bash
python src/reads_anno.py -i output/sample_bed/sample.bed -r embeds -o output/anno/
```

The annotated outputs of multiple samples can then be processed with `data_preparing.py` for batch data handling, including filtering, feature calculation, standardization, and metadata integration.

Run:

```bash
python src/data_preparing.py -s examples/sample_info.tsv -i output/anno/ -o data/prep/
```

`data_preparing.py` performs cfDNA filtering by TSS distance (default `dist_TSS ∈ [-8192, 8192]`), methylation ratio calculation, and feature standardization.
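As an illustration of the preprocessing that `data_preparing.py` describes, the standalone sketch below applies the default TSS-distance filter, derives a per-read methylation ratio from the `M`/`U` counts, and z-score standardizes a few numeric features. This is not the script's actual code: the input/output paths, the assumption of a tab-separated file with a header, and the choice of columns to standardize are illustrative only.

```python
# Standalone sketch of the preprocessing steps described for data_preparing.py:
# TSS-range filtering, methylation-ratio calculation, and feature standardization.
# Paths and the standardized column subset are assumptions, not the pipeline defaults.
import pandas as pd
from sklearn.preprocessing import StandardScaler

anno = pd.read_csv("output/anno/sample.anno.tsv", sep="\t")  # hypothetical annotated table

# 1) Keep reads whose nearest-TSS distance falls in the default window.
anno = anno[anno["dist_TSS"].between(-8192, 8192)].copy()

# 2) Per-read CpG methylation ratio from methylated (M) / unmethylated (U) counts.
cpg_total = anno["M"] + anno["U"]
anno["meth_ratio"] = (anno["M"] / cpg_total).where(cpg_total > 0, 0.0)

# 3) Z-score standardize a few numeric features (illustrative subset).
numeric_cols = ["insert_size", "is_value", "di_value", "fi_value", "ab_value"]
anno[numeric_cols] = StandardScaler().fit_transform(anno[numeric_cols])

anno.to_csv("data/prep/sample.prep.tsv", sep="\t", index=False)
```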
---

## 3. Model training & inference (`ai_dev.py`)

`ai_dev.py` contains the full training, resume, and test logic. Configuration (paths and hyperparameters such as `d_model`, `nhead`, `num_layers`, `seq_length`, `batch_size`, `lr`, `dna_length`, `proj_dims`, etc.) is defined at the top of `ai_dev.py` in the `Config` class; edit those fields directly in the script before large runs.

Commands:

```bash
# train
python src/ai_dev.py train

# resume (provide --ckpt)
python src/ai_dev.py resume --ckpt models/ckpt_10000.pt

# test / inference
python src/ai_dev.py test --ckpt models/best.pt
```

Outputs and logs are controlled by paths set in `ai_dev.py::Config` (e.g. `checkpoint_dir`, `log_dir`). Test mode produces `test_predictions.csv` (location controlled by `Config`) with per-batch predictions and columns including `sample_id, gene, prdicted_origin, origin_confidence, health, reads_scores, last layer embeds`. `reads_scores` holds the tumor-derived prediction scores, `prdicted_origin` holds the predicted tissue of origin of the cancer, and `last layer embeds` holds the hidden representation of the batch of cfDNA fragments at the corresponding gene.

---

## Credits and citation

cfAI was written by Song Liyang at ByteDance. Follow the author on LinkedIn: [https://www.linkedin.com/in/liyang-song/](https://www.linkedin.com/in/liyang-song/).

Please cite the work:

> Song Liyang et al., *Multimodal AI for Single cfDNA Profiling and Cancer Screening* (manuscript).

---

## Contributing & contact

Please open issues or pull requests for bugs or feature requests. For data requests and direct correspondence, contact [SonglyPKU@163.com](mailto:songlypku@163.com).

---

## License

Apache License 2.0