# OcBSA

**Repository Path**: Bioinformaticslab/OcBSA

## Basic Information

- **Project Name**: OcBSA
- **Description**: OcBSA, a tool specifically for QTL mapping in F1 populations. Developed by: zhanglk960127@163.com
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 13
- **Forks**: 1
- **Created**: 2023-11-05
- **Last Updated**: 2026-04-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: QTL, BSA, F1, outcross

## README

# OcBSA: an NGS-based Bulk Segregant Analysis Tool for Outcross Populations

**Please cite:** "OcBSA: an NGS-based Bulk Segregant Analysis Tool for Outcross Populations, Molecular Plant, 2024. https://doi.org/10.1016/j.molp.2024.02.011"

**Contact:** zhanglingkui@caas.cn

---

## Quick start: Windows Desktop App

A graphical desktop application is available for Windows users who prefer not to use the command line.

**[⬇ Download OcBSA.exe](http://www.bioinformaticslab.cn/files/OcBSA/)**

No Python installation required. Download `OcBSA.exe` and double-click to launch; no setup is needed.

Features:

- Point-and-click interface for OC-BSA, F2-BSA, and Visualization
- Native file browser dialogs for selecting input/output files
- Real-time output log showing analysis progress
- One click to open interactive HTML results in the browser
- All dependencies bundled (numpy, scipy, matplotlib, plotly)

> Note: the first launch may take 10–20 seconds as the bundled runtime initialises. Subsequent launches are faster.

For advanced use, primer design, or scripting workflows, see the command-line tools below.
---

## Requirements

```
python >= 3.8
numpy
scipy
matplotlib
primer3-py   # only for primer_design.py
blast        # only for primer_design.py
```

Install dependencies:

```bash
conda install numpy scipy matplotlib
conda install primer3-py   # optional
```

---

## OC_BSA.py: QTL mapping for F1 outcross populations

OC_BSA implements the **OcValue** statistic, designed for populations derived from crosses between two heterozygous parents (e.g., F1 segregating populations in outcrossing species such as potato and cassava). It identifies QTL regions by measuring allele-frequency differences between two bulk pools.

### Parameters

| Parameter | Description |
|-----------|-------------|
| `-vcf` | Path to the **raw, unfiltered** VCF file. Do NOT pre-filter the VCF. |
| `-table` | Path to a pre-formatted table file (see format below). Alternative to `-vcf`. |
| `-OcValue` | Path to an `.OcValue` intermediate file from a previous run. Use this to re-run only the sliding-window step (e.g., to try a different window size) without recalculating per-marker values. |
| `-p1` | Column number of the **dominant parent** in the VCF (1-based; negative counts from the end). |
| `-p2` | Column number of the **other parent** in the VCF. |
| `-b1` | Column number of the **pool with the dominant trait** in the VCF. |
| `-b2` | Column number of the **pool with the recessive trait** in the VCF. |
| `-d1` | Minimum sequencing depth for parents (default: **auto-calculated**). |
| `-d2` | Minimum sequencing depth for pools (default: **auto-calculated**). |
| `-d3` | Maximum sequencing depth for parents (default: **auto-calculated**). |
| `-d4` | Maximum sequencing depth for pools (default: **auto-calculated**). |
| `-w` | Sliding window size in bp (default: 1,000,000). |
| `-p` | Percentile for background significance threshold (default: 99). |
| `--min-markers` | Minimum number of informative markers required per chromosome (default: 200). |
| `--save-table` | Save a `.table` file alongside the VCF (speeds up future re-runs). |
| `--threads` | Number of parallel worker processes (default: number of CPU cores). |
| `-o` | Output file name. |

> **Tip (counting columns):** Open your VCF and count from the left, starting at 1. The fixed VCF columns are: CHROM(1) POS(2) ID(3) REF(4) ALT(5) QUAL(6) FILTER(7) INFO(8) FORMAT(9). Your first sample is column 10. You can also count from the right using negative numbers: `-1` is the last column, `-2` is second-to-last, etc.

> **Auto depth thresholds:** If any of `-d1` to `-d4` is omitted, the tool derives the missing threshold from the depth distribution of the data (mean ± 3 standard deviations). This prevents the common mistake of using fixed defaults that filter out all markers in datasets with unusually high or low coverage.

### Sample assignment verification

When running with a VCF file, the tool prints the resolved sample names before processing, for example:

```
--- Sample assignment (please verify) ---
P1    dominant parent : Solanum_tuberosum_cv_A (VCF column -2)
P2    other parent    : Solanum_tuberosum_cv_B (VCF column -1)
Pool1 dominant        : Bulk_high (VCF column -4)
Pool2 recessive       : Bulk_low (VCF column -3)
-----------------------------------------
```

**If the sample names look wrong, stop and recheck your `-p1`, `-p2`, `-b1`, `-b2` column numbers before proceeding.**

### Running OC_BSA

```bash
# IMPORTANT: The VCF must NOT be pre-filtered in any way!

# Run with a VCF file, counting columns from the end (recommended)
python OC_BSA.py -vcf potato_example.vcf -p1 -2 -p2 -1 -b1 -3 -b2 -4 \
    -w 200000 -o potato_example_200k_OcBSA.txt

# Run with a VCF file, counting columns from the start
python OC_BSA.py -vcf potato_example.vcf -p1 12 -p2 13 -b1 11 -b2 10 \
    -w 200000 -o potato_example_200k_OcBSA.txt

# Run with a pre-formatted table file
python OC_BSA.py -table potato_example.table -w 200000 \
    -o potato_example_200k_OcBSA.txt

# Re-run with a different window size (no recalculation of per-marker values)
python OC_BSA.py -OcValue potato_example_200k_OcBSA.txt.OcValue \
    -w 100000 -o potato_example_100k_OcBSA.txt

# Specify depth thresholds manually
python OC_BSA.py -vcf potato_example.vcf -p1 -2 -p2 -1 -b1 -3 -b2 -4 \
    -d1 10 -d3 80 -d2 20 -d4 200 -w 200000 -o potato_example_200k_OcBSA.txt
```

### Table file format

The table file is a tab-separated file with no mandatory header. Columns:

| chr | pos | ref | alt | P1 | P2 | Pool1 | Pool2 |
|-----|-----|-----|-----|----|----|-------|-------|
| chr1 | 12345 | A | T | 30\|2 | 0\|28 | 45\|20 | 18\|40 |

Each allele depth cell is written as `ref_count|alt_count`.

---

## F2_BSA.py: BSA for biparental populations (F2, RIL, F3, etc.)

F2_BSA provides three methods for biparental BSA analysis:

| Mode | Description | Requires parents? |
|------|-------------|-------------------|
| `-snpindex` | ΔSNP-index: difference in allele frequency between pools | Yes |
| `-ED` | Euclidean Distance (ED²): squared allele-frequency difference | Yes |
| `-ED2` | ED² computed directly from pool allele frequencies, no parental genotype filtering | **No** |

Use `-snpindex` or `-ED` when you have parental sequencing data. Use `-ED2` when you only have the two pool samples (no parents sequenced).

### Parameters

| Parameter | Description |
|-----------|-------------|
| `-snpindex` / `-ED` / `-ED2` | Select the analysis method (required, mutually exclusive). |
| `-vcf` | Path to the raw VCF file. |
| `-table` | Path to a pre-formatted table file. Alternative to `-vcf`. |
| `-infile` | Previously generated intermediate file (`.snpindex`, `.ED`, or `.ED2`). Use this to re-run only the window step. |
| `-p1` | Column of parent1 in the VCF (required for `-snpindex` and `-ED`). |
| `-p2` | Column of parent2 in the VCF (required for `-snpindex` and `-ED`). |
| `-b1` | Column of pool1 in the VCF. |
| `-b2` | Column of pool2 in the VCF. |
| `-d1` | Minimum parent depth (ignored for `-ED2`; default: auto). |
| `-d2` | Minimum pool depth (default: auto). |
| `-d3` | Maximum parent depth (ignored for `-ED2`; default: auto). |
| `-d4` | Maximum pool depth (default: auto). |
| `-w` | Sliding window size in bp (default: 1,000,000). |
| `-p` | Percentile for background significance threshold (default: 99). |
| `--min-markers` | Minimum markers per chromosome (default: 100). |
| `--save-table` | Save a `.table` file alongside the VCF. |
| `--threads` | Number of parallel worker processes (default: number of CPU cores). |
| `-o` | Output file name. |

### Sample assignment verification

Same as OC_BSA: when using a VCF file, the resolved sample names are printed before processing:

```
--- Sample assignment (please verify) ---
P1 parent1 : Parent_susceptible (VCF column -4)
P2 parent2 : Parent_resistant (VCF column -3)
Pool1      : Bulk_susceptible (VCF column -2)
Pool2      : Bulk_resistant (VCF column -1)
-----------------------------------------
```

### Table file format

**4-sample format** (for `-snpindex` / `-ED`, 8 columns):

| chr | pos | ref | alt | P1 | P2 | Pool1 | Pool2 |
|-----|-----|-----|-----|----|----|-------|-------|
| Chr10 | 21462 | T | C | 2\|29 | 41\|0 | 53\|21 | 35\|32 |

**Pool-only format** (for `-ED2`, 6 columns):

| chr | pos | ref | alt | Pool1 | Pool2 |
|-----|-----|-----|-----|-------|-------|
| Chr10 | 21462 | T | C | 53\|21 | 35\|32 |

The `-ED2` mode also accepts the 8-column format and ignores the parent columns automatically.
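
To make the two table layouts concrete, here is a minimal sketch of how a row in either format could be read. It is not taken from `F2_BSA.py`: the helper names `parse_depth`, `parse_table_row`, and `alt_frequency` are hypothetical, and the alternate-allele frequency difference is shown only as an illustration, so the tool's own filtering and statistics may differ.

```python
from typing import Dict, Tuple


def parse_depth(cell: str) -> Tuple[int, int]:
    """Split a 'ref_count|alt_count' cell into two integers."""
    ref, alt = cell.split("|")
    return int(ref), int(alt)


def parse_table_row(line: str) -> Dict[str, object]:
    """Parse one tab-separated row in the 6-column or 8-column layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 8:
        names = ("P1", "P2", "Pool1", "Pool2")   # 4-sample format
    elif len(fields) == 6:
        names = ("Pool1", "Pool2")               # pool-only format
    else:
        raise ValueError(f"expected 6 or 8 columns, got {len(fields)}")
    row: Dict[str, object] = {
        "chr": fields[0],
        "pos": int(fields[1]),
        "ref": fields[2],
        "alt": fields[3],
    }
    for name, cell in zip(names, fields[4:]):
        row[name] = parse_depth(cell)
    return row


def alt_frequency(depth: Tuple[int, int]) -> float:
    """Fraction of reads supporting the alternate allele (illustrative only)."""
    ref, alt = depth
    total = ref + alt
    if total == 0:
        raise ValueError("no reads at this site")
    return alt / total


# The example rows from the tables above, written with real tab separators.
row8 = parse_table_row("Chr10\t21462\tT\tC\t2|29\t41|0\t53|21\t35|32")
row6 = parse_table_row("Chr10\t21462\tT\tC\t53|21\t35|32")

# Difference in alternate-allele frequency between the two pools.
delta = alt_frequency(row8["Pool1"]) - alt_frequency(row8["Pool2"])
print(row8["Pool1"], row8["Pool2"], round(delta, 3))
```

Both layouts yield the same `Pool1` and `Pool2` entries, which mirrors the statement above that the parent columns can be ignored when only pool data are needed.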
### Running F2_BSA

```bash
# --- SNP-index method ---
# With VCF
python F2_BSA.py -snpindex -p1 -4 -p2 -3 -b1 -2 -b2 -1 \
    -vcf example.vcf -o example_1M_snpindex.txt

# With table
python F2_BSA.py -snpindex -table example_file/F2_test.table \
    -o example_1M_snpindex.txt

# Re-run with different window size
python F2_BSA.py -snpindex -infile example_1M_snpindex.txt.snpindex \
    -w 500000 -o example_500k_snpindex.txt

# --- ED method ---
# With VCF
python F2_BSA.py -ED -p1 -4 -p2 -3 -b1 -2 -b2 -1 \
    -vcf example.vcf -o example_1M_ED.txt

# With table
python F2_BSA.py -ED -table example_file/F2_test.table \
    -o example_1M_ED.txt

# --- ED2 method (no parental data needed) ---
# With VCF (only pool columns required)
python F2_BSA.py -ED2 -b1 -2 -b2 -1 \
    -vcf example.vcf -o example_1M_ED2.txt

# With 6-column table (no parents)
python F2_BSA.py -ED2 -table example_pools_only.table \
    -o example_1M_ED2.txt

# With 8-column table (parents present but ignored)
python F2_BSA.py -ED2 -table example_file/F2_test.table \
    -o example_1M_ED2.txt
```

---

## bsa_fig.py: Visualization

Draw Manhattan-style scatter plots from OC_BSA or F2_BSA output. Supports both static images (PNG/PDF) and interactive HTML figures.

### Requirements

```
matplotlib   # for static PNG/PDF output (already required)
plotly       # only for interactive HTML output
```

Install plotly:

```bash
pip install plotly
# or
conda install plotly
```

### Parameters

| Parameter | Description |
|-----------|-------------|
| `-f` | Input file (output of OC_BSA.py or F2_BSA.py). |
| `-OcValue` | Plot OcValue results (use with OC_BSA output). |
| `-snpindex` | Plot SNP-index / ED / ED2 results (use with F2_BSA output); also overlays Pool1 and Pool2 index lines. |
| `-ED` | Plot ED / ED2 results without individual pool lines. |
| `-p` | Zoom into a genomic region, e.g. `chr10,50000000,60000000` (static output only). |
| `-c` | Color map for the dot plot (default: `plasma_r`). See [matplotlib colormap reference](https://matplotlib.org/stable/gallery/color/colormap_reference.html). |
| `-o` | Output file: `.png` or `.pdf` for static image; `.html` for interactive figure. |
| `--genome-ticks` | Add per-chromosome coordinate ticks (10 M, 20 M…) to the static whole-genome plot x-axis. |
| `--tick-interval MB` | Tick spacing in Mb for `--genome-ticks` (default: 10). |
| `--embed-js` | Embed Plotly JS (~3 MB) in the HTML for offline use. Default: load from CDN (requires internet). |

### Static output examples

```bash
# Whole-genome plot of OcValue results
python bsa_fig.py -f potato_example_200k_OcBSA.txt -OcValue -o whole_genome.png

# Whole-genome plot with per-chromosome coordinate ticks every 20 Mb
python bsa_fig.py -f potato_example_200k_OcBSA.txt -OcValue \
    --genome-ticks --tick-interval 20 -o whole_genome_ticks.png

# Zoom into chromosome 10, 50–60 Mb
python bsa_fig.py -f potato_example_200k_OcBSA.txt -OcValue \
    -p chr10,50000000,60000000 -o chr10_zoom.png

# Plot SNP-index result
python bsa_fig.py -f example_1M_snpindex.txt -snpindex -o snpindex.png

# Plot ED2 result with a different color map
python bsa_fig.py -f example_1M_ED2.txt -snpindex -c viridis -o ed2.png

# Zoom into a region of the SNP-index result
python bsa_fig.py -f example_1M_snpindex.txt -snpindex \
    -p Chr01,1000000,16000000 -o snpindex_chr01_zoom.png
```

### Interactive HTML output

Specify `-o` with a `.html` extension to generate an interactive figure powered by [Plotly](https://plotly.com/). The HTML file can be opened directly in any modern browser; no server or Python is required.
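
Because the output type follows from the extension given to `-o`, the same plotting call can be scripted for both static and interactive figures. The sketch below is a hypothetical wrapper (`build_fig_command` is not part of OcBSA) that uses only the flags documented in the parameter table; it assumes `bsa_fig.py` sits in the current working directory.

```python
import sys
from pathlib import Path
from typing import List, Optional

STATIC_SUFFIXES = {".png", ".pdf"}
MODES = {"-OcValue", "-snpindex", "-ED"}


def build_fig_command(
    infile: str,
    mode: str,
    outfile: str,
    region: Optional[str] = None,
    embed_js: bool = False,
) -> List[str]:
    """Assemble an argument list for bsa_fig.py (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    suffix = Path(outfile).suffix.lower()
    is_html = suffix == ".html"
    if not is_html and suffix not in STATIC_SUFFIXES:
        raise ValueError("output must end in .png, .pdf, or .html")
    cmd = [sys.executable, "bsa_fig.py", "-f", infile, mode, "-o", outfile]
    if region is not None:
        # The parameter table documents -p for static output only.
        if is_html:
            raise ValueError("-p applies to static output only")
        cmd += ["-p", region]
    if embed_js:
        # --embed-js only makes sense for the interactive HTML figure.
        if not is_html:
            raise ValueError("--embed-js applies to .html output only")
        cmd.append("--embed-js")
    return cmd


cmd = build_fig_command(
    "potato_example_200k_OcBSA.txt",
    "-OcValue",
    "whole_genome_offline.html",
    embed_js=True,
)
print(" ".join(cmd[1:]))
# To run the command: import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a shell string keeps file names with spaces intact when the list is passed to `subprocess.run`.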

**Features:**

- Zoom and pan freely across the whole genome
- Chromosome dropdown to jump to a single chromosome instantly
- Hover over any point to see exact position, OcValue / index, background threshold, and marker count
- Load a GFF3 or BED annotation file directly in the browser to overlay gene models on the plot

```bash
# Interactive HTML for OcValue results (CDN Plotly, requires internet)
python bsa_fig.py -f potato_example_200k_OcBSA.txt -OcValue -o whole_genome.html

# Self-contained HTML for offline use (embeds ~3 MB of Plotly JS)
python bsa_fig.py -f potato_example_200k_OcBSA.txt -OcValue \
    --embed-js -o whole_genome_offline.html

# Interactive HTML for SNP-index results
python bsa_fig.py -f example_1M_snpindex.txt -snpindex -o snpindex.html
```

### Gene annotation overlay (HTML only)

After opening the HTML in a browser, use the **Gene Annotations** panel in the top-right corner to load a GFF3 or BED file from your local disk. Genes are parsed entirely in the browser; no data is uploaded anywhere.

**How it works:**

1. Click **Load GFF3 or BED file** and select your annotation file.
2. Select the feature type to display (gene / mRNA / CDS / all); this applies to GFF3 only.
3. Zoom into a region ≤ the configured threshold (default 15 Mb) and gene bars will appear automatically below the BSA data track.
4. Gene bars are colored by strand: blue (+), orange (−), grey (unknown).
5. Hover over a gene bar to see its ID, coordinates, and strand.
6. Click **Clear** to remove the gene overlay.

**Performance notes:** Genes are rendered lazily based on the current viewport, so the full annotation file is never drawn at once. Rendering is triggered automatically on every zoom or pan.

| Setting | Default | Description |
|---------|---------|-------------|
| Show genes when view ≤ X Mb | 15 Mb | Genes are hidden when the visible range exceeds this threshold. Increase it for smaller genomes or decrease it for better responsiveness. |
| Max genes rendered | 2000 | Hard cap on the number of gene bars drawn in any single view. Zoom in further if the cap is reached. |
| Label threshold | 200 | Gene name labels are suppressed when more than 200 genes are visible; zoom in to see individual names. |

> **Chromosome name matching:** The tool tolerates `chr`/`Chr` prefix differences between the VCF and the annotation file (e.g., `Chr10` in the plot matches `chr10` or `10` in the GFF3).

---

## primer_design.py: Primer design for candidate markers

Designs PCR primers for InDel markers in the candidate QTL region.

> **Requirements:** BLAST and the `primer3-py` Python package must be installed.
> ```bash
> conda install primer3-py
> ```

### Parameters

| Parameter | Description |
|-----------|-------------|
| `-g` | Path to the reference genome FASTA file. |
| `-OcValue` | Path to the `.OcValue` file (output of OC_BSA.py). |
| `-i` | Target genomic region, e.g. `chr11,0,10000`. |
| `-f` | Output folder. |
| `-o` | Output file name (default: `output.primer.extracted`). |
| `-n` | Number of candidate primer pairs to pick (default: 10). |
| `-k` | Flanking length around each InDel (default: 200). |
| `-s` | Shortest acceptable primer length (default: 18). |
| `-O` | Optimal primer length (default: 20). |
| `-l` | Longest acceptable primer length (default: 24). |
| `-S` | Shortest acceptable PCR product (default: 70). |
| `-L` | Longest acceptable PCR product (default: 200). |
| `-m` | Minimum primer Tm in °C (default: 50). |
| `-x` | Maximum primer Tm in °C (default: 65). |
| `-M` | Minimum primer GC% (default: 35). |
| `-X` | Maximum primer GC% (default: 65). |
| `-D` | Maximum acceptable Tm difference within a primer pair (default: 0.5). |

### Example

```bash
python primer_design.py -g reference_genome.fa \
    -OcValue potato_example_200k_OcBSA.txt.OcValue \
    -i chr10,56000000,57000000 \
    -f ./primer_output/
```

---

## cluster_vcf.py: Build pools from individual sequencing data

If you have individual-sample VCF files and want to merge them into pool-level allele depth tables (simulating pooled sequencing), use this tool.

```bash
python cluster_vcf.py -c individual_sequencing.vcf \
    -f pool_config.txt -o out_table.txt
```

### pool_config.txt format

Define which individuals belong to each pool. `B1` = Pool1, `B2` = Pool2, `P1` = Parent1, `P2` = Parent2.

```
B1:sampleID1,sampleID2,sampleID3,sampleID4,sampleID5
B2:sampleID6,sampleID7,sampleID8,sampleID9,sampleID10
P1:sampleID_P1
P2:sampleID_P2
```

---

## Windows desktop app: details

**Download:** [http://www.bioinformaticslab.cn/files/OcBSA/](http://www.bioinformaticslab.cn/files/OcBSA/)

`OcBSA.exe` is a self-contained executable for Windows 10 / 11. All Python dependencies (numpy, scipy, matplotlib, plotly) are bundled, so no Python or conda installation is needed.

> **First launch:** Windows may show a SmartScreen warning for unsigned executables. Click "More info → Run anyway" to proceed.

To use the command-line scripts directly, install the required packages as described in the [Requirements](#requirements) section above.
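
Before running the command-line scripts, it can help to confirm that the packages from the Requirements section are visible to the current interpreter. The following is a small sketch and not part of OcBSA; it assumes that `primer3-py` is imported under the module name `primer3` and that a BLAST installation provides an executable called `blastn`.

```python
import importlib.util
import shutil
import sys
from typing import Dict

# Packages listed as required for OC_BSA.py, F2_BSA.py and static plots.
REQUIRED = ["numpy", "scipy", "matplotlib"]
# Optional packages and the feature that needs them.
OPTIONAL = {
    "plotly": "interactive HTML output in bsa_fig.py",
    "primer3": "primer_design.py (assumed import name of primer3-py)",
}


def check_environment() -> Dict[str, bool]:
    """Report which requirements are available without importing them."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for module in REQUIRED + list(OPTIONAL):
        report[module] = importlib.util.find_spec(module) is not None
    # Assumed executable name; only primer_design.py needs BLAST.
    report["blastn"] = shutil.which("blastn") is not None
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name:12s} {'ok' if ok else 'missing'}")
```

Missing optional entries only matter for the features named in `OPTIONAL`; the core analyses need the three required packages.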