# pygapit

**Repository Path**: dawngogo/pygapit

## Basic Information

- **Project Name**: pygapit
- **Description**: https://github.com/Lalitgis/pygapit
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-30
- **Last Updated**: 2026-04-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# pyGAPIT: Genome Association and Prediction Integrated Tool (Python)

A complete Python reimplementation of the R [GAPIT](https://github.com/jiabowang/GAPIT) package by Jiabo Wang & Zhiwu Zhang.

Supports **all GWAS models** (GLM, MLM, CMLM, MLMM, FarmCPU, BLINK) and **genomic selection** methods (gBLUP, cBLUP, sBLUP) with the same interface as R GAPIT.

---

## Installation

```bash
pip install -e .   # from source (this repo)
```

**Dependencies** are installed automatically: `numpy`, `scipy`, `pandas`, `matplotlib`, `seaborn`, `plotly`, `scikit-learn`, `joblib`, `biopython`, `jinja2`.

---

## Quick start

```python
import pandas as pd
from pygapit import GAPIT

# Load data (same format as R GAPIT)
Y  = pd.read_csv("mdp_traits.txt", sep="\t")            # phenotype
GD = pd.read_csv("mdp_numeric.txt", sep="\t")           # numeric genotype
GM = pd.read_csv("mdp_SNP_information.txt", sep="\t")   # SNP map

# Run GWAS (BLINK = default, highest power)
result = GAPIT(Y=Y, GD=GD, GM=GM, model="BLINK", PCA_total=3)

print(result.GWAS.head())              # full GWAS results table
print(f"h² = {result.h2:.3f}")         # heritability
print(f"λ = {result.lambda_gc:.3f}")   # genomic inflation factor
print(f"QTNs = {len(result.QTNs)}")    # multi-locus hits
```

**Equivalent R code:**

```r
myGAPIT <- GAPIT(Y=myY, GD=myGD, GM=myGM, model="Blink", PCA.total=3)
```

---

## Input data formats

pyGAPIT accepts the same file formats as R GAPIT:

### Phenotype file (`Y`)

Tab-delimited. First column = Taxa names, remaining columns = trait values.
```
Taxa    EarHT   dpoll
33-16   64.75   64.5
38-11   69.12   61.0
4226    65.5    59.5
```

### Numeric genotype (`GD`) + map (`GM`)

`GD`: First column = taxa names, remaining = SNP dosages (0/1/2).

```
taxa    PZB00859.1  PZA01271.1  ...
33-16   2           0           ...
38-11   2           2           ...
```

`GM`: Three columns: SNP name, Chromosome, Position (bp).

```
SNP         Chromosome  Position
PZB00859.1  1           157104
PZA01271.1  1           1947984
```

### HapMap genotype (`G`)

Standard HapMap format with IUPAC allele codes.

```python
result = GAPIT(Y=Y, G=hapmap_df, model="BLINK")
```

---

## GWAS models

| Model     | Method type  | Uses kinship | Multi-QTN | Power   | Speed    |
|-----------|--------------|--------------|-----------|---------|----------|
| `GLM`     | Single-locus | No (PCs)     | No        | Low     | Fastest  |
| `MLM`     | Single-locus | Yes (global) | No        | Medium  | Fast     |
| `CMLM`    | Single-locus | Compressed   | No        | Medium+ | Fast     |
| `MLMM`    | Multi-locus  | Yes (global) | Yes       | High    | Moderate |
| `FarmCPU` | Multi-locus  | Pseudo-QTN   | Yes       | High    | Moderate |
| `BLINK`   | Multi-locus  | No           | Yes       | Highest | Fast     |

```python
# Run multiple models simultaneously
result = GAPIT(Y=Y, GD=GD, GM=GM, model=["GLM", "MLM", "FarmCPU", "BLINK"])
# Returns a dict keyed by "EarHT_GLM", "EarHT_MLM", etc.
```

---

## Genomic selection

```python
# gBLUP: best for polygenic traits
result = GAPIT(Y=Y, GD=GD, GM=GM, model="gBLUP")

# sBLUP: best for oligogenic traits (uses GWAS-identified QTNs)
result = GAPIT(Y=Y, GD=GD, GM=GM, model="BLINK", buspred=True)

# Access prediction results
print(result.Pred)
#     Taxa  BLUE   BLUP   PEV  gBreedingValue  Prediction
# 0  33-16  67.4  -2.65  89.3           -2.65       64.75
```

---

## Output files

When `file_output=True` (default), pyGAPIT writes to `output_dir`:

| File | Content |
|------|---------|
| `GAPIT.BLINK.EarHT.GWAS.Results.csv` | Full GWAS table: SNP, Chr, Pos, P.value, maf, effect, FDR |
| `GAPIT.BLINK.EarHT.Prediction.csv` | BLUE, BLUP, PEV, GEBV per individual |
| `GAPIT.Kinship.csv` | VanRaden kinship matrix |
| `GAPIT.PCA.csv` | PC scores per individual |
| `GAPIT.BLINK.EarHT.Manhattan.pdf` | Manhattan plot |
| `GAPIT.BLINK.EarHT.QQ.pdf` | QQ plot with λ annotation |
| `GAPIT.Kinship.pdf` | Kinship heatmap |
| `GAPIT.PCA.pdf` | 2D PCA scatter |

---

## Parameter reference

All R GAPIT parameters are supported with underscores replacing dots:

| R parameter | Python parameter | Default | Description |
|-------------|------------------|---------|-------------|
| `model` | `model` | `"BLINK"` | Model(s) to run |
| `PCA.total` | `PCA_total` | `3` | Number of PCs as covariates |
| `maf.threshold` | `maf_threshold` | `0.05` | Minimum MAF filter |
| `SNP.impute` | `SNP_impute` | `"middle"` | Missing genotype imputation |
| `file.output` | `file_output` | `True` | Write result files |
| `cutOff` | `cutOff` | Bonferroni | Significance threshold |
| `LD` | `LD` | `0.7` | LD threshold for BLINK pruning |
| `group.from` | `group_from` | `1` | Min groups for CMLM |
| `group.to` | `group_to` | n | Max groups for CMLM |
| `bin.size` | `bin_size` | `5000000` | Bin size (bp) for FarmCPU |
| `h2` | `h2` | `None` | Heritability for simulation |
| `NQTN` | `NQTN` | `None` | QTNs for simulation |
| `buspred` | `buspred` | `False` | Run GS after GWAS |

---

## Command-line interface

```bash
# Basic GWAS
pygapit --Y traits.txt --GD geno.txt --GM map.txt --model BLINK

# Multiple models, custom output directory
pygapit --Y traits.txt --GD geno.txt --GM map.txt \
        --model GLM MLM BLINK FarmCPU \
        --PCA_total 5 --output_dir results/

# Genomic prediction
pygapit --Y traits.txt --GD geno.txt --GM map.txt --model gBLUP

# Phenotype simulation
pygapit --Y traits.txt --GD geno.txt --GM map.txt \
        --model BLINK --h2 0.7 --NQTN 20
```

---

## Using individual functions

```python
import numpy as np

from pygapit import (
    vanraden_kinship, compute_pca, build_covariate_matrix,
    emma_remle, bonferroni_threshold, genomic_inflation_factor,
    glm_gwas, mlm_gwas, blink_gwas, farmcpu_gwas, gblup,
    manhattan_plot, qq_plot,
)

# Inputs are assumed to be prepared beforehand: GD_array is an (n, m) array of
# 0/1/2 dosages, y is a length-n phenotype vector, and snp_names, chromosomes,
# positions describe the m markers.

# Compute kinship
K = vanraden_kinship(GD_array)   # (n, n) VanRaden matrix

# PCA for structure control
pca = compute_pca(GD_array, n_components=3)
X0 = build_covariate_matrix(pca, n_pcs=3)

# REML variance components
remle = emma_remle(y, X0, K)
print(f"h² = {remle.h2:.3f}")

# Run BLINK GWAS
result = blink_gwas(y, X0, GD_array, max_iterations=10, ld_threshold=0.7)
lam = genomic_inflation_factor(result.p_values)
thresh = bonferroni_threshold(len(result.p_values))
sig = (result.p_values <= thresh).sum()
print(f"λ = {lam:.3f}, {sig} significant SNPs")

# Genomic prediction
gs = gblup(y, X0, K)
print(f"Prediction accuracy (r): {np.corrcoef(y, gs.prediction)[0,1]:.3f}")

# Plots
manhattan_plot(snp_names, chromosomes, positions, result.p_values,
               save_path="manhattan.pdf")
qq_plot(result.p_values, save_path="qq.pdf")
```

---

## Mathematical models

### Mixed Linear Model (MLM)

```
y = X·β + u + e
u ~ N(0, K·σ²g),   e ~ N(0, I·σ²e)
```

Variance components are estimated by **REML via EMMA** (Kang et al. 2008): a spectral decomposition of K, followed by a grid search and Brent's method to find the optimal δ = σ²e/σ²g.

**P3D approximation**: δ is estimated once from the null model and held fixed for all m SNP tests.
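The two steps above (estimate δ once, then reuse it for every marker) can be sketched with NumPy and SciPy alone. This is a minimal illustration and not pyGAPIT's implementation: the function names `reml_delta` and `p3d_scan` are hypothetical, the data are simulated, and the REML likelihood is profiled on the eigenbasis of K directly rather than through EMMA's projected decomposition.

```python
import numpy as np
from scipy import optimize, stats


def reml_delta(y, X, K, log_bounds=(-5.0, 5.0), n_grid=50):
    """Estimate delta = sigma2_e / sigma2_g by REML for y = X b + u + e."""
    n, q = X.shape
    s, U = np.linalg.eigh(K)          # spectral decomposition K = U diag(s) U'
    s = np.clip(s, 0.0, None)         # guard against tiny negative eigenvalues
    yr, Xr = U.T @ y, U.T @ X         # rotation makes the covariance diagonal

    def neg_reml(log_delta):
        w = 1.0 / (s + np.exp(log_delta))            # inverse of diag(s + delta)
        XtWX = Xr.T @ (w[:, None] * Xr)
        beta = np.linalg.solve(XtWX, Xr.T @ (w * yr))
        r = yr - Xr @ beta
        sigma2_g = np.sum(w * r**2) / (n - q)        # profiled genetic variance
        ll = -0.5 * ((n - q) * np.log(2 * np.pi * sigma2_g) - np.sum(np.log(w))
                     + np.linalg.slogdet(XtWX)[1] + (n - q))
        return -ll

    # Coarse grid over log(delta), then bounded Brent refinement around the best cell.
    grid = np.linspace(log_bounds[0], log_bounds[1], n_grid)
    i = int(np.argmin([neg_reml(g) for g in grid]))
    lo, hi = grid[max(i - 1, 0)], grid[min(i + 1, n_grid - 1)]
    res = optimize.minimize_scalar(neg_reml, bounds=(lo, hi), method="bounded")
    delta = float(np.exp(res.x))
    return delta, 1.0 / (1.0 + delta), s, U


def p3d_scan(y, X, G, delta, s, U):
    """Test each (polymorphic) SNP column of G as a fixed effect, delta held fixed."""
    n, q = X.shape
    sw = np.sqrt(1.0 / (s + delta))   # whitening weights from the null model
    yw = sw * (U.T @ y)
    Xw = sw[:, None] * (U.T @ X)
    Gw = sw[:, None] * (U.T @ G)
    df = n - q - 1
    pvals = np.empty(G.shape[1])
    for j in range(G.shape[1]):
        A = np.column_stack([Xw, Gw[:, j]])
        coef, *_ = np.linalg.lstsq(A, yw, rcond=None)
        resid = yw - A @ coef
        se = np.sqrt(resid @ resid / df * np.linalg.inv(A.T @ A)[-1, -1])
        pvals[j] = 2.0 * stats.t.sf(abs(coef[-1] / se), df)
    return pvals


# Simulated example: 200 individuals, 300 markers, one causal marker.
rng = np.random.default_rng(0)
n, m, causal = 200, 300, 7
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
freq = G.mean(axis=0) / 2
Z = G - 2 * freq
K = Z @ Z.T / (2 * np.sum(freq * (1 - freq)))   # VanRaden-style kinship
X = np.ones((n, 1))                             # intercept only
y = 1.0 * G[:, causal] + rng.normal(size=n)

delta, h2, s, U = reml_delta(y, X, K)
pvals = p3d_scan(y, X, G, delta, s, U)
print(f"delta = {delta:.3f}, h2 = {h2:.3f}, p[causal] = {pvals[causal]:.2e}")
```

Because δ is fixed, the eigendecomposition and the whitening weights are computed once, and each marker test reduces to an ordinary least-squares fit on the rotated data. That is where the speed-up of P3D comes from.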

### VanRaden Kinship (2008)

```
K = ZZ' / [2 · Σⱼ pⱼ(1-pⱼ)]
Z = GD - 1 - P,   P = 2(p - 0.5)   (centered 0/1/2 coding, i.e. Z = GD - 2p)
p = allele frequencies
```

### BLINK iteration

```
Loop until convergence:
  1. GLM-1: sort markers by p-value
            LD-prune candidates (r² > threshold)
            select cofactors by BIC minimization
  2. GLM-2: test all m markers with cofactor set as fixed effects
            → updated p-values
```

BIC = -2·logL + k·log(n). It replaces the expensive REML step used in FarmCPU.

### Henderson's MME (gBLUP)

```
[X'X   X'Z         ] [β]   [X'y]
[Z'X   Z'Z + δ·K⁻¹ ] [u] = [Z'y]

BLUP = û,   BLUE = X·β̂
PEV  = diag(C⁻¹)ᵤᵤ · σ²e
```

---

## Citation

If you use pyGAPIT, please also cite the original GAPIT papers:

- Wang J., Zhang Z. (2021) GAPIT Version 3. *Genomics, Proteomics & Bioinformatics* https://doi.org/10.1016/j.gpb.2021.08.005
- Huang M. et al. (2019) BLINK. *GigaScience* https://doi.org/10.1093/gigascience/giy154
- Liu X. et al. (2016) FarmCPU. *PLOS Genetics* https://doi.org/10.1371/journal.pgen.1005767
- Kang H.M. et al. (2008) EMMA. *Genetics* 178:1709–1723
- VanRaden P.M. (2008) Kinship. *J. Dairy Sci.* 91:4414–4423

---

## License

GPL-3.0, consistent with the original R GAPIT license.