# CDK-Model

**Repository Path**: forest-AI/CDK-Model

## Basic Information

- **Project Name**: CDK-Model
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-23
- **Last Updated**: 2025-09-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Project Overview

This project develops and validates machine learning models that predict the risk of chronic kidney disease (CKD) progression from clinical and laboratory features. The core script `mian.py` performs data preprocessing, model training, ensemble prediction, and comprehensive visualization (ROC, calibration, confusion matrix, feature importance, risk stratification, cost-effectiveness, etc.), and automatically exports the figures.

The repository also includes sample data files and the exported figures, so that the workflow and the paper-quality figures can be reproduced.

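The ensemble step mentioned above is described in the Modeling Overview section as a simple mean of the predicted probabilities from the three base models. The following is a minimal, dependency-free sketch of that averaging, not code taken from `mian.py`: the function name `average_probabilities`, the optional `weights` argument, the example probabilities, and the 0.5 decision threshold are all illustrative assumptions.

```python
from typing import List, Optional, Sequence


def average_probabilities(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Combine per-model positive-class probabilities into one score per sample.

    model_probs[m][i] is model m's predicted probability for sample i.
    With weights=None every model counts equally (simple mean).
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_samples = len(model_probs[0])
    if any(len(p) != n_samples for p in model_probs):
        raise ValueError("all models must score the same number of samples")
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean per sample; equal weights reduce this to the simple mean.
    return [
        sum(w * p[i] for w, p in zip(weights, model_probs)) / total
        for i in range(n_samples)
    ]


# Hypothetical probabilities for three samples from three base models
# (illustrative numbers, not results from this project).
rf_probs = [0.20, 0.70, 0.90]
xgb_probs = [0.30, 0.60, 0.80]
lgbm_probs = [0.10, 0.80, 0.70]

ensemble = average_probabilities([rf_probs, xgb_probs, lgbm_probs])
labels = [int(p >= 0.5) for p in ensemble]  # 0.5 threshold is an assumption
```

With fitted scikit-learn-style classifiers, the per-model inputs would typically come from `model.predict_proba(X_test)[:, 1]`. Passing `weights` gives a weighted average, one of the ensembling variants listed under Reproducibility and Customization.
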

---

## Repository Structure

- `mian.py`: Main script for data loading, modeling, and visualization
- `ckd_complete_raw_data.csv`: Required input dataset (must contain the fields used by the script)
- `ckd_data_dictionary.csv` / `data_dictionary.txt`: Data dictionary (if the two differ, prefer the CSV)
- `study1_complete_data.csv`, `study2_complete_data.csv`: Sub-study datasets (optional)
- `ckd_complete_data_categorized.xlsx`, `ckd_numeric_summary.csv`: Summary/statistics (optional)
- `figureX_*.png`: Figures exported by the script

---

## Environment Requirements

Recommended: Python 3.9+

Main dependencies:

- numpy, pandas
- matplotlib, seaborn
- scikit-learn
- xgboost, lightgbm
- shap

Install example:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn xgboost lightgbm shap
```

---

## Data Requirements

`mian.py` reads `ckd_complete_raw_data.csv` from the working directory and expects the following columns:

- `age`, `gender` (M/F), `bmi`, `hypertension`, `diabetes`, `cvd`
- `egfr_baseline`, `serum_creatinine_baseline`, `albumin_baseline`, `upcr_baseline`
- `hemoglobin_baseline`, `sodium_baseline`, `potassium_baseline`
- `systolic_bp_baseline`, `diastolic_bp_baseline`
- `progression` (binary target: 1=progression, 0=non-progression)

Notes:

- `gender` is mapped to numeric values (M→1, F→0)
- Rows with missing values are dropped; features are standardized before training

---

## Modeling Overview

- Split: `train_test_split(test_size=0.2, stratify=y, random_state=42)`
- Scaling: `StandardScaler`
- Base learners:
  - `RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42)`
  - `xgboost.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05, random_state=42)`
  - `lightgbm.LGBMClassifier(n_estimators=300, num_leaves=50, random_state=42)`
- Ensemble: Simple mean of predicted probabilities from the three base models

---

## How to Run

1) Place `ckd_complete_raw_data.csv` in the project root.
2) Install dependencies and run:

   ```bash
   python mian.py
   ```

3) By default, the following figures are generated in the project root:
   - `figure5_model_performance.png`: Composite performance (ROC, calibration, confusion matrix)
   - `figure6_feature_importance.png`: Feature importance and interactions
   - `figure7_clinical_risk.png`: Clinical risk interpretation and stratification
   - `figure8_comprehensive_analysis.png`: Comprehensive performance analysis
   - `figure9_clinical_advantages.png`: Advanced clinical application advantages

To also generate Figures 1–4 (study flow, data pipeline, model architecture, baseline characteristics), uncomment the corresponding function calls in the `__main__` section of `mian.py`.

## Reproducibility and Customization

- Change feature set: edit `key_features` in `mian.py`
- Adjust models: tune hyperparameters or swap algorithms
- Modify ensembling: replace simple averaging with weighted averaging or stacking
- Control which figures export: comment/uncomment `plot_*` calls in `__main__`

---

## FAQ

- Error: "Please generate 'ckd_complete_raw_data.csv' first"
  - Ensure the file exists in the project root with the exact filename
  - Or add a data-generation step to produce the CSV before running
- Chinese font not displayed correctly on macOS/Linux/servers
  - The script sets `SimHei`; install a compatible font or change `plt.rcParams['font.sans-serif']` at the top of the script
- Difficulty installing `xgboost`/`lightgbm`
  - Refer to the official docs or prebuilt wheels; if unavailable, temporarily comment out those models and validate with Random Forest only

---

## License

Unless otherwise specified, this project is intended primarily for academic research. For commercial or clinical use, ensure compliance with local regulations and ethics, and contact the authors for permission.
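
Relating to the first FAQ entry and the Data Requirements section, the sketch below checks an input CSV before `mian.py` is run. It uses only the Python standard library. The helper name `check_ckd_csv` and its return format are hypothetical and are not part of `mian.py`; the column list is copied from Data Requirements. Only empty fields are treated as missing here, which is an assumption and may be narrower than what the script itself drops.

```python
import csv
from pathlib import Path

# Columns listed under "Data Requirements".
REQUIRED_COLUMNS = [
    "age", "gender", "bmi", "hypertension", "diabetes", "cvd",
    "egfr_baseline", "serum_creatinine_baseline", "albumin_baseline",
    "upcr_baseline", "hemoglobin_baseline", "sodium_baseline",
    "potassium_baseline", "systolic_bp_baseline", "diastolic_bp_baseline",
    "progression",
]


def check_ckd_csv(path="ckd_complete_raw_data.csv"):
    """Summarise whether the CSV looks usable (hypothetical helper)."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"{csv_path} not found; place it in the project root")
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing_columns = [c for c in REQUIRED_COLUMNS if c not in header]
        total = complete = 0
        for row in reader:
            total += 1
            # A row counts as complete when every required field is non-empty.
            # Assumption: blank means missing; tokens such as "NA" are not checked.
            if not missing_columns and all(
                (row[c] or "").strip() for c in REQUIRED_COLUMNS
            ):
                complete += 1
    return {
        "missing_columns": missing_columns,
        "rows": total,
        "complete_rows": complete,
    }
```

Calling `check_ckd_csv()` from the project root returns a small report; a non-empty `missing_columns` list or a low `complete_rows` count indicates that the dataset should be fixed before training.
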