# CDK-Model

**Repository Path**: forest-AI/CDK-Model

## Basic Information

- **Project Name**: CDK-Model
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-23
- **Last Updated**: 2025-09-23

## README

## Project Overview

This project develops and validates machine learning models to predict chronic kidney disease (CKD) progression risk using clinical and laboratory features. The core script `mian.py` handles data preprocessing, model training, and ensemble prediction, then generates and automatically exports the visualizations (ROC, calibration, confusion matrix, feature importance, risk stratification, cost-effectiveness, etc.).

The repository also includes sample data files and the exported figures, so the workflow and the paper-quality figures can be reproduced.

---

## Repository Structure

- `mian.py`: Main script for data loading, modeling, and visualization
- `ckd_complete_raw_data.csv`: Required input dataset (must contain fields used by the script)
- `ckd_data_dictionary.csv` / `data_dictionary.txt`: Data dictionary (if the two files differ, treat the CSV as authoritative)
- `study1_complete_data.csv`, `study2_complete_data.csv`: Sub-study datasets (optional)
- `ckd_complete_data_categorized.xlsx`, `ckd_numeric_summary.csv`: Summary/statistics (optional)
- `figureX_*.png`: Figures exported by the script

---

## Environment Requirements

Recommended: Python 3.9+

Main dependencies:

- numpy, pandas
- matplotlib, seaborn
- scikit-learn
- xgboost, lightgbm
- shap

Install example:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn xgboost lightgbm shap
```

---

## Data Requirements

`mian.py` reads `ckd_complete_raw_data.csv` from the working directory and expects the following columns:

- `age`, `gender` (M/F), `bmi`, `hypertension`, `diabetes`, `cvd`
- `egfr_baseline`, `serum_creatinine_baseline`, `albumin_baseline`, `upcr_baseline`
- `hemoglobin_baseline`, `sodium_baseline`, `potassium_baseline`
- `systolic_bp_baseline`, `diastolic_bp_baseline`
- `progression` (binary target: 1=progression, 0=non-progression)

Notes:

- `gender` is mapped to numeric values (M→1, F→0)
- Rows with missing values are dropped; features are standardized before training
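The column check and the two notes above can be sketched as follows. This is a minimal illustration, not the code in `mian.py`: the helper name `prepare_frame` and the constant `REQUIRED_COLUMNS` are hypothetical, and standardization is left out because it belongs with the train/test split.

```python
import pandas as pd

# Column names are copied from the list above; the names of this constant
# and of the helper below are illustrative, not taken from mian.py.
REQUIRED_COLUMNS = [
    "age", "gender", "bmi", "hypertension", "diabetes", "cvd",
    "egfr_baseline", "serum_creatinine_baseline", "albumin_baseline",
    "upcr_baseline", "hemoglobin_baseline", "sodium_baseline",
    "potassium_baseline", "systolic_bp_baseline", "diastolic_bp_baseline",
    "progression",
]


def prepare_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Validate columns, encode gender (M->1, F->0), and drop incomplete rows."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    out = df[REQUIRED_COLUMNS].copy()
    # Values other than "M"/"F" become NaN and are removed by dropna below.
    out["gender"] = out["gender"].map({"M": 1, "F": 0})
    return out.dropna().reset_index(drop=True)
```

With the dataset in the working directory, a call such as `prepare_frame(pd.read_csv("ckd_complete_raw_data.csv"))` would return the cleaned table, or raise a `ValueError` naming any absent column.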

---

## Modeling Overview

- Split: `train_test_split(test_size=0.2, stratify=y, random_state=42)`
- Scaling: `StandardScaler`
- Base learners:
  - `RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42)`
  - `xgboost.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05, random_state=42)`
  - `lightgbm.LGBMClassifier(n_estimators=300, num_leaves=50, random_state=42)`
- Ensemble: Simple mean of predicted probabilities from the three base models
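The steps listed above might fit together roughly as shown below. This is a sketch under stated assumptions rather than the script itself: the data is synthetic (`make_classification`), the scaler is fitted on the training split only, the 0.5 decision threshold is a placeholder, and `xgboost`/`lightgbm` are treated as optional so the example still runs when they are not installed. The hyperparameters are the ones listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CKD table (15 features, binary target).
X, y = make_classification(n_samples=400, n_features=15, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumption: fit the scaler on the training split, then reuse it on the test split.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "rf": RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42),
}
# The boosted models are added only if their packages import cleanly.
try:
    from xgboost import XGBClassifier
    models["xgb"] = XGBClassifier(
        n_estimators=300, max_depth=5, learning_rate=0.05, random_state=42
    )
except (ImportError, OSError):
    pass
try:
    from lightgbm import LGBMClassifier
    models["lgbm"] = LGBMClassifier(n_estimators=300, num_leaves=50, random_state=42)
except (ImportError, OSError):
    pass

# Column 1 of predict_proba is the predicted probability of class 1.
probas = []
for name, model in models.items():
    model.fit(X_train_s, y_train)
    probas.append(model.predict_proba(X_test_s)[:, 1])

# Simple mean across whichever base models were available.
ensemble_proba = np.mean(probas, axis=0)
ensemble_pred = (ensemble_proba >= 0.5).astype(int)  # placeholder threshold
```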

---

## How to Run

1) Place `ckd_complete_raw_data.csv` in the project root.

2) Install dependencies and run:

```bash
python mian.py
```

3) By default, the following figures are generated in the project root:

- `figure5_model_performance.png`: Composite performance (ROC, calibration, confusion matrix)
- `figure6_feature_importance.png`: Feature importance and interactions
- `figure7_clinical_risk.png`: Clinical risk interpretation and stratification
- `figure8_comprehensive_analysis.png`: Comprehensive performance analysis
- `figure9_clinical_advantages.png`: Advanced clinical application advantages

To also generate Figures 1–4 (study flow, data pipeline, model architecture, baseline characteristics), uncomment the corresponding function calls in the `__main__` section of `mian.py`.

---

## Reproducibility and Customization

- Change feature set: edit `key_features` in `mian.py`
- Adjust models: tune hyperparameters or swap algorithms
- Modify ensembling: replace simple averaging with weighted averaging or stacking
- Control which figures export: comment/uncomment `plot_*` calls in `__main__`
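As one example of the ensembling change mentioned above, weighted averaging could look like the sketch below. The function name `weighted_average` and the probability values are hypothetical; how to choose the weights (for instance from validation performance) is left open.

```python
import numpy as np


def weighted_average(probas, weights):
    """Combine per-model probability vectors using non-negative weights."""
    probas = np.asarray(probas, dtype=float)    # shape: (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)  # shape: (n_models,)
    if probas.ndim != 2 or weights.shape != (probas.shape[0],):
        raise ValueError("expected one weight per model")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    # np.average divides by the sum of the weights, so they need not sum to 1.
    return np.average(probas, axis=0, weights=weights)


# Hypothetical predicted probabilities from three models for four patients.
rf = [0.2, 0.7, 0.4, 0.9]
xgb = [0.3, 0.6, 0.5, 0.8]
lgbm = [0.1, 0.8, 0.6, 0.7]

equal = weighted_average([rf, xgb, lgbm], [1, 1, 1])   # identical to the simple mean
tilted = weighted_average([rf, xgb, lgbm], [2, 1, 1])  # the first model counts twice
```

Equal weights reproduce the simple mean, so this form is a drop-in generalization of the default behavior.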

---

## FAQ

- Error: "Please generate 'ckd_complete_raw_data.csv' first"
  - Ensure the file exists in the project root with the exact filename
  - Or add a data-generation step to produce the CSV before running
- Chinese font not displayed correctly on macOS/Linux/servers
  - The script sets `SimHei`; install a compatible font or change `plt.rcParams['font.sans-serif']` at the top of the script
- Difficulty installing `xgboost`/`lightgbm`
  - Refer to official docs or prebuilt wheels; if unavailable, temporarily comment out those models and validate with Random Forest only
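For the font issue above, one possible approach is to pick the first installed font from a list of candidates instead of hard-coding `SimHei`. This is a sketch, not the script's current behavior: the candidate list and the helper `pick_font` are assumptions, and which fonts are installed varies by system.

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Candidate CJK-capable fonts (assumed list; adjust to what the machine has).
CANDIDATES = ["SimHei", "PingFang SC", "Noto Sans CJK SC", "WenQuanYi Micro Hei"]


def pick_font(candidates, fallback="DejaVu Sans"):
    """Return the first candidate known to matplotlib, else the fallback."""
    installed = {f.name for f in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    # DejaVu Sans ships with matplotlib but has no CJK glyphs, so Chinese
    # labels will still be missing if this branch is reached.
    return fallback


chosen = pick_font(CANDIDATES)
plt.rcParams["font.sans-serif"] = [chosen]
# Use the ASCII hyphen for minus signs, which some CJK fonts handle better.
plt.rcParams["axes.unicode_minus"] = False
```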

---

## License

Unless otherwise specified, this project is intended primarily for academic research. For commercial or clinical use, ensure compliance with local regulations and ethics, and contact the authors for permission.