# GraphGen
**Repository Path**: zaneray/GraphGen
## Basic Information
- **Project Name**: GraphGen
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-29
- **Last Updated**: 2025-05-10
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Table of Contents
- [What is GraphGen?](#what-is-graphgen)
- [Quick Start](#quick-start)
- [Latest Updates](#latest-updates)
- [System Architecture](#system-architecture)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [License](#license)
## What is GraphGen?
GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our [**paper**](https://github.com/open-sciencelab/GraphGen/tree/main/resources/GraphGen.pdf) and [best practices](https://github.com/open-sciencelab/GraphGen/issues/17).
It first constructs a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, and prioritizes generating QA pairs that target high-value, long-tail knowledge.
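
For readers unfamiliar with expected calibration error, the sketch below shows the standard binned definition: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. This README does not say how GraphGen computes or applies the metric, so the function name, the equal-width binning, and the inputs are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    Hypothetical helper for illustration; not taken from the GraphGen code base.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only three right has an accuracy of 0.75 in that bin, so its ECE under this definition is about 0.15. A large gap of this kind is what signals poorly calibrated knowledge.
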
Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
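
To make multi-hop neighborhood sampling concrete, here is a minimal sketch that collects every node within a fixed number of hops of a seed entity using breadth-first search. GraphGen's actual sampling strategy is not described in this README; the adjacency-dict graph format, the function name, and the toy graph are assumptions.

```python
from collections import deque


def k_hop_neighborhood(graph, start, max_hops):
    """Return {node: hop distance} for every node within max_hops of start.

    `graph` is a plain adjacency dict (node -> iterable of neighbors).
    Hypothetical helper for illustration; not the GraphGen implementation.
    """
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


# Toy knowledge graph: a chain A - B - C - D.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
two_hop = k_hop_neighborhood(toy_graph, "A", 2)
```

With a two-hop limit from `A`, the subgraph contains `A`, `B`, and `C` but not `D`. A QA pair generated from that subgraph can then draw on a relation chain rather than a single edge.
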
## Quick Start
Try it on the [OpenXLab Application Center](https://g-app-center-000704-6802-aerppvq.openxlab.space), and see the [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10) for common questions.
### Gradio Demo
```bash
python webui/app.py
```

### Run from PyPI
1. Install GraphGen
```bash
pip install graphg
```
2. Run in CLI
```bash
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```
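
If you prefer to launch the CLI from a Python script, the same invocation can be sketched with `subprocess`. Only the `graphg --output_dir` command and the variable names come from the steps above; `build_env` and `run_graphgen` are hypothetical helpers, and the example assumes `graphg` is on your `PATH`.

```python
import os
import subprocess


def build_env(settings, base_env=None):
    """Return a copy of the environment with the GraphGen settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(settings)
    return env


def run_graphgen(settings, output_dir="cache"):
    """Run `graphg --output_dir <output_dir>` with the given variables set."""
    command = ["graphg", "--output_dir", output_dir]
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(command, env=build_env(settings), check=True)


# Example call (placeholders must be replaced with real values):
# run_graphgen({
#     "SYNTHESIZER_MODEL": "your_synthesizer_model_name",
#     "SYNTHESIZER_BASE_URL": "your_base_url_for_synthesizer_model",
#     "SYNTHESIZER_API_KEY": "your_api_key_for_synthesizer_model",
#     "TRAINEE_MODEL": "your_trainee_model_name",
#     "TRAINEE_BASE_URL": "your_base_url_for_trainee_model",
#     "TRAINEE_API_KEY": "your_api_key_for_trainee_model",
# })
```

Passing the variables through `env` keeps them out of your shell history and scoped to the child process.
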
### Run from Source
1. Install dependencies
```bash
pip install -r requirements.txt
```
2. Configure the environment
- Create a `.env` file in the root directory
```bash
cp .env.example .env
```
- Set the following environment variables:
```bash
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model to be trained on the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
3. (Optional) To change the default generation settings, edit `configs/graphgen_config.yaml`.
```yaml
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
```
4. Run the generation script
```bash
bash scripts/generate.sh
```
5. Get the generated data
```bash
ls cache/data/graphgen
```
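
A missing or empty variable in `.env` is an easy mistake to make in step 2. The sketch below checks a `.env` file for the six variables listed there before you run the generation script. It assumes simple `KEY=value` lines; `parse_env` and `missing_vars` are hypothetical helpers and are not part of GraphGen.

```python
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def parse_env(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


# Example usage against the file created in step 2:
# with open(".env", encoding="utf-8") as handle:
#     print(missing_vars(parse_env(handle.read())))
```

An empty list means all six variables are set. Any names returned still need values before running `scripts/generate.sh`.
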
## Latest Updates
- **2025.04.21**: We have released the initial version of GraphGen.
## System Architecture
See the [analysis](https://deepwiki.com/open-sciencelab/GraphGen) by DeepWiki for a technical overview of the GraphGen system, its architecture, and its core functionality.
## Acknowledgements
- [SiliconCloud](https://siliconflow.cn): a wide range of LLM APIs, with some models available for free
- [LightRAG](https://github.com/HKUDS/LightRAG): a simple and efficient graph retrieval solution
- [ROGRAG](https://github.com/tpoisonooo/ROGRAG): a robustly optimized GraphRAG framework
## Citation
If you find this repository useful, please consider citing our work:
```bibtex
@software{Chen_GraphGen_2025,
author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
month = apr,
title = {{GraphGen}},
url = {https://github.com/open-sciencelab/GraphGen},
year = {2025}
}
```
## License
This project is licensed under the [Apache License 2.0](LICENSE).