# GraphGen

**Repository Path**: zaneray/GraphGen

## Basic Information

- **Project Name**: GraphGen
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-29
- **Last Updated**: 2025-05-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[![stars](https://img.shields.io/github/stars/open-sciencelab/GraphGen.svg)](https://github.com/open-sciencelab/GraphGen)
[![forks](https://img.shields.io/github/forks/open-sciencelab/GraphGen.svg)](https://github.com/open-sciencelab/GraphGen)
[![open issues](https://img.shields.io/github/issues-raw/open-sciencelab/GraphGen)](https://github.com/open-sciencelab/GraphGen/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-sciencelab/GraphGen)](https://github.com/open-sciencelab/GraphGen/issues)
[![documentation](https://img.shields.io/badge/docs-latest-blue)](https://graphgen-cookbook.readthedocs.io/en/latest/)

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
📚 Table of Contents

- 📝 [What is GraphGen?](#-what-is-graphgen)
- 🚀 [Quick Start](#-quick-start)
- 📌 [Latest Updates](#-latest-updates)
- 🏗️ [System Architecture](#-system-architecture)
- 🍀 [Acknowledgements](#-acknowledgements)
- 📚 [Citation](#-citation)
- 📜 [License](#-license)

[//]: # (- 🌟 [Key Features](#-key-features))
[//]: # (- 📅 [Roadmap](#-roadmap))
[//]: # (- 💰 [Cost Analysis](#-cost-analysis))
[//]: # (- ⚙️ [Configurations](#-configurations))
## 📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our [**paper**](https://github.com/open-sciencelab/GraphGen/tree/main/resources/GraphGen.pdf) and the [best practice](https://github.com/open-sciencelab/GraphGen/issues/17) guide.

It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.

## 🚀 Quick Start

Try it on the [OpenXLab Application Center](https://g-app-center-000704-6802-aerppvq.openxlab.space) and see the [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10).

### Gradio Demo

```bash
python webui/app.py
```

![ui](https://github.com/user-attachments/assets/3024e9bc-5d45-45f8-a4e6-b57bd2350d84)

### Run from PyPI

1. Install GraphGen

   ```bash
   pip install graphg
   ```

2. Run in CLI

   ```bash
   SYNTHESIZER_MODEL=your_synthesizer_model_name \
   SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
   SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
   TRAINEE_MODEL=your_trainee_model_name \
   TRAINEE_BASE_URL=your_base_url_for_trainee_model \
   TRAINEE_API_KEY=your_api_key_for_trainee_model \
   graphg --output_dir cache
   ```

### Run from Source

1. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

2. Configure the environment

   - Create a `.env` file in the root directory

     ```bash
     cp .env.example .env
     ```

   - Set the following environment variables:

     ```bash
     # Synthesizer is the model used to construct the KG and generate data
     SYNTHESIZER_MODEL=your_synthesizer_model_name
     SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
     SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
     # Trainee is the model to be trained on the generated data
     TRAINEE_MODEL=your_trainee_model_name
     TRAINEE_BASE_URL=your_base_url_for_trainee_model
     TRAINEE_API_KEY=your_api_key_for_trainee_model
     ```

3. (Optional) To change the default generation configuration, edit `configs/graphgen_config.yaml`.

   ```yaml
   # configs/graphgen_config.yaml
   # Example configuration
   data_type: "raw"
   input_file: "resources/examples/raw_demo.jsonl"
   # more configurations...
   ```

4. Run the generation script

   ```bash
   bash scripts/generate.sh
   ```

5. Get the generated data

   ```bash
   ls cache/data/graphgen
   ```

## 📌 Latest Updates

- **2025.04.21**: We have released the initial version of GraphGen.

## 🏗️ System Architecture

See the [analysis](https://deepwiki.com/open-sciencelab/GraphGen) by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
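The knowledge-gap step described in the overview relies on the expected calibration error (ECE) metric. This README does not show how GraphGen computes it, so the following is only a minimal sketch of the textbook equal-width binned ECE, assuming each statement comes with a model confidence in [0, 1] and a correctness label. The function name `expected_calibration_error`, the `n_bins` parameter, and the sample values are hypothetical and are not taken from the GraphGen code base.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Textbook binned ECE: sum over bins of (n_b / N) * |accuracy_b - confidence_b|.

    Illustrative only; this is not GraphGen's implementation.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Per-bin sample count, summed confidence, and number of correct answers.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    ece = 0.0
    for n, conf_sum, hit in zip(counts, conf_sums, hits):
        if n == 0:
            continue  # Empty bins contribute nothing.
        ece += (n / total) * abs(hit / n - conf_sum / n)
    return ece


# Hypothetical trainee confidences on four statements, with correctness labels.
confidences = [0.95, 0.85, 0.3, 0.7]
correct = [True, False, False, True]
ece_value = expected_calibration_error(confidences, correct, n_bins=5)
print(f"ECE = {ece_value:.3f}")
```

A larger value means the model's stated confidence diverges more from its actual accuracy. A pipeline like the one described could use that gap to decide which pieces of knowledge deserve more generated QA pairs.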
### Workflow

![workflow](resources/images/flow.png)

## 🍀 Acknowledgements

- [SiliconCloud](https://siliconflow.cn) Abundant LLM API, some models are free
- [LightRAG](https://github.com/HKUDS/LightRAG) Simple and efficient graph retrieval solution
- [ROGRAG](https://github.com/tpoisonooo/ROGRAG) ROGRAG: A Robustly Optimized GraphRAG Framework

## 📚 Citation

If you find this repository useful, please consider citing our work:

```bibtex
@software{Chen_GraphGen_2025,
  author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
  month = apr,
  title = {{GraphGen}},
  url = {https://github.com/open-sciencelab/GraphGen},
  year = {2025}
}
```

## 📜 License

This project is licensed under the [Apache License 2.0](LICENSE).