# 🌳 [TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning](https://arxiv.org/pdf/2503.17195)
## 🚀 News

- **[5/16/2025]** 🚀 We're excited to announce the release of our paper and the initial public version of its code.

## 💡 Introduction

We introduce **TREESYNTH**, a tree-guided, subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree that recursively divides a task-specific full data space (i.e., the root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes, ensuring both distinctiveness and comprehensiveness, before synthesizing samples within each atomic subspace.
This global divide-and-synthesize method then collects the subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis.

### 🔧 Algorithm Overview

The core TreeSynth algorithm follows a systematic tree-guided approach to data synthesis:
*Figure: Pseudocode of the TreeSynth algorithm, showing the tree construction and data synthesis process.*
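To make the control flow concrete, here is a minimal, self-contained Python sketch of the partition-then-synthesize recursion. The helpers `propose_attribute`, `enumerate_values`, and `generate_samples` are hypothetical stand-ins for the LLM calls the repository's generators make; the stubs below only illustrate the interface, not the actual implementation.

```python
# A minimal sketch (not the repository's implementation) of tree-guided
# subspace partitioning: recursively split the data space along proposed
# attributes, then synthesize samples inside each atomic leaf subspace.
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    description: str                # textual description of this (sub)space
    children: list = field(default_factory=list)
    samples: list = field(default_factory=list)

def propose_attribute(space: str) -> str:
    # Stub: in TreeSynth this would be an LLM call that names an attribute
    # that meaningfully varies across the given (sub)space.
    return "topic"

def enumerate_values(space: str, attribute: str) -> list:
    # Stub: an LLM call listing mutually exclusive, exhaustive values.
    return ["value_a", "value_b"]

def generate_samples(space: str, n: int) -> list:
    # Stub: an LLM call synthesizing n samples inside the subspace.
    return [f"sample drawn from [{space}]" for _ in range(n)]

def build_and_synthesize(node: TreeNode, depth: int,
                         max_depth: int = 4,
                         samples_per_node: int = 10) -> list:
    """Recursively partition the space; synthesize in each atomic leaf."""
    if depth >= max_depth:
        node.samples = generate_samples(node.description, samples_per_node)
        return node.samples
    attribute = propose_attribute(node.description)
    dataset = []
    for value in enumerate_values(node.description, attribute):
        child = TreeNode(f"{node.description}; {attribute}={value}")
        node.children.append(child)
        dataset.extend(build_and_synthesize(child, depth + 1,
                                            max_depth, samples_per_node))
    return dataset

root = TreeNode("GSM8K-style math word problems")
data = build_and_synthesize(root, depth=0)
print(len(data))  # (values per split) ** max_depth leaves x samples_per_node
```

In the real pipeline these calls run asynchronously, governed by the concurrency settings in `DEFAULT_CONFIG` shown in the Quick Start below.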
*Figure: Schematic illustrating TreeSynth’s advantage in spatial partitioning.*

## 📊 Performance

### 🌈 Excellent Data Diversity

TreeSynth exhibits substantially better data diversity and more comprehensive coverage across various tasks and models than both human-curated datasets and peer synthetic methods.
*Figure: t-SNE visualization of datasets synthesized by LLaMA3.3-70B-Instruct for various methods across GSM8K, MATH, and Code Alpaca styles.*

### 🏆 Superior Downstream Performance

Models trained on TreeSynth data consistently outperform those trained on both human-crafted datasets and synthetic baselines across all tasks, foundation models, and generation models.
*Figure: Model performance and data diversity comparison of various methods with GPT-4o-powered data synthesis.*

### 📈 Scalable Data Synthesis with Quality Preservation

Guided by the tree structure's global view of the data space, TreeSynth effectively scales datasets while preserving data quality: downstream performance consistently improves with increased data volume, indicating strong scalability.
*Figure: Model performance trends across data scales for different methods, powered by GPT-4o.*
<details>
<summary>Click to expand performance results of other models using TreeSynth</summary>

#### Performance results using TreeSynth with LLaMA3.3-70B-Instruct for data synthesis
*Figure: Model performance and data diversity comparison of various methods with LLaMA3.3-70B-Instruct-powered data synthesis.*
*Figure: Model performance trends across data scales for different methods, powered by LLaMA3.3-70B-Instruct.*

#### Performance results using TreeSynth with Qwen2.5-72B-Instruct for data synthesis
*Figure: Model performance and data diversity comparison of various methods with Qwen2.5-72B-Instruct-powered data synthesis.*
*Figure: Model performance trends across data scales for different methods, powered by Qwen2.5-72B-Instruct.*

</details>
## ⚡ Quick Start

### 🛠️ Prerequisites

- Python 3.8+
- API access to OpenAI/Azure, or local model deployment via vLLM
- Required Python dependencies (see installation below)

### 📥 Installation

```bash
git clone https://github.com/cpa2001/TreeSynth.git
cd TreeSynth

# Install dependencies
pip install -r requirements.txt
```

### 📁 Project File Structure

```
TreeSynth/
├── config.py                                 # Configuration: API settings, generation parameters
├── main.py                                   # Main entry point: data generation workflow control
├── generator_math_async.py                   # MATH style dataset generator
├── generator_code_async.py                   # Code Alpaca style dataset generator
├── generator_gsm_async.py                    # GSM8K style dataset generator
├── generator_mental_async.py                 # SimpleToM style dataset generator
├── generator_mental_positive.py              # Positive SimpleToM style dataset generator
├── balance_mental.py                         # SimpleToM style data balancing tool
├── vllm_engine.sh                            # vLLM local deployment script
├── vllm_chat_template_llama3.1_json.jinja    # vLLM chat template
└── image/                                    # Documentation images directory
```

### ⚙️ Configuration

#### 1️⃣ Backend Setup

Edit `config.py` to configure your API backend:

**Option A: OpenAI API**

```python
BACKEND_CONFIGS = {
    "openai": {
        "api_key": "your-openai-api-key",
        "model_name": "gpt-4o",
    }
}
```

**Option B: Azure OpenAI**

```python
BACKEND_CONFIGS = {
    "azure": {
        "api_key": "your-azure-api-key",
        "model_name": "gpt-4o",
        "azure_endpoint": "https://your-endpoint.openai.azure.com/",
        "api_version": "2024-10-21",
    }
}
```

**Option C: vLLM (Local Deployment)**

```python
BACKEND_CONFIGS = {
    "vllm": {
        "api_base": "http://localhost:8000/v1",
        "api_key": "your-vllm-api-key",
        "model_name": "/path/to/model/qwen2_5-72b-instruct",
    }
}
```

For vLLM local deployment, use the provided script:

```bash
bash vllm_engine.sh
```

#### 2️⃣ Tree Generation Parameters

Configure core TreeSynth parameters in `DEFAULT_CONFIG` within `config.py`:

```python
DEFAULT_CONFIG = {
    "max_depth": 4,                        # Maximum tree depth
    "num_samples_per_node": 10,            # Samples generated per node
    "max_attribute_count": 50,             # Maximum attributes per node
    "max_sample_infinite_attribute": 50,   # Threshold for infinite attributes
    "infinite_path_samples": 10,           # Samples for infinite paths
    "max_workers": 64,                     # Concurrent workers
    "max_concurrent_requests": 64,         # Concurrent API requests
    "max_retries": 5,                      # Maximum retry attempts
}
```

#### 3️⃣ API Pool Configuration (Optional)

For high-throughput data generation, configure multiple API endpoints:

```python
# Multiple Azure endpoints for load balancing
OPENAI_API_POOL = [
    {
        "endpoint": "https://endpoint1.openai.azure.com/",
        "key": "azure-api-key1",
        "version": "2024-10-21",
        "model": "gpt-4o",
    },
    {
        "endpoint": "https://endpoint2.openai.azure.com/",
        "key": "azure-api-key2",
        "version": "2024-10-21",
        "model": "gpt-4o",
    },
]
```

### 🔄 Switching Data Generation Styles

TreeSynth supports multiple data styles. Manually modify the import statement in `main.py` (or see the runtime-selection sketch after this section):

#### 1️⃣ MATH Dataset Style

```python
# Modify import in main.py
from generator_math_async import LLMInference, APIPool, TreeNode, inject_runtime_to_tree
```

#### 2️⃣ Code Alpaca Style

```python
# Modify import in main.py
from generator_code_async import LLMInference, APIPool, TreeNode, inject_runtime_to_tree
```

#### 3️⃣ GSM8K Style

```python
# Modify import in main.py
from generator_gsm_async import LLMInference, APIPool, TreeNode, inject_runtime_to_tree
```
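If you would rather not edit the source by hand, a small wrapper along these lines (hypothetical, not part of the repository) resolves the generator module from a command-line flag; the module names come from the project structure listed earlier.

```python
# run_style.py - hypothetical convenience wrapper (not shipped with TreeSynth):
# pick the generator module at runtime instead of editing main.py's import.
import argparse
import importlib

STYLES = {
    "math": "generator_math_async",   # MATH-style problems
    "code": "generator_code_async",   # Code Alpaca-style tasks
    "gsm":  "generator_gsm_async",    # GSM8K-style word problems
}

parser = argparse.ArgumentParser(description="Select a TreeSynth data style")
parser.add_argument("--style", choices=sorted(STYLES), default="math")
args = parser.parse_args()

# Import the chosen generator and expose the same names main.py imports.
gen = importlib.import_module(STYLES[args.style])
LLMInference = gen.LLMInference
APIPool = gen.APIPool
TreeNode = gen.TreeNode
inject_runtime_to_tree = gen.inject_runtime_to_tree
```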
### ▶️ Running TreeSynth

#### Step-by-Step Guide

1. **Configure API Keys**:
   - Edit `BACKEND_CONFIGS` in `config.py`
   - Replace the API keys and endpoints with your actual configurations

2. **Select Backend**:
   - Edit lines 175-184 in `main.py`
   - Comment out unused backends and enable the one you want to use

   ```python
   # Use vLLM backend
   # asyncio.run(run_generation_process(..., backend="vllm", ...))

   # Use OpenAI backend
   asyncio.run(run_generation_process(..., backend="openai", ...))
   ```

3. **Choose Data Style**:
   - Modify the import statement on line 13 of `main.py`:

   ```python
   # Math competition problems
   from generator_math_async import LLMInference, APIPool, TreeNode, inject_runtime_to_tree

   # Or code generation
   # from generator_code_async import LLMInference, APIPool, TreeNode, inject_runtime_to_tree
   ```

4. **Run Generation**:

   ```bash
   # Execute the data generation pipeline
   python main.py
   ```

#### Output Files

The program writes files to an `output/<timestamp>/` directory:

- `generation.log` - Detailed generation logs
- `output.json` - Complete tree structure and intermediate results
- `tree_structure.txt` - Human-readable tree structure visualization
- `result.jsonl` - Final generated data samples (one JSON object per line)

### 🧩 Custom Data Domains

To create generators for new task domains, use the existing generator files as a reference:

1. **Copy an Existing Generator**: Use `generator_math_async.py` as a template
2. **Modify the Prompt Templates**: Edit the prompts in `format_dim_prompt()`, `format_expand_prompt()`, and `format_gen_prompt()`
3. **Adjust the Parsing Logic**: Modify the regex pattern matching
4. **Update Imports**: Import the new generator in `main.py`

## 🗺️ Overview of TreeSynth

TreeSynth consists of two key stages: data space partitioning and subspace data synthesis. In the first stage, TreeSynth employs a spatial partitioning tree to recursively divide a task-specific whole data space (i.e., a root node defined by textual descriptions) into numerous atomic subspaces (i.e., leaf nodes). These subspaces are characterized by mutually exclusive and exhaustive attribute values to ensure both distinctiveness and diversity. In the subsequent subspace data synthesis stage, samples are generated within each subspace separately, before being collected into a diverse and comprehensive dataset. Through this global divide-and-synthesize methodology, TreeSynth effectively prevents repetition and space collapse, ensuring the diversity and completeness of large-scale data synthesis while avoiding the drawbacks of previous methods.
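An illustrative fragment of such a partitioning tree for GSM8K-style synthesis is sketched below; the attribute names are invented for exposition and do not come from the paper or the repository's prompts.

```
Root: GSM8K-style math word problems
├── scenario = shopping
│   ├── operation = addition/subtraction      → leaf: synthesize samples here
│   └── operation = multiplication/division   → leaf: synthesize samples here
└── scenario = travel
    ├── operation = addition/subtraction      → leaf: synthesize samples here
    └── operation = multiplication/division   → leaf: synthesize samples here
```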
*Figure: A spatial partitioning tree visualization of TreeSynth, exemplified through GSM8K-style data synthesis.*

## 📚 Citation

If you find our work useful, please consider starring and citing. Thank you!

```latex
@article{wang2025treesynth,
  title={TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning},
  author={Wang, Sheng and Chen, Pengan and Zhou, Jingqi and Li, Qintong and Dong, Jingwei and Gao, Jiahui and Xue, Boyang and Jiang, Jiyue and Kong, Lingpeng and Wu, Chuan},
  journal={arXiv preprint arXiv:2503.17195},
  year={2025}
}
```