# ParallelComp

**Repository Path**: 910024445/ParallelComp

## Basic Information

- **Project Name**: ParallelComp
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-26
- **Last Updated**: 2025-06-26

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

[![arXiv](https://img.shields.io/badge/arXiv-2502.14317-b31b1b.svg)](https://arxiv.org/abs/2502.14317)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

This repository contains the official implementation of **ParallelComp**, a novel training-free method for long-context extrapolation that extends the context length of Large Language Models (LLMs) from 8K to 128K tokens while maintaining high throughput and preserving perplexity.

## 📄 Paper

**ParallelComp: Parallel Long-Context Compressor for Length Extrapolation**

*Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong*

📖 [Paper Link](https://arxiv.org/abs/2502.14317)

## 🚀 Key Features

- **Training-free**: No costly fine-tuning required for length extrapolation
- **High Performance**: Achieves 91.17% of GPT-4's performance on long-context tasks with an 8B model
- **Scalable**: Extends the context length from 4K to 128K tokens
- **Efficient**: Integrates seamlessly with Flash Attention
- **Fast**: 23.50x acceleration in the prefilling stage and a 1.76x improvement in chunk throughput
- **Memory Efficient**: Manages ultra-long contexts on a single 80GB A100 GPU

## 🛠️ Installation

### Requirements

- Python 3.9+
- PyTorch 2.5.1
- CUDA-compatible GPU(s)
- Transformers 4.43.2
- An 80GB A100 GPU for ultra-long contexts

### Setup

1. Clone the repository:

```bash
git clone https://github.com/your-username/ParallelComp.git
cd ParallelComp
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```
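Optionally, you can sanity-check the environment before running any evaluation scripts. This is not part of the official setup, just a minimal check that the expected library versions and a CUDA-capable GPU are visible:

```bash
# Optional sanity check (not part of the official setup):
# print library versions and GPU availability.
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available(), torch.cuda.device_count())"

# List visible GPUs and their memory.
nvidia-smi --query-gpu=name,memory.total --format=csv
```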
For development, regenerate the pinned dependencies from `requirements.in`:

```bash
pip-compile requirements.in
pip install -r requirements.txt
```

## 📊 Usage

### Quick Start

#### Single GPU Evaluation

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
    --parallel_pattern parallel_comp --gpu_nums 1_0 \
    --kv_cache_eviction false --capacity 512 \
    --kv_cache_dynamic false --stage_eviction false \
    --recent_token 8 \
    --topk_windows -3 --query_rank true \
    --query_recent_tokens 0 --reduce_factor 0 \
    --calibration_stage None --calibration_mode 0 \
    --special_token true \
    --model meta-llama/Llama-2-7b-chat-hf
```

#### Multi-GPU Evaluation

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash run_test_longbench_multi_gpu_window8_llama.sh \
    --parallel_pattern parallel_comp --gpu_nums 4 \
    --kv_cache_eviction false --capacity 512 \
    --kv_cache_dynamic false --stage_eviction false \
    --recent_token 8 \
    --topk_windows -3 --query_rank true \
    --query_recent_tokens 0 --reduce_factor 0 \
    --calibration_stage None --calibration_mode 0 \
    --special_token true \
    --model meta-llama/Llama-2-7b-chat-hf
```

#### Batch Evaluation

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
    --parallel_pattern parallel_comp_batches --gpu_nums 1_0 \
    --kv_cache_eviction false --capacity 512 \
    --kv_cache_dynamic false --stage_eviction false \
    --recent_token 8 \
    --topk_windows -3 --query_rank true \
    --query_recent_tokens 0 --reduce_factor 0 \
    --calibration_stage None --calibration_mode 0 \
    --special_token true \
    --model meta-llama/Llama-2-7b-chat-hf
```

### Supported Models

- **LLaMA** family models
- **Qwen2.5** models

### Supported Datasets

#### Long Context Benchmarks

- **LongBench**: narrativeqa, qasper, multifieldqa_en, hotpotqa, 2wikimqa, musique, gov_report, qmsum, multi_news, trec, triviaqa, samsum, passage_count, passage_retrieval_en, lcc, repobench-p
- **InfiniteBench**: passkey, number_string, kv_retrieval, math_find, code_debug, longbook_choice_eng, longdialogue_qa_eng

## 📈 Evaluation Scripts

### Evaluation Metrics

Calculate metrics for evaluation results:

```bash
# For LongBench results
bash scripts/longbench_metrics.sh --results_dir ./results --new_method your_method_name --switch true

# For InfiniteBench results
bash scripts/infinitebench_metrics.sh --results_dir ./results --new_method your_method_name --switch true
```

### GPU Configuration

Multi-GPU setups use `accelerate` with the YAML configs in `scripts/`:

```bash
# 4-GPU setup
accelerate launch --config_file scripts/gpu_4.yaml your_script.py

# 8-GPU setup
accelerate launch --config_file scripts/gpu_8.yaml your_script.py
```

Available configurations: `gpu_1.yaml`, `gpu_2.yaml`, `gpu_3.yaml`, `gpu_4.yaml`, `gpu_5.yaml`, `gpu_6.yaml`, `gpu_7.yaml`, `gpu_8.yaml`

## 🔧 Advanced Configuration

### Attention Calibration

ParallelComp includes attention calibration strategies to mitigate attention sink issues:

#### Single GPU Evaluation with Calibration

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
    --parallel_pattern parallel_comp --gpu_nums 1_0 \
    --kv_cache_eviction false --capacity 512 \
    --kv_cache_dynamic false --stage_eviction false \
    --recent_token 8 \
    --topk_windows -3 --query_rank true \
    --query_recent_tokens 0 --reduce_factor 0 \
    --calibration_stage prefill_2_calibration_head_{sink/recent/middle/all}_{layer_i}_{layer_j} --calibration_mode 1 \
    --special_token true \
    --model meta-llama/Llama-2-7b-chat-hf
```
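For example, to compare the effect of calibrating different head regions, you can sweep the `{sink/recent/middle/all}` placeholder while keeping the remaining flags fixed. The sketch below is illustrative only: the layer range `0_3` stands in for `{layer_i}_{layer_j}` and should be replaced with the layers you actually want to calibrate.

```bash
# Illustrative sweep over calibration regions (not an official script).
# "0" and "3" are placeholders for {layer_i} and {layer_j}.
for region in sink recent middle all; do
  bash run_test_longbench_multi_gpu_window8_llama.sh \
      --parallel_pattern parallel_comp --gpu_nums 1_0 \
      --kv_cache_eviction false --capacity 512 \
      --kv_cache_dynamic false --stage_eviction false \
      --recent_token 8 \
      --topk_windows -3 --query_rank true \
      --query_recent_tokens 0 --reduce_factor 0 \
      --calibration_stage "prefill_2_calibration_head_${region}_0_3" --calibration_mode 1 \
      --special_token true \
      --model meta-llama/Llama-2-7b-chat-hf
done
```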
## 🚧 TODO & Roadmap

- [❎] **Code Organization**: Organizing and cleaning up the codebase for better usability
- [❎] **Gemma Support**: Adding full support for the Gemma model family
- [❎] **Baselines**: Adding full support for evaluating baselines
- [❎] **SGLang Integration**: Adding support for the SGLang inference engine for improved performance
- [❎] **Documentation**: Expanding documentation with more detailed examples
- [❎] **Quantization Support**: Adding support for model quantization (INT8/INT4) to reduce memory usage and accelerate inference
- [❎] **Benchmarks**: Adding more comprehensive benchmark results
- [✅] **FlashAttention Support**
- [✅] **Multi-GPU Inference Support**
- [✅] **Batch Inference Support**
- [✅] **AMD GPU Support**

## 📁 Project Structure

```
ParallelComp/
├── run_evaluation_multi_gpu.py                  # Multi-GPU evaluation script
├── model_loaders.py                             # Model loading utilities
├── experiment_manager.py                        # Experiment management
├── pcw_wrapper.py                               # Parallel Context Window wrapper
├── modeling_llama_with_pcw_kv_cache_FlashAttention_longbench.py   # Llama model implementation with PCW
├── modeling_qwen2_with_pcw_kv_cache_FlashAttention_longbench.py   # Qwen2 model implementation with PCW
├── metrics.py                                   # Evaluation metrics
├── eval_longbench.py                            # LongBench dataset evaluation
├── eval_infinitebench.py                        # InfiniteBench dataset evaluation
├── utils.py                                     # General utilities
├── constants.py                                 # Project constants
├── run_test_longbench_multi_gpu_window8_llama.sh   # Llama evaluation script
├── run_test_longbench_multi_gpu_window8_qwen.sh    # Qwen evaluation script
├── scripts/                                     # GPU configuration files
│   ├── gpu_*.yaml                               # GPU configuration files
│   ├── longbench_metrics.sh                     # LongBench metrics script
│   └── infinitebench_metrics.sh                 # InfiniteBench metrics script
├── longbench_config/                            # LongBench configurations
│   ├── dataset2*.json                           # Dataset configuration files
│   ├── model2maxlen*.json                       # Model configuration files
│   └── past/                                    # Historical configurations
├── datasets/                                    # Dataset storage
│   ├── LongBench/                               # LongBench dataset
│   └── gsm8k/                                   # GSM8K dataset
├── my_utils/                                    # Utilities
│   ├── logger.py                                # Logging utilities
│   ├── entropy_utils.py                         # Entropy calculation utilities
│   ├── cache_revise.py                          # Cache revision utilities
│   └── priorityqueue.py                         # Priority queue implementation
├── requirements.txt                             # Python dependencies
└── requirements.in                              # Dependency source file
```

## 📊 Results

ParallelComp achieves significant improvements on long-context tasks:

- **91.17%** of GPT-4's performance with an 8B model trained on an 8K context
- **23.50x** acceleration in the prefilling stage
- **1.76x** improvement in chunk throughput
- Outperforms Claude-2 and Kimi-Chat on long-context benchmarks

## 🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 📚 Citation

If you find this work useful, please cite our paper:

```bibtex
@article{xiong2025parallelcomp,
  title={ParallelComp: Parallel Long-Context Compressor for Length Extrapolation},
  author={Xiong, Jing and Shen, Jianghan and Zheng, Chuanyang and Wan, Zhongwei and Zhao, Chenyang and Yang, Chiwun and Ye, Fanghua and Yang, Hongxia and Kong, Lingpeng and Wong, Ngai},
  journal={arXiv preprint arXiv:2502.14317},
  year={2025}
}
```

## 📞 Contact

For questions and support, please open an issue in this repository or contact the authors.

---

**Note**: This implementation will be fully released soon. Stay tuned for updates!