# LCEG
[📜 Paper] • [🤗 HF HUB]
Repo for "A Controlled Study on Long Context Extension and Generalization in LLMs"
## TABLE OF CONTENTS
1. [News](#news)
2. [Installation and Quick Guide](#installation-and-quick-guide)
3. [Long Context Methods Implementation](#long-context-methods-implementation)
- [Training Data](#training-data)
- [Models](#models)
- [Continuous Training](#continuous-training)
4. [Evaluation](#evaluation)
- [Perplexity Validation](#perplexity-validation)
- [Needle in A Haystack](#needle-in-a-haystack)
- [LongBench & ManyShots TREC](#longbench-manyshots-trec)
- [Ruler](#ruler)
5. [Acknowledgements](#acknowledgements)
6. [Citation](#citation)
7. [License](#license)
## 🔥 News
- [2024/09/19] The LCEG paper is available on arXiv.
## 🚀 Installation and Quick Guide
To install and run the evaluation:
1. Clone the repository to your local machine with `git clone`, using this project's URL.
2. Create the conda environment and install the dependencies:
```
conda create -n lceg python=3.10
conda activate lceg
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
pip install -r requirements.txt
```
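After installation, it can be useful to confirm that the pinned PyTorch packages actually landed in the active environment. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of this repository.

```python
from importlib import metadata

# Versions pinned by the install commands above.
PINNED = {"torch": "2.0.1", "torchvision": "0.15.2", "torchaudio": "2.0.2"}


def check_pins(pinned=PINNED):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu117" when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_pins().items():
        print(f"{name}: installed={installed} pinned={PINNED[name]} ok={ok}")
```

A package that is not installed is reported as `(None, False)` rather than raising an error.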
## Long Context Methods Implementation
### Training Data
We follow [Long-Context-Data-Engineering](https://github.com/FranxYao/Long-Context-Data-Engineering) to create our training data.
| Data | Tokens | Examples | Length | Download |
|:---------------|----------|----------|----------|----------|
| Slimpajama_downsample_32k_1B | 1B | 30774 | 32k | [Link](https://huggingface.co/datasets/Leooyii/Slimpajama_downsample_32k_1B) |
| Slimpajama_downsample_64k_1B | 1B | 15386 | 64k | [Link](https://huggingface.co/datasets/Leooyii/Slimpajama_downsample_32k_1B) |
| Slimpajama_downsample_64k_2B | 2B | 30780 | 64k | [Link](https://huggingface.co/datasets/Leooyii/Slimpajama_downsample_32k_1B) |
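As a quick sanity check on the table, the token counts can be recovered from the number of examples and the sequence length. This sketch assumes that "32k" and "64k" mean 32768 and 65536 tokens and that every example is exactly that long; neither assumption is stated in the table itself.

```python
# Rows from the table above: name -> (number of examples, sequence length in tokens).
# Assumption: "32k"/"64k" mean 32768/65536 tokens and every example is full length.
DATASETS = {
    "Slimpajama_downsample_32k_1B": (30774, 32768),
    "Slimpajama_downsample_64k_1B": (15386, 65536),
    "Slimpajama_downsample_64k_2B": (30780, 65536),
}


def total_tokens(examples, length):
    """Total tokens in a dataset of fixed-length examples."""
    return examples * length


for name, (examples, length) in DATASETS.items():
    print(f"{name}: {total_tokens(examples, length) / 1e9:.3f}B tokens")
```

Under these assumptions the three datasets come out at roughly 1.008B, 1.008B, and 2.017B tokens, which matches the 1B and 2B labels.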
### Models
Models with continuous fine-tuning.
| Model | Size | Context | Training Tokens | Link |
|:------------------------------------|------|---------|-----------------|-------------------------------------------------------------------|
| Llama2-7b-hf-slimpajama1B-ntk-32k | 7B | 32768 | 1B | [Model](https://huggingface.co/Leooyii/NTK_32k_Slimpajama_1B) |
| Llama2-7b-hf-slimpajama1B-ntk-64k | 7B | 65536 | 1B | [Model](https://huggingface.co/Leooyii/NTK_64k_Slimpajama_1B) |
| Llama2-7b-hf-slimpajama2B-ntk-64k | 7B | 65536 | 2B | [Model](https://huggingface.co/Leooyii/NTK_64k_Slimpajama_2B) |
| Llama2-7b-hf-slimpajama1B-pi-32k | 7B | 32768 | 1B | [Model](https://huggingface.co/Leooyii/PI_32k_Slimpajama_1B) |
| Llama2-7b-hf-slimpajama1B-yarn-32k | 7B | 32768 | 1B | [Model](https://huggingface.co/Leooyii/YaRN_32k_Slimpajama_1B) |
| Llama2-7b-hf-slimpajama1B-longlora-32k | 7B | 32768 | 1B | [Model](https://huggingface.co/Leooyii/Longlora_32k_Slimpajama_1B) |
| Llama2-7b-hf-slimpajama1B-CLEX-32k | 7B | 32768 | 1B | [Model](TODO) |
| Llama2-7b-hf-slimpajama1B-landmark-512 | 7B | - | 1B | [Model](https://huggingface.co/Leooyii/Landmark_512_Slimpajama_1B) |
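The `pi` and `ntk` models in the table extend the context window by rescaling the rotary position embedding (RoPE). The sketch below shows the commonly cited textbook forms of Position Interpolation and NTK-aware base scaling for going from a 4096-token base context to 32768 tokens. It is an illustration only: the function names are hypothetical, the constants are the usual Llama-2-7B values, and the code in this repository may use a different variant (for example dynamic NTK). YaRN, LongLoRA, CLEX, and Landmark Attention are not covered here.

```python
import math

HEAD_DIM = 128       # Llama-2-7B attention head dimension (4096 hidden / 32 heads)
ROPE_BASE = 10000.0  # default RoPE base
ORIG_CTX = 4096      # Llama-2 pre-training context length


def inv_freq(base=ROPE_BASE, dim=HEAD_DIM):
    """RoPE inverse frequencies theta_i = base^(-2i/dim), i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def pi_angles(position, target_ctx, orig_ctx=ORIG_CTX):
    """Position Interpolation: divide the position index by the scale factor."""
    scale = target_ctx / orig_ctx
    return [(position / scale) * f for f in inv_freq()]


def ntk_inv_freq(target_ctx, orig_ctx=ORIG_CTX, base=ROPE_BASE, dim=HEAD_DIM):
    """NTK-aware scaling: enlarge the base by scale^(dim / (dim - 2))."""
    scale = target_ctx / orig_ctx
    return inv_freq(base * scale ** (dim / (dim - 2)), dim)


if __name__ == "__main__":
    plain, ntk = inv_freq(), ntk_inv_freq(32768)
    print("highest frequency:", plain[0], ntk[0])
    print("lowest frequency ratio:", plain[-1] / ntk[-1])
```

With this choice of exponent, NTK-aware scaling leaves the highest frequency untouched and divides the lowest frequency by the scale factor (8 here), whereas Position Interpolation compresses all frequencies uniformly by mapping position 32768 onto position 4096.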
### Continuous Training
We provide our scripts for continuous fine-tuning on these long-context methods in `finetune.sh`.
Training requires DeepSpeed. We used the configuration file `continuous_finetuning/ds_configs/stage3_offload.json`.
**Setup `finetune.sh`**
```
cd continuous_finetuning
# set the methods and training config in finetune.sh
bash finetune.sh
```
In `finetune.sh`, we provide 3 scripts for continuous fine-tuning on 6 methods: `origin`, `pi`, `ntk`, `yarn`, `longlora`, and `landmark`. Here is an example:
```bash
torchrun --nproc_per_node=8 fine-tune.py \
--model_name_or_path "meta-llama/Llama-2-7b-hf" \
--bf16 True \
--output_dir ckpts/llama2-7b-hf-slimpajama-pi-32k \
--model_max_length 32768 \
--use_flash_attn True \
--low_rank_training False \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 32 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--warmup_steps 20 \
--deepspeed ds_configs/stage3_offload.json \
--lr_scheduler_type "constant_with_warmup" \
--logging_steps 1 \
--tf32 True \
--report_to "wandb" \
--use_wandb True \
--dataset_dir Leooyii/Slimpajama_downsample_32k_1B \
--method_name pi # option:[origin, pi, ntk, yarn]
```
- You can train different long-context methods by changing `--method_name`.
- Set `--model_name_or_path` and `--output_dir` to your own paths.
- You can change `--model_max_length` to other values.
- To train LongLoRA, use the 'Scripts for Longlora' section in `finetune.sh`.
- To train Landmark Attention, use the 'Scripts for Landmark Attention' section in `finetune.sh`.
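The example command above implies an effective batch size that is worth working out before changing the GPU count or sequence length. The following sketch computes it from the flags shown; the example count 30774 is taken from the 32k training data table, and the resulting step count is approximate because it depends on how the trainer handles the final partial batch.

```python
def batch_geometry(num_gpus, per_device_batch, grad_accum, seq_len):
    """Return (sequences per optimizer step, tokens per optimizer step)."""
    sequences = num_gpus * per_device_batch * grad_accum
    return sequences, sequences * seq_len


# Values from the example command: --nproc_per_node=8,
# --per_device_train_batch_size 1, --gradient_accumulation_steps 32,
# --model_max_length 32768.
sequences, tokens = batch_geometry(8, 1, 32, 32768)

# 30774 examples in Slimpajama_downsample_32k_1B, one epoch.
approx_steps = 30774 / sequences

print(f"{sequences} sequences/step, {tokens} tokens/step, ~{approx_steps:.1f} steps")
```

This gives 256 sequences (8,388,608 tokens) per optimizer step and about 120 optimizer steps for one epoch. If you use fewer GPUs, raising `--gradient_accumulation_steps` proportionally keeps the effective batch size unchanged.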
## Evaluation
### Perplexity Validation
We provide our scripts for perplexity validation on PG19 and Proof-pile in `eval_perplexity/scripts`. We use the tokenized test splits of the PG19 and Proof-pile datasets processed by LongLoRA. The raw and tokenized data are in the `eval_perplexity/data` folder.
```
cd eval_perplexity
python eval_pi.py \
--seq_len 32768 \
--batch_size 1 \
--base_model path_to_checkpoints \
--data_path data/pg19/test.bin \
--output_dir results/pg19/pi_pg19.json
```
- `--seq_len` sets the sequence length used for evaluation.
- Set `--base_model` and `--output_dir` to your own paths.
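For reference, perplexity at a given `--seq_len` is the exponential of the mean per-token negative log-likelihood over the evaluation windows. The sketch below illustrates that definition on plain Python lists. It assumes non-overlapping windows and natural-log losses; the function names are hypothetical, and the evaluation scripts in this repository may window the data differently (for example with a sliding stride).

```python
import math


def split_windows(token_ids, seq_len):
    """Split a token stream into consecutive non-overlapping windows.

    A trailing window shorter than seq_len is dropped.
    """
    n = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n)]


def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood, in nats)."""
    if not nll_per_token:
        raise ValueError("need at least one token loss")
    return math.exp(sum(nll_per_token) / len(nll_per_token))


if __name__ == "__main__":
    windows = split_windows(list(range(10)), seq_len=4)
    print(len(windows), "windows")
    # A model that assigns probability 1/4 to every token has perplexity 4.
    print(perplexity([math.log(4)] * 8))
```

In a real run, the per-token losses would come from the model's cross-entropy over each window rather than from a fixed list.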
### Needle in A Haystack
**Setup `eval.sh`**
```
cd needle
bash eval.sh
```
- Evaluation at 64k context length requires one 80 GB A100; evaluation at 128k requires four 80 GB A100s.
- Set the method name and sequence length in `eval.sh`.
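The idea behind the needle test is to hide a short fact at a chosen depth inside a long distractor context and then ask the model to retrieve it. The sketch below shows one simple way to place the needle and enumerate a grid of context lengths and depths. It works on word lists for clarity; the function names are hypothetical, and the script in `needle/` may tokenize the text and pick insertion points differently.

```python
def insert_needle(haystack_words, needle_words, depth_percent):
    """Place the needle at depth_percent (0 = start, 100 = end) of the haystack."""
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    cut = round(len(haystack_words) * depth_percent / 100)
    return haystack_words[:cut] + needle_words + haystack_words[cut:]


def sweep(context_lengths, depths):
    """Enumerate every (context length, depth) pair to evaluate."""
    return [(length, depth) for length in context_lengths for depth in depths]


if __name__ == "__main__":
    haystack = ["filler"] * 8
    needle = ["the", "secret", "is", "42"]
    print(" ".join(insert_needle(haystack, needle, 50)))
    print(sweep([32768, 65536], [0, 50, 100]))
```

Each grid cell is typically scored by whether the model's answer contains the hidden fact.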
### LongBench & ManyShots TREC
The data to evaluate LongBench and ManyShots TREC is available at [LongBench](https://huggingface.co/datasets/Leooyii/longbench) and [ManyShots TREC](https://huggingface.co/datasets/Leooyii/manyshots_trec).
We provide our scripts to evaluate LongBench and ManyShots TREC in `longbench/scripts/eval_llama2.sh`.
**Setup `eval_llama2.sh`**
To evaluate LongBench, set the datasets in `longbench/scripts/eval_llama2.sh`:
```
# longbench
datasets=("narrativeqa" "qasper" "multifieldqa_en" "hotpotqa" "2wikimqa" "musique" \
"gov_report" "qmsum" "multi_news" "trec" "triviaqa" "samsum" \
"passage_count" "passage_retrieval_en" "lcc" "repobench-p")
```
To evaluate ManyShots TREC, set the datasets in `longbench/scripts/eval_llama2.sh`:
```
datasets=("trec_1000shots" "trec_875shots" "trec_750shots" "trec_625shots" "trec_500shots" \
"trec_400shots" "trec_300shots" "trec_200shots" "trec_100shots" "trec_75shots" \
"trec_50shots" "trec_25shots" "trec_10shots" "trec_5shots" "trec_1shots")
```
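The ManyShots TREC dataset names follow a regular `trec_<N>shots` pattern, so a subset can be generated rather than typed by hand. This is a small convenience sketch; the helper names are hypothetical, and the shot counts are copied from the list above.

```python
# Shot counts taken from the dataset list above.
SHOT_COUNTS = [1000, 875, 750, 625, 500, 400, 300, 200, 100, 75, 50, 25, 10, 5, 1]


def trec_dataset_names(shot_counts=SHOT_COUNTS):
    """Build dataset names following the trec_<N>shots pattern."""
    return [f"trec_{n}shots" for n in shot_counts]


def as_bash_array(names, var="datasets"):
    """Format names as a bash array assignment for eval_llama2.sh."""
    quoted = " ".join(f'"{name}"' for name in names)
    return f"{var}=({quoted})"


if __name__ == "__main__":
    # Example: only evaluate settings with at most 100 shots.
    small = [n for n in SHOT_COUNTS if n <= 100]
    print(as_bash_array(trec_dataset_names(small)))
```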
After setting up the datasets and models, run `eval_llama2.sh`:
```
cd longbench
bash scripts/eval_llama2.sh
```
The model outputs for the selected datasets are written to the `longbench/pred/` folder.
**Get the score using `score.sh`**
Run `longbench/scripts/score.sh` to evaluate all the long-context methods.
```
bash scripts/score.sh
```
### RULER
**Requirements**
To evaluate RULER, please follow their guidance to create a new environment for evaluation. More details can be found at [RULER Requirements](https://github.com/hsiehjackson/RULER?tab=readme-ov-file#-requirements).
**Setup `run.sh`**
```
GPUS="" # number of GPUs
ROOT_DIR="" # the path that stores generated task samples and model predictions.
MODEL_DIR="" # the path that contains individual model folders from Huggingface.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
```
- Evaluation at 32k context length requires one 80 GB A100; evaluation at 64k requires two 80 GB A100s.
**Setup `config_models.sh`**
```
case $MODEL_NAME in
llama2-7b-hf-lminfinite)
MODEL_PATH=YOUR_MODEL_FOLDER
MODEL_TEMPLATE_TYPE="base"
MODEL_FRAMEWORK="hf"
TOKENIZER_PATH=${MODEL_PATH}
TOKENIZER_TYPE="hf"
;;
llama-2-7b-hf-slimpajama-pi-32k)
MODEL_PATH=YOUR_MODEL_FOLDER
MODEL_TEMPLATE_TYPE="base"
MODEL_FRAMEWORK="vllm"
TOKENIZER_PATH=${MODEL_PATH}
TOKENIZER_TYPE="hf"
;;
llama-2-7b-hf-slimpajama-ntk-32k)
MODEL_PATH=YOUR_MODEL_FOLDER
MODEL_TEMPLATE_TYPE="base"
MODEL_FRAMEWORK="hf"
TOKENIZER_PATH=${MODEL_PATH}
TOKENIZER_TYPE="hf"
;;
esac
```
For NTK, LM-Infinite, and Landmark Attention methods, please set `MODEL_FRAMEWORK="hf"`.
**Start evaluation**
```
bash run.sh YOUR_MODEL_NAME synthetic
```
**Get the score using `eval.sh`**
```
eval_methods=("llama2-7b-hf" "llama2-7b-hf-lminfinite" "llama2-7b-hf-ntk-frozen" "llama-2-7b-hf-slimpajama-pi-32k" \
"llama-2-7b-hf-slimpajama-ntk-32k" "llama2-7b-hf-slimpajama-ntk-64k" "llama2-7b-hf-slimpajama-ntk-64k-2B" \
"llama2-7b-hf-slimpajama-yarn-32k" "llama2-7b-hf-slimpajama-longlora-32k" "llama2-7b-hf-slimpajama-landmark")
eval_length=(4096 8192 16384 32768 65536)
for method in "${eval_methods[@]}";
do
for length in "${eval_length[@]}";
do
python eval/evaluate.py \
--data_dir /results/${method}/synthetic/${length}/pred \
--benchmark synthetic
done
done
```
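Because the scoring loop above visits every method and length combination, it can help to check beforehand which prediction directories actually exist. The sketch below mirrors the `<root>/<method>/synthetic/<length>/pred` layout used by the loop; the helper names are hypothetical, and the root directory is an argument you would set to match your own `ROOT_DIR`.

```python
from pathlib import Path


def prediction_dirs(root, methods, lengths):
    """Build <root>/<method>/synthetic/<length>/pred for every combination."""
    return [
        Path(root) / method / "synthetic" / str(length) / "pred"
        for method in methods
        for length in lengths
    ]


def missing_dirs(root, methods, lengths):
    """Return the prediction directories that do not exist yet."""
    return [p for p in prediction_dirs(root, methods, lengths) if not p.is_dir()]


if __name__ == "__main__":
    methods = ["llama2-7b-hf", "llama-2-7b-hf-slimpajama-pi-32k"]
    lengths = [4096, 8192, 16384, 32768, 65536]
    for path in missing_dirs("/results", methods, lengths):
        print("missing:", path)
```

Any path printed here corresponds to a run that has not finished, so scoring it would fail or produce empty results.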
## Acknowledgements
We sincerely appreciate the help of the following people and projects:
- We thank Yao Fu, Yue Yu, Tianyu Gao, Celine Lee, Woojeong Kim, Jack Morris, Junxiong Wang, and Oscar Yin for their suggestions and feedback.
- Some evaluation code is adapted from [Landmark Attention](https://github.com/epfml/landmark-attention), [LongLoRA](https://github.com/dvlab-research/LongLoRA), [LongBench](https://github.com/THUDM/LongBench), [Long Context Data Engineering](https://github.com/FranxYao/Long-Context-Data-Engineering) and [RULER](https://github.com/hsiehjackson/RULER).
## Citation
If you find this work helpful, please cite the paper.