# staged-training

In our paper [**Staged Training for Transformer Language Models**](https://arxiv.org/abs/2203.06211), we propose a staged training setup that begins with a small model and incrementally increases the amount of compute used for training by applying a "growth operator" to increase the model depth and width. By initializing each stage with the output of the previous one, the training process effectively reuses the compute from prior stages and becomes more efficient. We release the reproducible code for the growth operators and the evaluation scripts here.

## Setup

The scripts in this repository require Python 3.7 or newer. Once you have a suitable Python environment, first install PyTorch v1.9.0 according to the [official instructions](https://pytorch.org/get-started/previous-versions/#v190). Then run

```
pip install -r requirements.txt
```

## Growth Operator

Our growth operators (width and depth) each take the entire training state as input (including model parameters, optimizer state, learning rate schedule, etc.) and output a new training state from which training continues. Please see `scripts/cheatsheet.txt` for more examples of how to use the corresponding scripts; an illustrative sketch of both operators follows the two example commands below.

For example, you can apply the width operator with:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/gpt_pretrain.py \
    --save_prefix final_gpt2_large_div2_width_check_bs512_lr0.0020_warmup3k_seqlen1024_debug \
    --gpu_count -1 \
    --model gpt2 \
    --tokenizer gpt2 \
    --batch_size 4 \
    --grad_accum 32 \
    --lr 0.002006911598778545 \
    --warmup_steps 3000 \
    --train_steps 250000 \
    --val_every 50 \
    --val_batches 50 \
    --fp16 \
    --seqlen 1024 \
    --log_rate 10 \
    --num_workers 4 \
    --size GPT2_large_div2_width \
    --random \
    --resume final_runs/final_gpt2_large_div2_width_check_bs512_lr0.0021_warmup3k_seqlen1024_debug/checkpoint-xxx.ckpt \
    --doubling weights
```

Or the depth operator with:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/gpt_pretrain.py \
    --save_prefix final_gpt2_large_div2_depthx2_check_bs512_lr0.0020_warmup3k_seqlen1024_debug \
    --gpu_count -1 \
    --model gpt2 \
    --tokenizer gpt2 \
    --batch_size 4 \
    --grad_accum 32 \
    --lr 0.002006911598778545 \
    --warmup_steps 3000 \
    --train_steps 250000 \
    --val_every 50 \
    --val_batches 50 \
    --fp16 \
    --seqlen 1024 \
    --log_rate 10 \
    --num_workers 4 \
    --size GPT2_large_div2_depth \
    --random \
    --resume final_runs/final_gpt2_large_div2_depth_check_bs512_lr0.0020_warmup3k_seqlen1024_debug/checkpoint-epoch=0-step=6499.ckpt \
    --doubling layers
```
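To illustrate the idea, here is a minimal sketch of the two growth directions on a toy two-layer MLP. This is not the repository's implementation: `grow_width_mlp` and `grow_depth` are hypothetical helper names, and the real operators in `scripts/gpt_pretrain.py` act on full GPT-2 checkpoints (weights plus optimizer state) with the scaling described in the paper.

```python
# Illustrative sketch only: function-preserving width doubling of a small MLP,
# plus naive layer duplication for depth. Helper names are hypothetical.
import copy
import torch
import torch.nn as nn


def grow_width_mlp(fc_in: nn.Linear, fc_out: nn.Linear):
    """Double the hidden width of an fc_in -> activation -> fc_out block."""
    new_in = nn.Linear(fc_in.in_features, fc_in.out_features * 2)
    new_out = nn.Linear(fc_out.in_features * 2, fc_out.out_features)
    with torch.no_grad():
        # Duplicate the incoming projection so each hidden unit appears twice.
        new_in.weight.copy_(torch.cat([fc_in.weight, fc_in.weight], dim=0))
        new_in.bias.copy_(torch.cat([fc_in.bias, fc_in.bias], dim=0))
        # Halve the outgoing projection so the duplicated units sum back to the original output.
        new_out.weight.copy_(torch.cat([fc_out.weight / 2, fc_out.weight / 2], dim=1))
        new_out.bias.copy_(fc_out.bias)
    return new_in, new_out


def grow_depth(layers: nn.ModuleList) -> nn.ModuleList:
    """Naively duplicate every layer in place (layer i is copied to positions 2i and 2i+1)."""
    return nn.ModuleList([copy.deepcopy(layer) for layer in layers for _ in range(2)])


if __name__ == "__main__":
    fc_in, fc_out = nn.Linear(16, 32), nn.Linear(32, 16)
    big_in, big_out = grow_width_mlp(fc_in, fc_out)
    x = torch.randn(4, 16)
    small = fc_out(torch.relu(fc_in(x)))
    big = big_out(torch.relu(big_in(x)))
    print(torch.allclose(small, big, atol=1e-6))  # width growth preserves the output
```

The width sketch is exactly function-preserving because the outgoing weights of each duplicated hidden unit are halved; the naive depth duplication shown here is not, whereas the operators in the paper are constructed so that both the loss and the training dynamics are preserved.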
## Evaluation

Use `evaluation/eval_wikitext.py` or `evaluation/eval_lambada.py` to evaluate [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) on one of the supported datasets. For example:

```bash
python evaluation/eval_wikitext.py
```

Or using Docker:

```bash
docker build -t evaluation:latest .
docker run --rm --gpus all evaluation:latest evaluation/eval_wikitext.py
```

## Reference

If you use staged training in your research or wish to refer to the baseline results published here, please use the following BibTeX entry.

```
@misc{shen2022staged,
    title={Staged Training for Transformer Language Models},
    author={Sheng Shen and Pete Walsh and Kurt Keutzer and Jesse Dodge and Matthew Peters and Iz Beltagy},
    year={2022},
    eprint={2203.06211},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```