# lite-transformer

**Repository Path**: bobosui/lite-transformer

## Basic Information

- **Project Name**: lite-transformer
- **Description**: [ICLR 2020] Lite Transformer with Long-Short Range Attention
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-07-16
- **Last Updated**: 2024-06-01

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Lite Transformer with Long-Short Range Attention

```
@inproceedings{Wu2020LiteTransformer,
  title={Lite Transformer with Long-Short Range Attention},
  author={Zhanghao Wu* and Zhijian Liu* and Ji Lin and Yujun Lin and Song Han},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}
```

## Overview

We release the PyTorch code for the Lite Transformer. [[Paper](https://arxiv.org/abs/2004.11886) | [Website](https://hanlab.mit.edu/projects/litetransformer/) | [Slides](https://zhanghaowu.me/assets/pdf/Presentation_LiteTransformer.pdf)]:

![overview](figures/overview.png?raw=true "overview")

### Consistent Improvement by Tradeoff Curves

![tradeoff](figures/tradeoff.png?raw=true "tradeoff")

### Save 20000x Searching Cost of Evolved Transformer

![et](figures/et.png?raw=true "et")

### Further Compress Transformer by 18.2x

![compression](figures/compression.png?raw=true "compression")

## How to Use

### Prerequisites

* Python version >= 3.6
* [PyTorch](http://pytorch.org/) version >= 1.0.0
* configargparse >= 0.14
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)

### Installation

1. Codebase

   To install fairseq from source and develop locally:

   ```bash
   pip install --editable .
   ```

2. Customized modules

   We also need to build the `lightconv` and `dynamicconv` kernels for GPU support.

   Lightconv_layer:

   ```bash
   cd fairseq/modules/lightconv_layer
   python cuda_function_gen.py
   python setup.py install
   ```

   Dynamicconv_layer:

   ```bash
   cd fairseq/modules/dynamicconv_layer
   python cuda_function_gen.py
   python setup.py install
   ```
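Before moving on to data preparation, one can run a quick sanity check to confirm the prerequisites are met and the compiled kernels import. This is only a sketch: the extension module names `lightconv_cuda` and `dynamicconv_cuda` are assumptions based on the upstream fairseq build scripts and may differ in your environment.

```bash
# Optional sanity check. The extension names in the last line are assumptions
# (taken from the upstream fairseq setup scripts); adjust them if your build differs.
python -c "import sys; assert sys.version_info >= (3, 6), 'Python >= 3.6 is required'"
python -c "import torch; print('PyTorch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"
python -c "import configargparse; print('configargparse OK')"
python -c "import lightconv_cuda, dynamicconv_cuda; print('custom CUDA kernels OK')"
```

If the last import fails, rebuild the two modules from the Installation step above inside the same Python environment used for training.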
### Data Preparation

#### IWSLT'14 De-En

We follow the data preparation in [fairseq](https://github.com/pytorch/fairseq). To download and preprocess the data, one can run

```bash
bash configs/iwslt14.de-en/prepare.sh
```

#### WMT'14 En-Fr

We follow the data pre-processing in [fairseq](https://github.com/pytorch/fairseq). To download and preprocess the data, one can run

```bash
bash configs/wmt14.en-fr/prepare.sh
```

#### WMT'16 En-De

We follow the data pre-processing in [fairseq](https://github.com/pytorch/fairseq). One should first download the preprocessed data from the [Google Drive](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) provided by Google. To binarize the data, one can run

```bash
bash configs/wmt16.en-de/prepare.sh [path to the downloaded zip file]
```

#### WIKITEXT-103

As the language model task requires a lot of additional code, we place it in another branch: `language-model`. We follow the data pre-processing in [fairseq](https://github.com/pytorch/fairseq). To download and preprocess the data, one can run

```bash
git checkout language-model
bash configs/wikitext-103/prepare.sh
```

### Testing

For example, to test the models on WMT'14 En-Fr, one can run

```bash
configs/wmt14.en-fr/test.sh [path to the model checkpoints] [gpu-id] [test|valid]
```

For instance, to evaluate Lite Transformer on GPU 0 (with the BLEU score on the test set of WMT'14 En-Fr), one can run

```bash
configs/wmt14.en-fr/test.sh embed496/ 0 test
```

We provide several pretrained models at the bottom. You can download a model and extract the file by

```bash
tar -xzvf [filename]
```

### Training

We provide several examples to train Lite Transformer with this repo.

To train Lite Transformer on WMT'14 En-Fr (with 8 GPUs), one can run

```bash
python train.py data/binary/wmt14_en_fr --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml
```

To train Lite Transformer with fewer GPUs, e.g. 4 GPUs, one can run

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data/binary/wmt14_en_fr --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml --update-freq 32
```

In general, to train a model, one can run

```bash
python train.py [path to the data binary] --configs [path to config file] [override options]
```

Note that `--update-freq` should be adjusted according to the number of GPUs so that the effective batch size stays the same (16 for 8 GPUs, 32 for 4 GPUs).

### Distributed Training (optional)

To train Lite Transformer in a distributed manner, for example on two GPU nodes with 16 GPUs in total:

```bash
# On host1
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 --node_rank=0 \
    --master_addr=host1 --master_port=8080 \
    train.py data/binary/wmt14_en_fr \
    --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml \
    --distributed-no-spawn \
    --update-freq 8

# On host2
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 --node_rank=1 \
    --master_addr=host1 --master_port=8080 \
    train.py data/binary/wmt14_en_fr \
    --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml \
    --distributed-no-spawn \
    --update-freq 8
```

## Models

We provide the checkpoints for our Lite Transformer reported in the paper:

| Dataset | \#Mult-Adds | Test Score | Model and Test Set |
|:--:|:--:|:--:|:--:|
| [WMT'14 En-Fr](http://statmt.org/wmt14/translation-task.html#Download) | 90M | 35.3 | [download](https://drive.google.com/open?id=10Iotg0dnt9sJTqEghtNhIIwJL1R3LYBe) |
| | 360M | 39.1 | [download](https://drive.google.com/open?id=10WMpIrdnDRWa_7afYJsqiiONdWlTLrJs) |
| | 527M | 39.6 | [download](https://drive.google.com/open?id=10Wfv80wOTkL-hkXNyxM8IVlcroHuuUvA) |
| [WMT'16 En-De](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | 90M | 22.5 | [download](https://drive.google.com/open?id=10ArxzUsMZ8gDe6zw5d3xTHYmeUasys1q) |
| | 360M | 25.6 | [download](https://drive.google.com/open?id=10Fd1iXFiOtuwjxm1K8S2RqiEeCuDhxYn) |
| | 527M | 26.5 | [download](https://drive.google.com/open?id=10HYj-rcJ4CIPp-BtpckkmYIgzH5Urrz0) |
| [CNN / DailyMail](https://github.com/abisee/cnn-dailymail) | 800M | 38.3 (R-L) | [download](https://drive.google.com/open?id=14sQZ_H7HMQGhL7Ko1WkktWUvbEslOeu9) |
| [WIKITEXT-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | 1147M | 22.2 (PPL) | [download](https://drive.google.com/file/d/14gT1j5VERgtDFfo2Ef1yOiliT9Y2eKe_/view?usp=sharing) |
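As a concrete end-to-end example, the commands below sketch how one might fetch the 527M Mult-Adds WMT'14 En-Fr checkpoint from the table above and score it with the provided test script. The use of `gdown` and the extracted directory name `embed496/` are assumptions (the Testing section suggests this layout); any other way of downloading from Google Drive works equally well, and the archive may unpack under a different name.

```bash
# A sketch, not a verified recipe: download, extract, and evaluate one checkpoint.
# Assumptions: `gdown` is available and the archive unpacks into `embed496/`.
pip install gdown
gdown "https://drive.google.com/uc?id=10Wfv80wOTkL-hkXNyxM8IVlcroHuuUvA" -O wmt14.en-fr.embed496.tar.gz
tar -xzvf wmt14.en-fr.embed496.tar.gz
configs/wmt14.en-fr/test.sh embed496/ 0 test   # BLEU on the WMT'14 En-Fr test set, GPU 0
```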