# LF-ViT

**Repository Path**: khanx/LF-ViT

## Basic Information

- **Project Name**: LF-ViT
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-09-24
- **Last Updated**: 2024-09-24

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition (AAAI 2024)

This is the PyTorch implementation of our paper "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition".

## Pre-trained Models

| Backbone | Location-Stage Size | Accuracy | Checkpoints Google Links | Checkpoints Baidu Links |
|----------|---------------------|----------|--------------------------|-------------------------|
| DeiT-S | 7x7 | 80.8 (m=5, threshold=0.76) | [Google Drive](https://drive.google.com/file/d/1Pb9xgZ46orJ3C-D5MD0i1cLzd-EqTvzv/view?usp=sharing) | [Baidu Drive](https://pan.baidu.com/s/1u2mJ05NSNJxJ6IJJkU--eg) (v435) |
| DeiT-S | 9x9 | 82.2 (m=8, threshold=0.75) | [Google Drive](https://drive.google.com/file/d/1d94vVUqHSA1taqFqd_2xnwM964P2YCzL/view?usp=sharing) | [Baidu Drive](https://pan.baidu.com/s/1QB30WmG1rG2uKiW5aRYsxA) (b69i) |

- What is contained in the checkpoints (see the loading sketch after the Data Preparation section below):

```
**.pth
├── model: state dictionaries of the model
├── flops: a list containing the GFLOPs corresponding to exiting at each stage
├── anytime_classification: Top-1 accuracy of each stage
├── budgeted_batch_classification: results of budgeted batch classification (a two-item list; [0] and [1] correspond to the two coordinates of a curve)
```

## Requirements

- python 3.9.7
- pytorch 1.10.1
- torchvision 0.11.2

## Data Preparation

- The ImageNet dataset should be prepared as follows (a quick loading sanity check is sketched below):

```
ImageNet
├── train
│   ├── folder 1 (class 1)
│   ├── folder 2 (class 2)
│   ├── ...
├── val
│   ├── folder 1 (class 1)
│   ├── folder 2 (class 2)
│   ├── ...
```
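As a quick sanity check of the layout above, both splits can be loaded with `torchvision.datasets.ImageFolder`. This is a minimal sketch, not the repository's own data pipeline: `PATH_TO_IMAGENET` is a placeholder, and the transform is the standard 224x224 ImageNet evaluation preprocessing assumed here for illustration.

```
import os

from torchvision import datasets, transforms

# Placeholder path; point this at your ImageNet root.
data_root = "PATH_TO_IMAGENET"

# Standard ImageNet eval preprocessing (assumed here; the repo defines
# its own transforms in its data-loading code).
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

val_set = datasets.ImageFolder(os.path.join(data_root, "val"), transform=eval_transform)
print(f"{len(val_set)} validation images, {len(val_set.classes)} classes")
```

Before running evaluation, you can also inspect a downloaded checkpoint to confirm it contains the entries listed under Pre-trained Models. A minimal sketch, assuming `PATH_TO_CHECKPOINT` is a placeholder and the key names follow the listing above:

```
import torch

# Placeholder path to a downloaded **.pth checkpoint.
ckpt = torch.load("PATH_TO_CHECKPOINT", map_location="cpu")

print(ckpt.keys())  # expected per the listing above: model, flops, anytime_classification, ...
print("GFLOPs per exit stage:", ckpt["flops"])
print("Top-1 accuracy per stage:", ckpt["anytime_classification"])

# budgeted_batch_classification is a two-item list holding the two
# coordinates of the accuracy-vs-computation curve.
x, y = ckpt["budgeted_batch_classification"]
```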
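The `model` entry can then be loaded into the corresponding network before evaluation, e.g. with `model.load_state_dict(ckpt["model"])`; the evaluation script `dynamic_inference.py` used below handles this for you.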
## Evaluate Pre-trained Models

- Get the accuracy of each stage:

```
CUDA_VISIBLE_DEVICES=0 python dynamic_inference.py --eval-mode 0 --data_url PATH_TO_IMAGENET --batch_size 64 --model lf_deit_small --checkpoint_path PATH_TO_CHECKPOINT --location-stage-size {7,9}
```

- Infer the model on the validation set with a range of thresholds ([0.01:1:0.01]):

```
CUDA_VISIBLE_DEVICES=0 python dynamic_inference.py --eval-mode 1 --data_url PATH_TO_IMAGENET --batch_size 64 --model lf_deit_small --checkpoint_path PATH_TO_CHECKPOINT --location-stage-size {7,9}
```

- Infer the model on the validation set with a single threshold and measure the throughput:

```
CUDA_VISIBLE_DEVICES=0 python dynamic_inference.py --eval-mode 2 --data_url PATH_TO_IMAGENET --batch_size 1024 --model lf_deit_small --checkpoint_path PATH_TO_CHECKPOINT --location-stage-size {7,9} --threshold THRESHOLD
```

- Read the evaluation results saved in the pre-trained models:

```
CUDA_VISIBLE_DEVICES=0 python dynamic_inference.py --eval-mode 3 --data_url PATH_TO_IMAGENET --batch_size 64 --model lf_deit_small --checkpoint_path PATH_TO_CHECKPOINT --location-stage-size {7,9}
```

## Train

- Train LF-ViT on ImageNet:

```
python -m torch.distributed.launch --use_env --nproc_per_node=4 main_deit.py --model lf_deit_small --batch-size 256 --data-path PATH_TO_IMAGENET --location-stage-size {7,9} --dist-eval --output PATH_TO_LOG
```

## Visualization

- Visualize images correctly classified at the location stage and the focus stage:

```
python visualize.py --model lf_deit_small --resume PATH_TO_CHECKPOINT --output_dir PATH_TO_SAVE --data-path PATH_TO_IMAGENET --batch-size 64
```

## Citation

```
@inproceedings{LFViT,
  title={LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition},
  author={Youbing Hu and Yun Cheng and Anqi Lu and Zhiqiang Cao and Dawei Wei and Jie Liu and Zhijun Li},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  pages={2274-2284},
  year={2024}
}
```

## Acknowledgment

Our DeiT code is from [here](https://github.com/facebookresearch/deit). The visualization code is modified from [Evo-ViT](https://github.com/YifanXu74/Evo-ViT). The dynamic inference with early exit is modified from [DVT](https://github.com/blackfeather-wang/Dynamic-Vision-Transformer). Thanks to these authors.