# MobileViTv3

**Repository Path**: zhouweic36/MobileViTv3

## Basic Information

- **Project Name**: MobileViTv3
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-08-12
- **Last Updated**: 2024-08-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# MobileViTv3 : Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features [[arXiv](https://arxiv.org/abs/2209.15159)]

This repository contains MobileViTv3's source code for training and evaluation.
It uses the [CVNets](https://arxiv.org/pdf/2206.02002.pdf) library and is inspired by MobileViT ([paper](https://arxiv.org/abs/2110.02178?context=cs.LG), [code](https://github.com/apple/ml-cvnets)).

## Installation and Training Models:
We recommend to use Python 3.8+ and [PyTorch](https://pytorch.org) (version >= v1.8.0) with `conda` environment.
For setting-up the python environment with conda, see [here](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).


### MobileViTv3\-S,XS,XXS
Download [MobileViTv1](https://github.com/apple/ml-cvnets/tree/d38a116fe134a8cd5db18670764fdaafd39a5d4f) and replace the files provided in [MobileViTv3-v1](MobileViTv3-v1).
Conda environment used for training: [environment_cvnet.yml](MobileViTv3-v1).
Then install according to instructions provided in the downloaded repository.
For training, use `training-and-evaluation readme` provided in the downloaded repository.


### MobileViTv3\-1.0,0.75,0.5
Download [MobileViTv2](https://github.com/apple/ml-cvnets/tree/84d992f413e52c0468f86d23196efd9dad885e6f) and replace the files provided in [MobileViTv3-v2](MobileViTv3-v2).
Conda environment used for training: [environment_mbvt2.yml](MobileViTv3-v2)
Then install according to instructions provided in the downloaded repository.
For training, use `training-and-evaluation readme` provided in the downloaded repository.


## Trained models:

Download the trained MobileViTv3 models from [here](https://github.com/micronDLA/MobileViTv3/releases/tag/v1.0.0).
`checkpoint_ema_best.pt` files inside the model folder is used to generated the accuracy of models.
Low-latency models are build by reducing the number of MobileViTv3-blocks in 'layer4' from 4 to 2.
Please refer to the paper for more details.
Note that for the segmentation and detection, only the backbone architecture parameters are listed.

## Classification 

### ImageNet-1K:
| Model name | Accuracy (%) | Parameters (Million) | FLOPs (Million) | Foldername  |
| :---: | :---: | :---: | :---: | :---: |
| MobileViTv3\-S | 79.3 | 5.8 | 1841 | mobilevitv3\_S\_e300\_7930 |
| MobileViTv3\-XS | 76.7 | 2.5 | 927 | mobilevitv3\_XS\_e300\_7671 |
| MobileViTv3\-XXS | 70.98 | 1.2 | 289 | mobilevitv3\_XXS\_e300\_7098 |
| MobileViTv3\-1.0 | 78.64 | 5.1 | 1876 | mobilevitv3\_1\_0\_0 |
| MobileViTv3\-0.75 | 76.55 | 3.0 | 1064 | mobilevitv3\_0\_7\_5 |
| MobileViTv3\-0.5 | 72.33 | 1.4 | 481 | mobilevitv3\_0\_5\_0 |


### ImageNet-1K using low-latency models:
| Model name | Accuracy (%) | Parameters (Million) | FLOPs (Million) | Foldername  |
| :---: | :---: | :---: | :---: | :---: |
| MobileViTv3\-S-L2 | 79.06 | 5.2 | 1651 | mobilevitv3\_S\_L2\_e300\_7906 |
| MobileViTv3\-XS-L2 | 76.10 | 2.3 | 853 | mobilevitv3\_XS\_L2\_e300\_7610 |
| MobileViTv3\-XXS-L2 | 70.23 | 1.1 | 256 | mobilevitv3\_XXS\_L2\_e300\_7023 |

## Segmentation

### PASCAL VOC 2012:
| Model name | mIoU | Parameters (Million) | Foldername  |
| :---: | :---: | :---: | :---: |
| MobileViTv3\-S | 79.59 | 7.2 | mobilevitv3\_S\_voc\_e50\_7959 |
| MobileViTv3\-XS | 78.77 | 3.3 | mobilevitv3\_XS\_voc\_e50\_7877 |
| MobileViTv3\-XXS | 74.04 | 2.0 | mobilevitv3\_XXS\_voc\_e50\_7404 |
| MobileViTv3\-1.0 | 80.04 | 13.6 | mobilevitv3\_voc\_1\_0\_0 |
| MobileViTv3\-0.5 | 76.48 | 6.3 | mobilevitv3\_voc\_0\_5\_0 |

### ADE20K:
| Model name | mIoU | Parameters (Million) | Foldername  |
| :---: | :---: | :---: | :---: |
| MobileViTv3\-1.0 | 39.13 | 13.6 | mobilevitv3\_ade20k\_1\_0\_0 |
| MobileViTv3\-0.75 | 36.43 | 9.7 | mobilevitv3\_ade20k\_0\_7\_5  |
| MobileViTv3\-0.5 | 33.57 | 6.4 | mobilevitv3\_ade20k\_0\_5\_0 |

## Detection MS-COCO:
| Model name | mAP | Parameters (Million) | Foldername  |
| :---: | :---: | :---: | :---: |
| MobileViTv3\-S | 27.3 | 5.5 | mobilevitv3\_S\_coco\_e200\_2730 |
| MobileViTv3\-XS | 25.6 | 2.7 | mobilevitv3\_XS\_coco\_e200\_2560 |
| MobileViTv3\-XXS | 19.3 | 1.5 | mobilevitv3\_XXS\_coco\_e200\_1930 |
| MobileViTv3\-1.0 | 27.0 | 5.8 | mobilevitv3\_coco\_1\_0\_0 |
| MobileViTv3\-0.75 | 25.0 | 3.7 | mobilevitv3\_coco\_0\_7\_5 |
| MobileViTv3\-0.5 | 21.8 | 2.0 | mobilevitv3\_coco\_0\_5\_0 |


## Citation

If you find this repository useful, please consider giving a star :star: and citation :mega::
```
@inproceedings{wadekar2022mobilevitv3,
  title = {MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features},
  author = {Wadekar, Shakti N. and Chaurasia, Abhishek},
  doi = {10.48550/ARXIV.2209.15159},
  year = {2022}
}
```