As a sanity check, run evaluation using our ImageNet fine-tuned models:
| | ViT-Base | ViT-Large | ViT-Huge |
|---|---|---|---|
| fine-tuned checkpoint | download | download | download |
| md5 | `1b25e9` | `51f550` | `2541f2` |
| reference ImageNet accuracy | 83.664 | 85.952 | 86.928 |
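To verify a downloaded checkpoint against the md5 prefixes in the table, a small check along these lines can be used (the filenames are taken from the evaluation commands below; adjust the paths to wherever you saved the files):

```python
import hashlib

# md5 prefixes from the table above, keyed by checkpoint filename.
EXPECTED = {
    "mae_finetuned_vit_base.pth": "1b25e9",
    "mae_finetuned_vit_large.pth": "51f550",
    "mae_finetuned_vit_huge.pth": "2541f2",
}

def md5_prefix(path, length=6, chunk=1 << 20):
    """Compute the first `length` hex characters of a file's md5, streaming in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()[:length]

# Example usage (uncomment once the checkpoints are downloaded):
# for name, prefix in EXPECTED.items():
#     assert md5_prefix(name) == prefix, f"{name} failed the md5 check"
```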
Evaluate ViT-Base on a single GPU (`${IMAGENET_DIR}` is a directory containing `{train, val}` sets of ImageNet):

```
python main_finetune.py --eval --resume mae_finetuned_vit_base.pth --model vit_base_patch16 --batch_size 16 --data_path ${IMAGENET_DIR}
```
This should give:

```
* Acc@1 83.664 Acc@5 96.530 loss 0.731
```
Evaluate ViT-Large:

```
python main_finetune.py --eval --resume mae_finetuned_vit_large.pth --model vit_large_patch16 --batch_size 16 --data_path ${IMAGENET_DIR}
```

This should give:

```
* Acc@1 85.952 Acc@5 97.570 loss 0.646
```
Evaluate ViT-Huge:

```
python main_finetune.py --eval --resume mae_finetuned_vit_huge.pth --model vit_huge_patch14 --batch_size 16 --data_path ${IMAGENET_DIR}
```

This should give:

```
* Acc@1 86.928 Acc@5 98.088 loss 0.584
```
Get our pre-trained checkpoints from here.
To fine-tune with multi-node distributed training, run the following on 4 nodes with 8 GPUs each:
```
python submitit_finetune.py \
    --job_dir ${JOB_DIR} \
    --nodes 4 \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune ${PRETRAIN_CHKPT} \
    --epochs 100 \
    --blr 5e-4 --layer_decay 0.65 \
    --weight_decay 0.05 --drop_path 0.1 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --dist_eval --data_path ${IMAGENET_DIR}
```
- Install submitit (`pip install submitit`) first.
- Effective batch size is 32 (`batch_size` per gpu) * 4 (`nodes`) * 8 (gpus per node) = 1024.
- `blr` is the base learning rate. The actual `lr` is computed by the linear scaling rule: `lr` = `blr` * effective batch size / 256.

Script for ViT-Large:
```
python submitit_finetune.py \
    --job_dir ${JOB_DIR} \
    --nodes 4 --use_volta32 \
    --batch_size 32 \
    --model vit_large_patch16 \
    --finetune ${PRETRAIN_CHKPT} \
    --epochs 50 \
    --blr 1e-3 --layer_decay 0.75 \
    --weight_decay 0.05 --drop_path 0.2 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --dist_eval --data_path ${IMAGENET_DIR}
```
Script for ViT-Huge:
```
python submitit_finetune.py \
    --job_dir ${JOB_DIR} \
    --nodes 8 --use_volta32 \
    --batch_size 16 \
    --model vit_huge_patch14 \
    --finetune ${PRETRAIN_CHKPT} \
    --epochs 50 \
    --blr 1e-3 --layer_decay 0.75 \
    --weight_decay 0.05 --drop_path 0.3 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --dist_eval --data_path ${IMAGENET_DIR}
```
To fine-tune our pre-trained ViT-Base with single-node training, run the following on 1 node with 8 GPUs:
```
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
    --accum_iter 4 \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune ${PRETRAIN_CHKPT} \
    --epochs 100 \
    --blr 5e-4 --layer_decay 0.65 \
    --weight_decay 0.05 --drop_path 0.1 --mixup 0.8 --cutmix 1.0 --reprob 0.25 \
    --dist_eval --data_path ${IMAGENET_DIR}
```
- Effective batch size is 32 (`batch_size` per gpu) * 4 (`accum_iter`) * 8 (gpus) = 1024. `--accum_iter 4` simulates 4 nodes.
- The pre-trained models we provide are trained with *normalized* pixels `--norm_pix_loss` (1600 epochs, Table 3 in the paper). The fine-tuning hyper-parameters are slightly different from the default baseline using *unnormalized* pixels.
- The original MAE implementation was in TensorFlow+TPU with no explicit mixed precision. This re-implementation is in PyTorch+GPU with automatic mixed precision (`torch.cuda.amp`). We have observed different numerical behavior between the two platforms. In this repo, we use `--global_pool` for fine-tuning; using `--cls_token` performs similarly, but there is a chance of producing NaN when fine-tuning ViT-Huge on GPUs. We did not observe this issue on TPUs. Turning off amp could solve this issue, but it is slower.
- Here we use RandErase following DeiT: `--reprob 0.25`. Its effect is smaller than random variance.
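The effective-batch-size bookkeeping above, the linear lr scaling rule and `--accum_iter` trading update frequency for per-step batch size, can be sketched as plain arithmetic. The helper names here are illustrative, not part of the repo:

```python
# Linear scaling rule: lr = blr * effective batch size / 256, where the
# effective batch size is batch_size per GPU * accum_iter * number of GPUs.
def effective_lr(blr, batch_size, accum_iter, num_gpus):
    return blr * (batch_size * accum_iter * num_gpus) / 256

# Single-node ViT-Base recipe: 32 per GPU * accum_iter 4 * 8 GPUs = 1024,
# so lr = 5e-4 * 1024 / 256 = 2e-3, identical to the 4-node recipe.
lr = effective_lr(blr=5e-4, batch_size=32, accum_iter=4, num_gpus=8)

# --accum_iter folds several mini-batches into one optimizer update:
# gradients accumulate over accum_iter backward passes, then one step runs.
def optimizer_steps(num_batches, accum_iter):
    steps = 0
    for i in range(num_batches):
        # loss.backward() would run every iteration, accumulating gradients
        if (i + 1) % accum_iter == 0:
            steps += 1  # optimizer.step(); optimizer.zero_grad()
    return steps
```

With `accum_iter=4`, 8 GPUs take one update every 4 mini-batches, matching the update schedule of 32 GPUs at `accum_iter=1`.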
For linear probing (training a linear classifier on frozen features), run the following on 4 nodes with 8 GPUs each:
```
python submitit_linprobe.py \
    --job_dir ${JOB_DIR} \
    --nodes 4 \
    --batch_size 512 \
    --model vit_base_patch16 --cls_token \
    --finetune ${PRETRAIN_CHKPT} \
    --epochs 90 \
    --blr 0.1 \
    --weight_decay 0.0 \
    --dist_eval --data_path ${IMAGENET_DIR}
```
- Effective batch size is 512 (`batch_size` per gpu) * 4 (`nodes`) * 8 (gpus per node) = 16384.
- `blr` is the base learning rate. The actual `lr` is computed by the linear scaling rule: `lr` = `blr` * effective batch size / 256.
- To train ViT-Large or ViT-Huge, set `--model vit_large_patch16` or `--model vit_huge_patch14`. It is sufficient to train 50 epochs `--epochs 50`.
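As a reminder of what linear probing does, here is a minimal, framework-free sketch: the backbone is frozen and only a linear classifier on top of its features is trained. The random features here are a stand-in for the pre-trained ViT's output; shapes and the plain gradient-descent loop are assumptions of this toy example, not the repo's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, feat_dim, num_classes = 256, 16, 10

# Stand-ins for frozen backbone features; in the real script these come
# from the pre-trained ViT and are never updated.
features = rng.normal(size=(num_samples, feat_dim))
labels = rng.integers(0, num_classes, size=num_samples)

W = np.zeros((feat_dim, num_classes))  # the only trainable parameters
b = np.zeros(num_classes)
lr = 0.1

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_fn():
    p = softmax(features @ W + b)
    return -np.log(p[np.arange(num_samples), labels]).mean()

initial_loss = loss_fn()  # log(10) at zero init: uniform predictions
for _ in range(100):
    g = softmax(features @ W + b)
    g[np.arange(num_samples), labels] -= 1.0  # dL/dlogits for cross-entropy
    W -= lr * features.T @ g / num_samples
    b -= lr * g.mean(axis=0)
final_loss = loss_fn()
```

Only `W` and `b` change during training, which is why linear probing is cheap enough to use a 16384 effective batch size and no weight decay.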
This PT/GPU code produces better results for ViT-L/H (see the table below). This is likely caused by the system difference between TF and PT.
| | ViT-Base | ViT-Large | ViT-Huge |
|---|---|---|---|
| paper (TF/TPU) | 68.0 | 75.8 | 76.6 |
| this repo (PT/GPU) | 67.8 | 76.0 | 77.2 |