# UnIVAL
Project Page | Paper | Demo | Checkpoints
**UnIVAL** is a 0.25B-parameter unified model that is multitask pretrained on image- and video-text data and targets image-, video- and audio-text downstream tasks.
# Online Demos
Check out our demo on Huggingface Spaces: [Spaces](https://huggingface.co/spaces/mshukor/UnIVAL)
`General` means the pretrained model before finetuning.
To easily play with our model, we also provide several notebooks: `VG.ipynb`, `VQA.ipynb`, `Captioning.ipynb`, `Video_Captioning.ipynb`, and `Audio_Captioning.ipynb`.
# News
* **[2023.12]**: The paper is accepted at [TMLR](https://openreview.net/forum?id=4uflhObpcp)!
* **[2023.8.12]**: We provide the scripts to train UnIVAL for audio/video-text tasks.
* **[2023.7.31]**: We provide [here](rewarded_soups.md) more details to reproduce the UnIVAL results on Visual Grounding used in our [Rewarded soups](https://github.com/alexrame/rewardedsoups) work.
* **[2023.7.31]**: Release of UnIVAL code and model weights! We will release the scripts to train and evaluate audio/video tasks later.
# Table of Contents
* [Quantitative Results](#results)
* [Installation](#installation)
* [Datasets and Checkpoints](#datasets-and-checkpoints)
* [Training and Inference](#training-and-inference)
* [Zero-shot Evaluation](#zero-shot-evaluation)
* [Parameter Efficient Finetuning (PEFT): Training only the linear layer](#parameter-efficient-finetuning)
* [Multimodal Model Merging/Weight Interpolation](#multimodal-model-merging)
* [Qualitative results](#qualitative-results)
* [Citation](#citation)
* [Acknowledgment](#acknowledgment)
# Results
Here are some results on several multimodal tasks.
| Task | Dataset | Split | Metric | UnIVAL |
|---|---|---|---|---|
| Visual Grounding | RefCOCO | val / test-a / test-b | Acc. | 89.1 / 91.5 / 85.2 |
| Visual Grounding | RefCOCO+ | val / test-a / test-b | Acc. | 82.2 / 86.9 / 75.3 |
| Visual Grounding | RefCOCOg | val-u / test-u | Acc. | 84.7 / 85.2 |
| Image Captioning | COCO | Karpathy test | CIDEr | 137.0 |
| VQA | VQA v2 | test-dev / test-std | Acc. | 77.0 / 77.1 |
| Visual Entailment | SNLI-VE | val / test | Acc. | 78.2 / 78.6 |
| VideoQA | MSRVTT-QA | test | Acc. | 43.5 |
| Video Captioning | MSRVTT | test | CIDEr | 60.5 |
| Audio Captioning | AudioCaps | test | CIDEr | 71.3 |
# Training and Inference
To resume training from the last checkpoint, pass `checkpoint_last.pt` to `--restore-file`, and pass `--reset-dataloader --reset-meters --reset-optimizer` as arguments.
We use Slurm to launch the training/evaluation.
## Image Processing
In some datasets, the images are encoded as base64 strings.
You can use the following code to do this transformation:
```python
from PIL import Image
from io import BytesIO
import base64
img = Image.open(file_name) # path to file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data) # bytes
base64_str = base64_str.decode("utf-8") # str
```
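When reading the TSV files back, the base64 string has to be decoded into a PIL image again. Here is a minimal sketch of the reverse transformation (the helper name is ours, not from the repo):
```python
from PIL import Image
from io import BytesIO
import base64

# Hypothetical helper: inverse of the snippet above.
def decode_base64_image(base64_str: str) -> Image.Image:
    byte_data = base64.b64decode(base64_str)  # str -> bytes
    img = Image.open(BytesIO(byte_data))      # bytes -> PIL image
    return img.convert("RGB")                 # normalize the color mode
```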
## Pretraining
The pretraining data are provided as tsv files. The folder `negative_sample` contains three files: `all_captions.txt`, `object.txt` and `type2ans.json`. The data in these files are used as negative samples for the image/video-text matching task.
There are 3 scripts to train UnIVAL: `unival_s1.sh` for stage-1 training, initialized from BART weights; `unival_s2.sh` for stage-2 training, initialized from the weights after stage 1; and `unival_s2_hs.sh` for high-resolution training during 1 epoch, initialized from the weights of stage 2. For example, to launch stage 1:
`cd run_scripts/pretraining && bash unival_s1.sh`
## Image Captioning
Each image corresponds to only 1 caption in `caption_stage1_train.tsv` and to multiple captions in the other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample with the following format: the uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used) and image base64 string are separated by tabs.
`162365 12455 the sun sets over the trees beyond some docks. sky&&water&&dock&&pole /9j/4AAQSkZJ....UCP/2Q==`
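As an illustration, a line like the one above can be split into its fields as follows (a minimal sketch; `parse_caption_line` is a hypothetical helper, and the field order follows the description above):
```python
# Hypothetical helper: split one caption TSV line into its tab-separated
# fields (uniq-id, image-id, caption, object labels, image base64 string).
def parse_caption_line(line: str):
    uniq_id, image_id, caption, labels, img_b64 = line.rstrip("\n").split("\t")
    objects = labels.split("&&")  # labels are joined with "&&"
    return uniq_id, image_id, caption, objects, img_b64
```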
To finetune for image captioning:
`cd run_scripts/caption && sh unival_caption_stage_1.sh > unival_caption_stage_1.out`
You can use the following command for inference, after setting the right weights path:
`cd run_scripts/caption/eval ; sh eval_caption.sh`  # inference & evaluate
## Open-Ended VQA
Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and utilize the other samples for training. Each line of the dataset represents a VQA sample with the following format: the question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL; brings around +0.1 accuracy improvement) and image base64 string are separated by tabs.
`79459 79459 is this person wearing shorts? 0.6|!+no house&&short&&...&&sky /9j/4AAQS...tigZ/9k=`
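The answer field in the example above packs the confidence and the answer together (`0.6|!+no`). A minimal sketch of how such a line could be parsed (`parse_vqa_line` is a hypothetical helper; the field order follows the description above):
```python
# Hypothetical helper: split one VQA TSV line into its tab-separated fields
# (question-id, image-id, question, answer with confidence, object labels,
# image base64 string).
def parse_vqa_line(line: str):
    qid, image_id, question, answer_field, labels, img_b64 = line.rstrip("\n").split("\t")
    confidence, answer = answer_field.split("|!+")  # e.g. "0.6|!+no"
    return qid, image_id, question, float(confidence), answer, img_b64
```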
(Optional, but achieves better finetuning accuracy): if the disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance:
``cd dataset/vqa_data && ln vqa_train.tsv vqa_train_1.tsv && for idx in `seq 1 9`; do shuf vqa_train_${idx}.tsv > vqa_train_$[${idx}+1].tsv; done``  # each file is used for an epoch
If you have shuffled the training data in the previous step, please correctly specify the training data path following the guide in the script comments.
`cd run_scripts/vqa && bash unival_vqa.sh`
We use beam-search during inference.
`cd run_scripts/vqa/eval && bash evaluate_vqa.sh`  # specify 'val' or 'test' in the script
## Visual Grounding
We use the RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample with the following format: the uniq-id, image-id, text, region-coord (separated by commas) and image base64 string are separated by tabs.
`79_1 237367 A woman in a white blouse holding a glass of wine. 230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=`
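A minimal sketch of parsing such a line into a bounding box (`parse_grounding_line` is a hypothetical helper; we assume the region is stored as x0,y0,x1,y1 pixel coordinates, as suggested by the example above):
```python
# Hypothetical helper: split one visual grounding TSV line into its fields
# (uniq-id, image-id, text, region-coord, image base64 string).
def parse_grounding_line(line: str):
    uniq_id, image_id, text, region, img_b64 = line.rstrip("\n").split("\t")
    # Assumption: region stored as "x0,y0,x1,y1" in pixels.
    x0, y0, x1, y1 = map(float, region.split(","))
    return uniq_id, image_id, text, (x0, y0, x1, y1), img_b64
```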
`cd run_scripts/refcoco`, then:
* `sh unival_refcoco.sh > train_refcoco.out &`  # finetune for RefCOCO
* `sh unival_refcocoplus.sh > train_refcocoplus.out &`  # finetune for RefCOCO+
* `sh unival_refcocog.sh > train_refcocog.out &`  # finetune for RefCOCOg
Run the following commands for the evaluation.
`cd run_scripts/refcoco/eval ; sh eva_refcoco.sh`  # use `eva_refcocog.sh` or `eva_refcocoplus.sh` for the other datasets
## Visual Entailment
Each line of the processed dataset represents a sample with the following format: the uniq-id, image-id, image base64 string, hypothesis, caption (or text premise) and label are separated by tabs.
`252244149.jpg#1r1n 252244149 /9j/4AAQ...MD/2Q== a man in pink and gold is chewing on a wooden toothpick. a man in pink is chewing a toothpick on the subway. neutral`
Contrary to previous work (e.g. OFA), we do not use the text premise for this task.
`cd run_scripts/snli_ve && nohup sh unival_snli_ve.sh > train_snli_ve.out &`  # finetune for SNLI-VE
Run the following command to obtain the results.
`cd run_scripts/snli_ve/eval ; sh eval_snli_ve.sh`  # specify 'dev' or 'test' in the script
## Text-to-Image Generation
The dataset zipfile `coco_image_gen.zip` contains `coco_vqgan_train.tsv`, `coco_vqgan_dev.tsv` and `coco_vqgan_full_test.tsv`. Each line of the dataset represents a sample with the following format: the uniq-id, image code (produced by VQGAN, a list of integers separated by single whitespaces) and lowercased caption are separated by tabs.
`1 6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846 the people are posing for a group photo.`
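A minimal sketch of reading one such line back into a list of discrete image tokens (`parse_image_gen_line` is a hypothetical helper; the field order follows the description above):
```python
# Hypothetical helper: split one image-generation TSV line into its fields
# (uniq-id, image code, lowercased caption).
def parse_image_gen_line(line: str):
    uniq_id, code_str, caption = line.rstrip("\n").split("\t")
    codes = [int(tok) for tok in code_str.split()]  # discrete VQGAN codes
    return uniq_id, codes, caption
```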
The checkpoint zipfile `image_gen_large_best.zip` contains `image_gen_large_best.pt`, `vqgan/last.ckpt`, `vqgan/model.yaml` and `clip/Vit-B-16.pt`.
We divide the finetuning process for image generation into two stages. In stage 1, we finetune the model with cross-entropy loss. In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization. During validation, the generated images are dumped into _GEN_IMAGE_PATH_.
`cd run_scripts/image_gen`, then:
* `nohup sh unival_image_gen_stage_1.sh`  # stage 1: train with cross-entropy loss
* `nohup sh unival_image_gen_stage_2.sh`  # stage 2: load the last ckpt of stage 1 and train with CLIP Score optimization
Run the command below to generate your images.
`cd run_scripts/image_gen/eval ; sh eval_image_gen.sh`  # inference & evaluate (FID, IS and CLIP Score)
# Zero-shot Evaluation
* Caption on NoCaps: `caption/eval/eval_nocaps.sh`
* VQA on VizWiz: `vqa/eval/eval_vizwiz.sh`
* VQA on OK-VQA: `vqa/eval/eval_okvqa.sh`
# Parameter Efficient Finetuning
To train only the linear layer while keeping the rest of the model frozen, you can use the following scripts:
* Image Captioning on COCO: `caption/onlylinear/unival_caption_stage_s2_onlylinear.sh`
* Video Captioning on MSRVTT: `caption/onlylinear/unival_video_caption_stage_s2_onlylinear.sh`
* Audio Captioning on AudioCaps: `caption/onlylinear/unival_audio_caption_stage_s2_onlylinear.sh`
* VQA on VQAv2: `vqa/onlylinear/unival_vqa_s2_onlylinear.sh`
* Video QA on MSRVTT: `vqa/onlylinear/unival_video_vqa_s2_onlylinear.sh`
To finetune the stage-1 pretrained model, you can use the scripts with `s1`.
# Citation
If you find this work helpful, please cite it as follows:
```
@article{
shukor2023unival,
title={Un{IVAL}: Unified Model for Image, Video, Audio and Language Tasks},
author={Mustafa Shukor and Corentin Dancette and Alexandre Rame and Matthieu Cord},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=4uflhObpcp},
note={}
}
```
# Acknowledgment
This code is based mainly on the following repos:
* [OFA](https://github.com/OFA-Sys/OFA)
* [Fairseq](https://github.com/pytorch/fairseq)
* [taming-transformers](https://github.com/CompVis/taming-transformers)
We thank the authors for releasing their code.