# UnIVAL

Project Page | Paper | Demo | Checkpoints

**UnIVAL** is a 0.25B-parameter unified model that is multitask-pretrained on image-text and video-text data and targets image-, video- and audio-text downstream tasks.

# Online Demos

Check out our demo on Hugging Face Spaces: [Spaces](https://huggingface.co/spaces/mshukor/UnIVAL)

`General` refers to the pretrained model before finetuning. To make the model easy to try, we also provide several notebooks: `VG.ipynb`, `VQA.ipynb`, `Captioning.ipynb`, `Video_Captioning.ipynb`, and `Audio_Captioning.ipynb`.

# News

* **[2023.12]**: The paper is accepted at [TMLR](https://openreview.net/forum?id=4uflhObpcp&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions))!
* **[2023.8.12]**: We provide the scripts to train UnIVAL for audio/video-text tasks.
* **[2023.7.31]**: We provide more details [here](rewarded_soups.md) to reproduce the results with UnIVAL on Visual Grounding used in our [Rewarded soups](https://github.com/alexrame/rewardedsoups) work.
* **[2023.7.31]**: Release of the UnIVAL code and model weights! We will release the scripts to train and evaluate audio/video tasks later.

# Table of Contents

* [Quantitative Results](#results)
* [Installation](#installation)
* [Datasets and Checkpoints](#datasets-and-checkpoints)
* [Training and Inference](#training-and-inference)
* [Zero-shot Evaluation](#zero-shot-evaluation)
* [Parameter Efficient Finetuning (PEFT): Training only the linear layer](#parameter-efficient-finetuning)
* [Multimodal Model Merging/Weight Interpolation](#multimodal-model-merging)
* [Qualitative results](#qualitative-results)
* [Citation](#citation)
* [Acknowledgment](#acknowledgment)

# Results

Here are some results on several multimodal tasks.

| Task | Visual Grounding | Visual Grounding | Visual Grounding | Image Captioning | VQA | Visual Entailment | VideoQA | Video Captioning | Audio Captioning |
|---|---|---|---|---|---|---|---|---|---|
| Dataset | RefCOCO | RefCOCO+ | RefCOCOg | COCO | VQA v2 | SNLI-VE | MSRVTT-QA | MSRVTT | AudioCaps |
| Split | val/test-a/test-b | val/test-a/test-b | val-u/test-u | Karpathy test | test-dev/test-std | val/test | test | test | test |
| Metric | Acc. | Acc. | Acc. | CIDEr | Acc. | Acc. | Acc. | CIDEr | CIDEr |
| UnIVAL | 89.1 / 91.5 / 85.2 | 82.2 / 86.9 / 75.3 | 84.7 / 85.2 | 137.0 | 77.0 / 77.1 | 78.2 / 78.6 | 43.5 | 60.5 | 71.3 |

# Installation

## Requirements

* python 3.7.4
* pytorch 1.13+
* torchvision 0.14.1+
* JAVA 1.8 (for COCO evaluation)

We recommend installing pytorch before the other libraries:

```bash
git clone https://github.com/mshukor/UnIVAL.git
cd UnIVAL
pip install -r requirements.txt
```

Download the following model for captioning evaluation:

```bash
python -c "from pycocoevalcap.spice.spice import Spice; tmp = Spice()"
```

# Datasets and Checkpoints

See [datasets.md](datasets.md) and [checkpoints.md](checkpoints.md).

# Training and Inference

The scripts to launch pretraining, finetuning and evaluation can be found in the `run_scripts/` folder. Below we provide more details. The data are stored in `.tsv` files whose format depends on the training task. To restore training, pass the last checkpoint `checkpoint_last.pt` to `--restore-file` and add `--reset-dataloader --reset-meters --reset-optimizer` as arguments. We use slurm to launch training and evaluation.

## Image Processing

In some datasets, the images are encoded as base64 strings. You can do this conversion with the following code:

```python
from PIL import Image
from io import BytesIO
import base64

file_name = "path/to/image.jpg"  # placeholder: path to the image file

img = Image.open(file_name)
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)  # re-encode in the image's original format
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes
base64_str = base64_str.decode("utf-8")  # str
```

## Pretraining


1. Prepare the Dataset

The format of the pretraining tsv files is as follows:

Each line contains uniq-id, image/video path, caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). The files are prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering. In addition, the folder `negative_sample` contains three files: `all_captions.txt`, `object.txt` and `type2ans.json`. The data in these files are used as negative samples for the image/video-text matching task.
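As a minimal sketch of the row layout described above, the snippet below splits one tab-separated line into its eight fields. The field order follows the description; the class name, field names and the example row are hypothetical and are not taken from the repository or its data.

```python
from dataclasses import dataclass


# Field order follows the description above; the names are illustrative only.
@dataclass
class PretrainRow:
    uniq_id: str
    path: str          # image or video path
    caption: str
    question: str
    answer: str
    gt_objects: str    # objects appearing in the caption or question
    dataset_name: str  # source of the data
    task_type: str     # caption, qa or visual grounding


def parse_pretrain_line(line: str) -> PretrainRow:
    """Split one tab-separated line into its eight named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    return PretrainRow(*fields)


# Made-up example row (not real data); empty question/answer are an assumption.
example = "\t".join(
    ["0001", "images/0001.jpg", "a dog on a beach", "", "", "dog&&beach", "demo_set", "caption"]
)
row = parse_pretrain_line(example)
print(row.task_type, row.path)
```

Checking the field count up front makes malformed lines fail early instead of silently shifting columns.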

2. Pretraining

There are three scripts to train UnIVAL: `unival_s1.sh` for stage 1 training, initialized from BART weights; `unival_s2.sh` for stage 2 training, initialized from the weights after stage 1; and `unival_s2_hs.sh` for high-resolution training for 1 epoch, initialized from the weights of stage 2. For example, to launch stage 1:

```bash
cd run_scripts/pretraining
bash unival_s1.sh
```

## Image Captioning

1. Prepare the Dataset & Checkpoints

Each image corresponds to only 1 caption in `caption_stage1_train.tsv` and to multiple captions in the other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample with the following format. The fields uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used) and image base64 string are separated by tabs.

```
162365  12455   the sun sets over the trees beyond some docks.  sky&&water&&dock&&pole  /9j/4AAQSkZJ....UCP/2Q==
```
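To make the caption row format concrete, here is a small sketch that splits a line into its five fields and decodes the base64 image back to raw bytes. The function name is hypothetical, and the image field in the example holds only a few bytes of a JPEG header rather than a real image.

```python
import base64


def parse_caption_line(line: str) -> dict:
    """Split a caption TSV line into named fields and decode the image bytes."""
    uniq_id, image_id, caption, objects, image_b64 = line.rstrip("\n").split("\t")
    return {
        "uniq_id": uniq_id,
        "image_id": image_id,
        "caption": caption,
        "objects": objects.split("&&"),  # object labels are joined with '&&'
        "image_bytes": base64.b64decode(image_b64),
    }


# Made-up image payload: a JPEG start marker followed by padding, not a real image.
fake_image = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("utf-8")
line = "\t".join(
    [
        "162365",
        "12455",
        "the sun sets over the trees beyond some docks.",
        "sky&&water&&dock&&pole",
        fake_image,
    ]
)
sample = parse_caption_line(line)
print(sample["objects"], len(sample["image_bytes"]))
```

With a real row, the decoded bytes can be wrapped in `io.BytesIO` and opened with `PIL.Image.open` to recover the image.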

2. Finetuning

To finetune for image captioning:

```bash
cd run_scripts/caption
sh unival_caption_stage_1.sh > unival_caption_stage_1.out
```

3. Inference

You can use the following code for inference, after setting the right weights path:

```bash
cd run_scripts/caption/eval ; sh eval_caption.sh  # inference & evaluate
```

## Visual Question Answering

1. Prepare the Dataset & Checkpoints

Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and use the other samples for training. Each line of the dataset represents a VQA sample with the following format. The fields question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL, they bring around +0.1 accuracy improvement) and image base64 string are separated by tabs.

```
79459   79459   is this person wearing shorts?  0.6|!+no    house&&short&&...&&sky  /9j/4AAQS...tigZ/9k=
```
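The answer field in the sample above packs a confidence and an answer as `conf|!+answer`. The sketch below parses that field into a mapping. The function name is hypothetical, and splitting several answers on `&&` is an assumption borrowed from the object-label field, since the sample row contains a single answer.

```python
def parse_answers(field: str) -> dict:
    """Parse 'conf|!+answer' entries into an {answer: confidence} mapping.

    Splitting on '&&' for several answers is an assumption; the sample row
    shown above contains only one answer.
    """
    answers = {}
    for item in field.split("&&"):
        conf, answer = item.split("|!+", 1)
        answers[answer] = float(conf)
    return answers


print(parse_answers("0.6|!+no"))
```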

2. Shuffle the Training Data

(Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance.

```bash
cd dataset/vqa_data
ln vqa_train.tsv vqa_train_1.tsv
# each file is used for one epoch
for idx in $(seq 1 9); do shuf vqa_train_${idx}.tsv > vqa_train_$((idx + 1)).tsv; done
```
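For environments without `shuf`, the same per-epoch shuffling can be sketched in Python. This is only an illustration: the function name and the seed parameter are hypothetical, the first file is a plain copy rather than a hard link, and the whole file is read into memory, which may not suit very large TSV files.

```python
import random
import tempfile
from pathlib import Path


def write_epoch_shuffles(src: Path, n_epochs: int, seed: int = 0) -> list:
    """Write src as <stem>_1<suffix>, then one reshuffled copy per later epoch."""
    lines = src.read_text(encoding="utf-8").splitlines()
    rng = random.Random(seed)  # fixed seed so the shuffles are reproducible
    outputs = []
    for epoch in range(1, n_epochs + 1):
        if epoch > 1:
            rng.shuffle(lines)  # reshuffle the previous epoch's order in place
        out = src.with_name(f"{src.stem}_{epoch}{src.suffix}")
        out.write_text("\n".join(lines) + "\n", encoding="utf-8")
        outputs.append(out)
    return outputs


# Small demo on a throwaway file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_src = Path(tmp) / "vqa_train.tsv"
    demo_src.write_text("a\nb\nc\nd\n", encoding="utf-8")
    demo_files = write_epoch_shuffles(demo_src, n_epochs=3)
    demo_names = [f.name for f in demo_files]
    print(demo_names)
```

Calling it with `n_epochs=10` would produce the same ten file names as the shell loop, `vqa_train_1.tsv` to `vqa_train_10.tsv`.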

3. Finetuning

If you have shuffled the training data in the previous step, please correctly specify the training data path following the guide in the script comments.

```bash
cd run_scripts/vqa
bash unival_vqa.sh
```

4. Inference

We use beam-search during inference.

```bash
cd run_scripts/vqa/eval
bash evaluate_vqa.sh  # specify 'val' or 'test' in the script
```

## Visual Grounding

1. Prepare the Dataset & Checkpoints

We use the RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, text, region-coord (separated by commas) and image base64 string are separated by tabs.

```
79_1    237367  A woman in a white blouse holding a glass of wine.  230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=
```
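The region-coord field can be turned into a numeric box with a few lines of Python. The sketch below assumes the four values are the corner coordinates `x0,y0,x1,y1`, which this README does not state explicitly. It also includes an intersection-over-union helper, since grounding accuracy is commonly computed by thresholding IoU (often at 0.5); that criterion is an assumption here and is not confirmed by the text above. Both function names are hypothetical.

```python
def parse_region(coord: str) -> tuple:
    """Parse 'x0,y0,x1,y1' into four floats (the corner order is an assumption)."""
    x0, y0, x1, y1 = (float(v) for v in coord.split(","))
    return x0, y0, x1, y1


def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


gt = parse_region("230.79,121.75,423.66,463.06")
shifted = (gt[0] + 10.0, gt[1], gt[2] + 10.0, gt[3])  # made-up prediction
print(gt, iou(gt, shifted))
```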

2. Finetuning

```bash
cd run_scripts/refcoco
sh unival_refcoco.sh > train_refcoco.out &  # finetune for refcoco
sh unival_refcocoplus.sh > train_refcocoplus.out &  # finetune for refcoco+
sh unival_refcocog.sh > train_refcocog.out &  # finetune for refcocog
```

3. Inference

Run the following commands for the evaluation.

```bash
cd run_scripts/refcoco/eval ; sh eva_refcoco.sh  # eva_refcocog.sh, eva_refcocoplus.sh
```

## Visual Entailment

1. Prepare the Dataset & Checkpoints

Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, image base64 string, hypothesis, caption (or text premise) and label are separated by tabs.

```
252244149.jpg#1r1n  252244149   /9j/4AAQ...MD/2Q==   a man in pink and gold is chewing on a wooden toothpick.   a man in pink is chewing a toothpick on the subway.   neutral
```

2. Finetuning

Contrary to previous work (e.g. OFA) we do not use the text premise for this task.

```bash
cd run_scripts/snli_ve
nohup sh unival_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve
```

3. Inference

Run the following command to obtain the results.

```bash
cd run_scripts/snli_ve/eval ; sh eval_snli_ve.sh  # specify 'dev' or 'test' in the script
```

## Text-to-Image Generation

1. Prepare the Dataset & Checkpoints

The dataset zipfile `coco_image_gen.zip` contains `coco_vqgan_train.tsv`, `coco_vqgan_dev.tsv` and `coco_vqgan_full_test.tsv`. Each line of the dataset represents a sample with the following format. The fields uniq-id, image-code (produced by vqgan, a list of integers separated by single spaces) and lowercased caption are separated by tabs.

```
1	6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846	the people are posing for a group photo.
```

The checkpoint zipfile image_gen_large_best.zip contains image_gen_large_best.pt, vqgan/last.ckpt, vqgan/model.yaml and clip/Vit-B-16.pt.

2. Finetuning

We divide the finetuning process of image generation into two stages. In stage 1, we finetune OFA with cross-entropy loss. In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization. During validation, the generated images are dumped into `_GEN_IMAGE_PATH_`.

```bash
cd run_scripts/image_gen
nohup sh unival_image_gen_stage_1.sh # stage 1, train with cross-entropy loss
nohup sh unival_image_gen_stage_2.sh # stage 2, load the last ckpt of stage1 and train with CLIP Score optimization
```

3. Inference

Run the command below to generate your images.

```bash
cd run_scripts/image_gen/eval ; sh eval_image_gen.sh  # inference & evaluate (FID, IS and CLIP Score)
```

# Zero-shot Evaluation

Here we provide the scripts for zero-shot evaluation on image-text tasks. You need to specify the path to the pretrained model in each of these scripts:

* Image Caption on Nocaps: `caption/eval/eval_nocaps.sh`
* VQA on VizWiz: `vqa/eval/eval_vizwiz.sh`
* VQA on OKVQA: `vqa/eval/eval_okvqa.sh`

# Parameter Efficient Finetuning

## Training only the linear connection

Following [eP-ALM](https://github.com/mshukor/eP-ALM), we experiment with efficient finetuning by training only the linear connection between the modality-specific encoders and the language model, while keeping all other parameters frozen:

* Image Caption on COCO: `caption/onlylinear/unival_caption_stage_s2_onlylinear.sh`
* Video Caption on MSRVTT: `caption/onlylinear/unival_video_caption_stage_s2_onlylinear.sh`
* Audio Caption on Audiocaps: `caption/onlylinear/unival_audio_caption_stage_s2_onlylinear.sh`
* VQA on VQAv2: `vqa/onlylinear/unival_vqa_s2_onlylinear.sh`
* Video QA on MSRVTT: `vqa/onlylinear/unival_video_vqa_s2_onlylinear.sh`

To finetune the stage-1 pretrained model, you can use the scripts with `s1`.

# Multimodal Model Merging

In this section we provide the details to reproduce the weight interpolation and weight averaging experiments. The objective is to leverage the synergy between models finetuned on different multimodal tasks.

## Weight interpolation

To average several models, you can use `preprocess/average_save_models.py`. There are two options: either you average many models with a uniform interpolation coefficient, or you interpolate between 2 models with an interpolation coefficient from 0 to 1. You can also customise this script as you like. Once you have saved the interpolated weights, you can use the following scripts to evaluate the model:

```
## image-text tasks
sh caption/eval/eval_caption_avg.sh
sh refcoco/eval/eval_refcocoplus_avg.sh
sh snli_ve/eval/eval_snli_ve_avg.sh
sh vqa/eval/eval_vqa_avg.sh

## video-text tasks
sh vqa/eval/video/eval_video_qa_avg.sh
sh caption/eval/video/eval_msrvtt_video_caption_avg.sh
```

## Ratatouille Finetuning

For [Ratatouille finetuning](https://github.com/facebookresearch/ModelRatatouille), each of the auxiliary models (e.g. models finetuned for captioning, vqa, visual grounding and visual entailment) is re-finetuned on the target task. At the end, all obtained models are uniformly averaged. The scripts to launch the finetuning and evaluation are in `averaging/ratatouille/`. You also need to use the weight averaging script in `preprocess/average_save_models.py`.

## Fusing Finetuning

For [Fusing finetuning](https://arxiv.org/abs/2204.03044), the auxiliary models are first averaged, then finetuned on the target task. The scripts to launch the finetuning and evaluation are in `averaging/fusing/`.
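As a rough illustration of the two averaging options described under Weight interpolation, the sketch below interpolates between two models and uniformly averages a list of models. It does not reproduce `preprocess/average_save_models.py`: the function names are hypothetical, and the toy "checkpoints" map parameter names to plain lists of floats, whereas real checkpoints store tensors.

```python
def interpolate(state_a: dict, state_b: dict, lam: float) -> dict:
    """Return (1 - lam) * state_a + lam * state_b, parameter by parameter."""
    if state_a.keys() != state_b.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - lam) * a + lam * b for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }


def uniform_average(states: list) -> dict:
    """Average several models with equal coefficients 1 / len(states)."""
    n = len(states)
    return {
        name: [sum(vals) / n for vals in zip(*(s[name] for s in states))]
        for name in states[0]
    }


# Toy parameters with made-up values, standing in for two finetuned models.
caption_model = {"layer.weight": [0.0, 2.0]}
vqa_model = {"layer.weight": [4.0, 6.0]}
print(interpolate(caption_model, vqa_model, 0.5))
print(uniform_average([caption_model, vqa_model]))
```

With `lam=0` the result equals the first model and with `lam=1` the second, so sweeping `lam` from 0 to 1 traces the interpolation path between the two sets of weights.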

# Qualitative Results

Below we provide qualitative results for some tasks.

## Visual Grounding

## Image Captioning

## Open-Ended VQA

# Citation

If you find the work helpful, you can cite it using the following citation:

```
@article{
shukor2023unival,
title={Un{IVAL}: Unified Model for Image, Video, Audio and Language Tasks},
author={Mustafa Shukor and Corentin Dancette and Alexandre Rame and Matthieu Cord},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=4uflhObpcp},
note={}
}
```

# Acknowledgment

This code is based mainly on the following repos:

* [OFA](https://github.com/OFA-Sys/OFA)
* [Fairseq](https://github.com/pytorch/fairseq)
* [taming-transformers](https://github.com/CompVis/taming-transformers)

We thank the authors for releasing their code.