We also have other multimodal continual instruction tuning projects that may interest you.
> [**HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model**](https://arxiv.org/pdf/2503.12941)
> Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
> [[Code]](https://github.com/Ghy0501/HiDe-LLaVA) [[Paper]](https://arxiv.org/pdf/2503.12941)

> [**Federated Continual Instruction Tuning**](https://arxiv.org/pdf/2503.12897)
> Haiyang Guo, Fanhu Zeng, Fei Zhu, Wenzhuo Liu, Da-Han Wang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu
> [[Code]](https://github.com/Ghy0501/FCIT) [[Paper]](https://arxiv.org/pdf/2503.12897)

> [**ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt**](https://arxiv.org/pdf/2410.05849)
> Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, Cheng-Lin Liu
> [[Code]](https://github.com/AuroraZengfh/ModalPrompt) [[Paper]](https://arxiv.org/pdf/2410.05849)

> [**Continual Learning for Generative AI: From LLMs to MLLMs and Beyond**](https://arxiv.org/pdf/2506.13045)
> Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
> [[Code]](https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models) [[Paper]](https://arxiv.org/pdf/2506.13045)

> [**MLLM-CL: Continual Learning for Multimodal Large Language Models**](https://arxiv.org/pdf/2506.05453)
> Hongbo Zhao, Fei Zhu, Haiyang Guo, Meng Wang, Rundong Wang, Gaofeng Meng, Zhaoxiang Zhang
> [[Code]](https://github.com/bjzhb666/MLLM-CL) [[Paper]](https://arxiv.org/pdf/2506.05453)

> [**LLaVA-c: Continual Improved Visual Instruction Tuning**](https://arxiv.org/pdf/2506.08666?)
> Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, Cheng-Lin Liu
> [[Paper]](https://arxiv.org/pdf/2506.08666?)
## News
- **[2026.1.2]** We have updated the [MCITlib](https://arxiv.org/pdf/2508.07307) paper with the latest results. Please feel free to check it out.
- **[2025.10.14]** **MCITlib-v2** has been released! The latest version includes training and testing code for **8 mainstream multimodal continual instruction tuning methods**, compatible with **2 base models** and **3 continual instruction tuning datasets**.
- **[2025.09.16]** We have released a new version of the [paper](https://arxiv.org/pdf/2508.07307) and attached the accuracy matrix of each method for reference. :tada:
- **[2025.08.12]** Initial [MCITlib](https://arxiv.org/pdf/2508.07307) paper released! :tada:
- **[2025.08.10]** Initial version of MCITlib is released. :tada:
## Methods Provided
- `LoRA-FT`: Baseline method that simply updates LoRA parameters on new tasks. [[Paper]](https://arxiv.org/pdf/2106.09685v1/1000)
- `O-LoRA`: Orthogonal subspace learning for language model continual learning. [[Paper]](https://arxiv.org/pdf/2310.14152) 
- `MoELoRA`: CoIN: A Benchmark of Continual Instruction Tuning for Multimodal Large Language Models [[Paper]](https://proceedings.neurips.cc/paper_files/paper/2024/file/6a45500d9eda640deed90d8a62742be5-Paper-Datasets_and_Benchmarks_Track.pdf) 
- `ModalPrompt`: ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models [[Paper]](https://arxiv.org/pdf/2410.05849) 
- `CL-MoE`: CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering [[Paper]](https://arxiv.org/pdf/2503.00413?) 
- `HiDe`: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model [[Paper]](https://arxiv.org/pdf/2503.12941?) 
- `SEFE`: SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning [[Paper]](https://arxiv.org/pdf/2505.02486?) 
- `DISCO`: Federated Continual Instruction Tuning [[Paper]](https://arxiv.org/pdf/2503.12897) 
## Benchmarks
We currently report results on the [UCIT](https://github.com/Ghy0501/HiDe-LLaVA), [MLLM-DCL](https://github.com/bjzhb666/MLLM-CL) and [MLLM-ACL](https://github.com/bjzhb666/MLLM-CL) benchmarks. Please refer to the provided links to download the corresponding images and instruction sets, and organize them in the following directory structure:
```
|-- your_path
    |-- Domain_data
        |-- AD
        |-- Med
        |-- RS
        |-- Sci
        |-- Fin
    |-- Ability_data
        |-- OCR
        |-- OCR_test
        |-- Math
        |-- Math_test
        |-- APP
        |-- APP_test
        |-- VP
        |-- VP_test
    |-- UCIT
        |-- datasets
            |-- ArxivQA
            |-- CLEVR-Math
            |-- Flickr30k
            |-- IconQA
            |-- ImageNet-R
            |-- VizWiz
```
Note: You need to change the data paths in all scripts to your own paths.
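As a quick sanity check before editing any script, a minimal shell sketch along these lines (hypothetical, with `/your_path` standing in for your actual data root) can confirm that the expected benchmark folders are in place:
```
DATA_ROOT=/your_path   # replace with your actual data root

# Report any expected benchmark folder that is missing under the data root;
# extend the list with the remaining subfolders shown above as needed.
for d in Domain_data Ability_data UCIT/datasets \
         Domain_data/AD Domain_data/Med Domain_data/RS Domain_data/Sci Domain_data/Fin; do
    [ -d "$DATA_ROOT/$d" ] || echo "Missing: $DATA_ROOT/$d"
done
```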
## Models
We currently provide reproductions based on the [LLaVA-1.5-7B](https://github.com/haotian-liu/LLaVA) and [InternVL-Chat-7B](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat_llava) models. Please download them to your local directory:
```
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir /your_path/llava-v1.5-7b
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir /your_path/clip-vit-large-patch14-336
huggingface-cli download OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B --local-dir /your_path/Internvl-chat-7b
huggingface-cli download OpenGVLab/InternViT-6B-224px --local-dir /your_path/InternViT-6B-224px
```
We also plan to extend our reproduction to other MLLM architectures in the near future.
Note: To meet the requirements of certain methods, we need to apply additional processing to the config files of the downloaded models. The details are outlined below:
1. Add `"mm_text_select_layer": -1` and `"mm_text_tower": "/your_path/clip-vit-large-patch14-336"` to the `config.py` in your local model weight paths `/your_path/llava-v1.5-7b` and `/your_path/Internvl-chat-7b`.
2. Remove `"temperature": 0.9` and `"top_p": 0.6` from the `generation_config.json` in your local model weight path.
We provide reference `config.py` and `generation_config.json` files in `examples`.
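As a rough sketch of step 2 (assuming `jq` is available; otherwise compare your files against the references in `examples` and edit them by hand), the sampling keys can be dropped from `generation_config.json` like this:
```
MODEL_DIR=/your_path/llava-v1.5-7b   # repeat for /your_path/Internvl-chat-7b

# Remove the sampling parameters from generation_config.json in place.
jq 'del(.temperature, .top_p)' "$MODEL_DIR/generation_config.json" > tmp.json \
    && mv tmp.json "$MODEL_DIR/generation_config.json"
```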
## How to run
Note: Our experiments were conducted in a CUDA 11.8 environment, and most libraries in the setup are aligned with this CUDA version. We therefore recommend running `nvcc -V` to check the CUDA version on your server; if it does not match, please install CUDA 11.8 before proceeding.
### 1. Clone this repository
```
git clone https://github.com/Ghy0501/MCITlib.git
cd MCITlib
```
### 2. Install Package
```
conda create -n MCITlib python=3.10 -y
conda activate MCITlib
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
cd LLaVA/LoRA-FT
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
```
For [flash-attn](https://github.com/Dao-AILab/flash-attention/releases), we recommend downloading the version 2.6.3 wheel that matches your CUDA and PyTorch versions from the official repository, placing it in a local directory, and installing it manually. For example:
```
pip install flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
We also provide an `environment.yml` file to help users identify missing dependencies and version mismatches. However, due to potential library conflicts, automatic installation may fail for certain packages; we therefore recommend installing these manually based on the reported error messages and version specifications. For essential evaluation-related dependencies, please refer to the [UCIT](https://github.com/Ghy0501/HiDe-LLaVA) and [MLLM-CL](https://github.com/bjzhb666/MLLM-CL) repositories.
### 3. Modify path and parameter settings
Before running, please set all model paths to your local paths. The paths that need to be modified are listed below; don't forget to update the dataset path as well.
- Change `/mnt/haiyangguo/mywork/CL-MLLM/MCITlib_v2` to `/your_path/MCITlib`.
- Change `/mnt/haiyangguo/mywork/FCIT/pre_trained/llava-v1.5-7b` to `/your_path/llava-v1.5-7b`.
- Change `/mnt/haiyangguo/mywork/CL-MLLM/pre_trained/Internvl-chat-7b` to `/your_path/Internvl-chat-7b`.
- Change `/mnt/ShareDB_6TB/models/clip-vit-large-patch14-336` to `/your_path/clip-vit-large-patch14-336`.
- Change `/mnt/ShareDB_6TB/models/InternViT-6B-224px` to `/your_path/InternViT-6B-224px`.
- Change `/mnt/ShareDB_6TB/datasets/MLLM_CL/checkpoint` to `/your_path/checkpoint`.
After adjusting the paths, users can modify parameters such as `gpu_num` according to their actual operating environment. All parameter settings are collected in the `configs/` folder.
Note: We recommend using the `Find in Folder` command in VS Code for search and replace operations.
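If you prefer the command line, a recursive search-and-replace sketch along these lines can do the same job (the old path is one of the placeholders shipped in the repo; `/your_path` stands in for your own location):
```
cd /your_path/MCITlib

# Replace one shipped placeholder path with your local path across all files;
# repeat for each of the placeholder paths listed above.
grep -rl '/mnt/haiyangguo/mywork/FCIT/pre_trained/llava-v1.5-7b' . | \
    xargs sed -i 's#/mnt/haiyangguo/mywork/FCIT/pre_trained/llava-v1.5-7b#/your_path/llava-v1.5-7b#g'
```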
### 4. Training and Evaluation
We provide predefined training and testing hyperparameters in the config files under each method's `configs/` directory, which can be adjusted as needed. The corresponding training and testing scripts are located in the `scripts` directory. Once all paths are correctly configured, the scripts should run without issues. For example:
```
cd LLaVA/LoRA-FT
sh scripts/MCITlib/Train/train_DCL.sh
```
The program will automatically perform both training and inference. However, for ModalPrompt (LLaVA version), training and inference must be executed separately. Please refer to its [repository](https://github.com/AuroraZengfh/ModalPrompt) for detailed instructions.
## Citation
```bibtex
@article{guo2025mcitlib,
  title={MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark},
  author={Guo, Haiyang and Zhu, Fei and Zhao, Hongbo and Zeng, Fanhu and Liu, Wenzhuo and Ma, Shijie and Wang, Da-Han and Zhang, Xu-Yao},
  journal={arXiv preprint arXiv:2508.07307},
  year={2025}
}
```
## Acknowledgments
We thank the following repositories for providing helpful functions used in our work.
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [CoIN](https://github.com/zackschen/CoIN)
- [O-LoRA](https://github.com/cmnfriend/O-LoRA)
- [ModalPrompt](https://github.com/AuroraZengfh/ModalPrompt)
- [CL-MoE](https://github.com/ECNU-ICALK/CL-MoE)
- [HiDe-LLaVA](https://github.com/Ghy0501/HiDe-LLaVA)
- [SEFE](https://github.com/jinpeng0528/SEFE)
- [FCIT](https://github.com/Ghy0501/FCIT)
## Contact
If you have any questions or suggestions for new features, please open an issue or contact the author, Haiyang Guo (guohaiyang2023@ia.ac.cn).