Deyao Zhu* (On Job Market!), Jun Chen* (On Job Market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal Contribution
King Abdullah University of Science and Technology
We now provide a pretrained MiniGPT-4 aligned with Vicuna-7B! The demo's GPU memory consumption can now be as low as 12 GB.
Click the image to chat with MiniGPT-4 about your images
More examples can be found on the project page.
1. Prepare the code and the environment
Git clone our repository, create a Python environment, and activate it via the following commands
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
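As an optional sanity check (not part of the official instructions), you can confirm that PyTorch is installed and the GPU is visible inside the new environment:

```python
# Optional sanity check: confirm PyTorch is installed and a GPU is visible.
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a CUDA machine
```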
2. Prepare the pretrained Vicuna weights
The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Please refer to our instruction here to prepare the Vicuna weights. The final weights should be in a single folder with a structure similar to the following:
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
Then, set the path to the Vicuna weights in the model config file here at Line 16.
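If you want to double-check the path before launching anything, a minimal sketch like the following works; the config location minigpt4/configs/models/minigpt4.yaml and the model.llama_model key are assumptions based on the repo layout:

```python
# Minimal sanity check (assumes the repo's minigpt4/configs/models/minigpt4.yaml
# layout with a model.llama_model key pointing at the Vicuna weights folder).
from pathlib import Path

import yaml  # requires pyyaml

with open("minigpt4/configs/models/minigpt4.yaml") as f:
    cfg = yaml.safe_load(f)

weights = Path(cfg["model"]["llama_model"])
assert (weights / "config.json").exists(), f"Vicuna weights not found at {weights}"
```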
3. Prepare the pretrained MiniGPT-4 checkpoint
Download the pretrained checkpoints according to the Vicuna model you prepared.
| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
| --- | --- |
| Download | Download |
Then, set the path to the pretrained checkpoint in the evaluation config file in eval_configs/minigpt4_eval.yaml at Line 11.
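As a quick optional check that the download is intact, you can try deserializing it; the filename below is hypothetical, so substitute wherever you saved the checkpoint:

```python
# Optional check (assumption: the download is a standard PyTorch checkpoint):
# make sure the pretrained MiniGPT-4 checkpoint deserializes without errors.
import torch

ckpt = torch.load("pretrained_minigpt4.pth", map_location="cpu")  # hypothetical filename
print(list(ckpt.keys()))  # expect a small dict of top-level entries
```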
Try out our demo demo.py on your local machine by running
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
To save GPU memory, Vicuna is loaded in 8-bit by default, with a beam search width of 1. This configuration requires about 23 GB of GPU memory for Vicuna 13B and 11.5 GB for Vicuna 7B. On more powerful GPUs, you can run the model in 16-bit by setting low_resource to False in the config file minigpt4_eval.yaml and using a larger beam search width.
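For intuition about what the low_resource flag controls, here is a simplified sketch, not the repo's exact code, of loading a language model in 8-bit versus 16-bit with Hugging Face Transformers (8-bit loading additionally requires bitsandbytes and accelerate):

```python
# Simplified sketch (an assumption about the implementation, not the repo's
# exact code): low_resource toggles 8-bit quantized loading versus fp16.
import torch
from transformers import AutoModelForCausalLM

low_resource = True  # mirrors low_resource in eval_configs/minigpt4_eval.yaml

if low_resource:
    # 8-bit weights roughly halve memory relative to fp16.
    model = AutoModelForCausalLM.from_pretrained(
        "path/to/vicuna_weights", load_in_8bit=True, device_map={"": 0}
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        "path/to/vicuna_weights", torch_dtype=torch.float16
    ).to("cuda:0")
```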
Thanks to @WangRongsheng, you can also run our code on Colab
The training of MiniGPT-4 contains two alignment stages.
1. First pretraining stage
In the first pretraining stage, the model is trained on image-text pairs from the Laion and CC datasets to align the vision and language models. To download and prepare the datasets, please check our first stage dataset preparation instruction. After the first stage, the visual features are mapped into a space the language model can understand. To launch the first stage training, run the following command. In our experiments, we use 4 A100s. You can change the save path in the config file train_configs/minigpt4_stage1_pretrain.yaml.
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
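Conceptually, the trainable part in this stage is small. The sketch below is a simplified illustration, with a hypothetical class name and dimensions, of a linear layer projecting frozen visual features into the language model's embedding space:

```python
# Conceptual illustration only (class name and dimensions are hypothetical):
# stage 1 trains a projection from frozen visual features into the frozen
# language model's input embedding space.
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(vision_feats)

tokens = torch.randn(2, 32, 768)
print(VisionToLLMProjection()(tokens).shape)  # torch.Size([2, 32, 5120])
```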
A MiniGPT-4 checkpoint with only stage one training can be downloaded here (13B) or here (7B). Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.
2. Second finetuning stage
In the second stage, we use a small, high-quality image-text pair dataset that we created ourselves and convert it to a conversation format to further align MiniGPT-4. To download and prepare our second stage dataset, please check our second stage dataset preparation instruction. To launch the second stage alignment, first specify the path to the checkpoint file trained in stage 1 in train_configs/minigpt4_stage2_finetune.yaml. You can also specify the output path there. Then, run the following command. In our experiments, we use 1 A100.
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
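To illustrate what "conversation format" means here, a minimal sketch follows; the exact template string is an assumption modeled on the repo's prompt style, not a verbatim copy:

```python
# Illustrative only: wraps an image placeholder and an instruction in a
# human/assistant conversation template (exact string is an assumption).
def to_conversation(instruction: str) -> str:
    # <ImageHere> is replaced by projected image embeddings at run time.
    return f"###Human: <Img><ImageHere></Img> {instruction} ###Assistant: "

print(to_conversation("Describe this image in detail."))
```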
After the second stage alignment, MiniGPT-4 is able to talk about images coherently and in a user-friendly way.
If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
@article{zhu2023minigpt,
title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2304.10592},
year={2023}
}
This repository is under the BSD 3-Clause License. Much of the code is based on Lavis, which is also licensed under the BSD 3-Clause License here.