# RSGPT

**RSGPT: A Remote Sensing Vision Language Model and Benchmark**

[Yuan Hu](https://scholar.google.com.sg/citations?user=NFRuz4kAAAAJ&hl=zh-CN), Jianlong Yuan, [Congcong Wen](https://wencc.xyz), Xiaonan Lu, [Xiang Li☨](https://xiangli.ac.cn)

☨corresponding author

This is an ongoing project. We are working on increasing the dataset size.

## Related Projects

**RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model**
[Congcong Wen*](https://wencc.xyz/), Yiting Lin*, Xiaokang Qu, Nan Li, Yong Liao, Hui Lin, [Xiang Li](https://xiangli.ac.cn)

**FedRSCLIP: Federated learning for remote sensing scene classification using vision-language models**
Hui Lin*, Chao Zhang*, Danfeng Hong, Kexin Dong, and [Congcong Wen☨](https://wencc.xyz)

**RS-MoE: A Vision–Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering**
Hui Lin*, Danfeng Hong*, Shuhang Ge*, Chuyao Luo, Kai Jiang, Hao Jin, and [Congcong Wen☨](https://wencc.xyz)

**VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding**
Xiang Li, Jian Ding, Mohamed Elhoseiny

**Vision-language models in remote sensing: Current progress and future trends**
[Xiang Li*☨](https://xiangli.ac.cn), [Congcong Wen*](https://wencc.xyz/), [Yuan Hu*](https://scholar.google.com.sg/citations?user=NFRuz4kAAAAJ&hl=zh-CN), Zhenghang Yuan, [Xiao Xiang Zhu](https://www.professoren.tum.de/en/zhu-xiaoxiang)

**RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision**
[Xiang Li](https://xiangli.ac.cn), [Congcong Wen](https://wencc.xyz/), [Yuan Hu](https://scholar.google.com.sg/citations?user=NFRuz4kAAAAJ&hl=zh-CN), Nan Zhou

## :fire: Updates

* **[2025.05.08]** We release the code for training and testing RSGPT.
* **[2024.12.18]** We release the [manual scoring results](https://drive.google.com/file/d/1e3joLIiWfUgena17Dx8wZPWGNjs7vGua/view?usp=sharing) for RSIEval.
* **[2024.06.19]** We release VRSBench, a Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. VRSBench contains 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. Check the [VRSBench Project Page](https://vrsbench.github.io/).
* **[2024.05.23]** We release the RSICap dataset. Please fill out this [form](https://docs.google.com/forms/d/1h5ydiswunM_EMfZZtyJjNiTMpeOzRwooXh73AOqokzU/edit) to get both the RSICap and RSIEval datasets.
* **[2023.11.10]** We release our survey on vision-language models in remote sensing: [RSVLM](https://arxiv.org/pdf/2305.05726.pdf).
* **[2023.10.22]** The RSICap dataset and code will be released upon paper acceptance.
* **[2023.10.22]** We release the evaluation dataset RSIEval. Please fill out this [form](https://docs.google.com/forms/d/1h5ydiswunM_EMfZZtyJjNiTMpeOzRwooXh73AOqokzU/edit) to get the RSIEval dataset.

## Dataset

* RSICap: 2,585 image-text pairs with high-quality human-annotated captions.
* RSIEval: 100 high-quality human-annotated captions with 936 open-ended visual question-answer pairs.

## Code

The idea of finetuning our vision-language model is borrowed from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4). Our model is based on finetuning [InstructBLIP](https://github.com/salesforce/LAVIS/blob/main/projects/instructblip/README.md) on our RSICap dataset.
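As a hypothetical illustration of how RSICap-style image-text pairs might be consumed in code (the file name and JSON schema below are assumptions for illustration only; check the downloaded dataset for the actual format):

```python
import json

# Hypothetical RSICap-style annotation snippet. The released dataset's
# file layout and field names may differ; this only illustrates the
# general image/caption pairing described above.
sample = '''
[
  {"filename": "P0001.png",
   "caption": "An aerial image of an airport with several airplanes parked near the terminal."},
  {"filename": "P0002.png",
   "caption": "A dense residential area crossed by a two-lane road."}
]
'''

pairs = json.loads(sample)
for item in pairs:
    # Each record pairs one remote sensing image with one human-annotated caption.
    print(f"{item['filename']} -> {len(item['caption'].split())} caption tokens")
```

A loader like this would feed the finetuning pipeline one (image, caption) pair at a time.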
## 🚀 Installation

Set up a conda environment using the provided `environment.yml` file:

### Step 1: Create the environment

```
conda env create -f environment.yml
```

### Step 2: Activate the environment

```
conda activate rsgpt
```

## Training

```
torchrun --nproc_per_node=8 train.py --cfg-path train_configs/rsgpt_train.yaml
```

## Testing

Test image captioning:

```
python test.py --cfg-path eval_configs/rsgpt_eval.yaml --gpu-id 0 --out-path rsgpt/output --task ic
```

Test visual question answering:

```
python test.py --cfg-path eval_configs/rsgpt_eval.yaml --gpu-id 0 --out-path rsgpt/output --task vqa
```

## Acknowledgement

+ [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4). A popular open-source vision-language model.
+ [InstructBLIP](https://github.com/salesforce/LAVIS/blob/main/projects/instructblip/README.md). The model architecture of RSGPT follows InstructBLIP. Don't forget to check out this great open-source work if you haven't seen it before!
+ [Lavis](https://github.com/salesforce/LAVIS). This repository is built upon Lavis!
+ [Vicuna](https://github.com/lm-sys/FastChat). The fantastic language ability of Vicuna with only 13B parameters is amazing. And it is open-source!

## Citation

If you use RSGPT in your research or applications, please cite using this BibTeX:

```bibtex
@article{hu2025rsgpt,
  title={Rsgpt: A remote sensing vision language model and benchmark},
  author={Hu, Yuan and Yuan, Jianlong and Wen, Congcong and Lu, Xiaonan and Liu, Yu and Li, Xiang},
  journal={ISPRS Journal of Photogrammetry and Remote Sensing},
  volume={224},
  pages={272--286},
  year={2025},
  publisher={Elsevier}
}
```
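For reference, captioning outputs such as those produced by `test.py --task ic` are commonly scored against reference captions with n-gram overlap metrics. The following is a minimal, self-contained sketch of that idea (a simplified unigram precision; it is an illustration only, not this repository's actual evaluation script, which may use full BLEU/CIDEr implementations):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    with clipping so repeated tokens are not over-counted.
    A simplified stand-in for BLEU-1-style scoring."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = 0
    for tok in cand:
        if ref_counts[tok] > 0:   # clip: each reference token matches at most once
            matched += 1
            ref_counts[tok] -= 1
    return matched / len(cand)

score = unigram_precision(
    "an airport with several parked airplanes",
    "an aerial image of an airport with airplanes parked near the terminal",
)
print(f"unigram precision: {score:.3f}")
```

Real evaluations additionally use higher-order n-grams and multiple references, but the clipped-match idea above is the common core.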