# MTVQA
**Repository Path**: ByteDance/MTVQA
## Basic Information
- **Project Name**: MTVQA
- **Description**: MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 0
- **Created**: 2024-06-01
- **Last Updated**: 2025-12-31
## Categories & Tags
**Categories**: cv
**Tags**: None
## README
# MTVQA
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
> Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Furthermore, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset, it is evident that there is still large room for performance improvement, underscoring the value of the dataset. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension.
| **[Project Page](https://bytedance.github.io/MTVQA/)** | **[Paper](https://arxiv.org/abs/2405.11985)** | **[Dataset](https://huggingface.co/datasets/ByteDance/MTVQA)** | **[Leaderboard](https://github.com/bytedance/MTVQA?tab=readme-ov-file#-leaderboard)** |
## 🔥 News
* **`2025.05.16`** MTVQA has been accepted by ACL 2025!
* **`2025.03.25`** The [Elice](https://elice.io/en) team from Korea evaluated their MLLM **Helpy-V Reasoning** on MTVQA. Helpy-V Reasoning ranks **second** among all models, and its **Korean text comprehension** in particular is far ahead of previous SOTA models. Congratulations to the Elice team!
* **`2024.12.12`** InternVL2.5 reported its results on MTVQA: the InternVL2.5 78B model outperforms Qwen2-VL 72B and achieves the new SOTA performance. Congratulations to the [InternVL2.5](https://github.com/OpenGVLab/InternVL?tab=readme-ov-file) team!
* **`2024.09.29`** The BlueLM team from VIVO evaluated BlueLM-V-3B on MTVQA. BlueLM-V-3B achieves performance comparable to GPT-4o and ranks third among all SOTA MLLMs!
* **`2024.09.09`** We evaluated GPT-4o mini on MTVQA; it performs exceptionally well among the leading lightweight MLLMs!
* **`2024.09.04`** InternVL2 reported its results on MTVQA: the InternVL2 76B model outperforms GPT-4V. Thanks to the [InternVL2](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html#mtvqa) team!
* **`2024.08.30`** Qwen2-VL 72B is released, outperforming GPT-4o and achieving the best overall performance. Congratulations!
* **`2024.07.23`** MTVQA is now supported in [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
* **`2024.07.23`** MTVQA is now supported in [OpenCompass](https://opencompass.org.cn/home).
* **`2024.06.04`** We are excited to launch MTVQA, the first multilingual visual text comprehension evaluation benchmark for MLLMs! MTVQA covers **9** widely used but low-resource languages, i.e., AR, DE, FR, IT, JA, KO, RU, TH, and VI.
* **`2024.06.04`** GPT-4o achieves the best overall performance; MiniCPM-V2.5 achieves the best performance among open-source models!
## Data
| [RawData (Google Drive)](https://drive.google.com/file/d/1u09EVNVj17ws_AHEB7Y0eZiSPseTJUTx/view?usp=sharing) | [Huggingface Dataset](https://huggingface.co/datasets/ByteDance/MTVQA) |
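For convenience, the snippet below sketches one way to pull the benchmark from the Hugging Face Hub with the `datasets` library. The split name and the field names (`image`, `question`, `answer`) are assumptions for illustration only; check the dataset card at `ByteDance/MTVQA` for the actual schema.

```python
# Minimal sketch: loading MTVQA from the Hugging Face Hub.
# The split and field names below are assumptions; inspect the dataset card
# and the returned columns before relying on them.
from datasets import load_dataset

dataset = load_dataset("ByteDance/MTVQA", split="test")

sample = dataset[0]
print(sample.keys())  # inspect the real column names first
# A text-centric VQA record typically carries an image plus a QA pair, e.g.:
# sample["image"]    -> image containing scene text
# sample["question"] -> question in one of the 9 target languages
# sample["answer"]   -> human-annotated answer in the same language
```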
## Evaluation
The test code for evaluating models in the paper can be found in [scripts](./scripts).
If you would like to add your results to the MTVQA leaderboard, feel free to email us directly at tangjingqun@bytedance.com, haoliu.0128@bytedance.com, or can.huang@bytedance.com.
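One protocol widely used for text-centric VQA scoring counts a prediction as correct when the normalized ground-truth answer appears in the normalized model output, and per-language scores are then averaged into the AVG. column. The sketch below only illustrates that idea; it is not the exact script in [scripts](./scripts), and the normalization rules and function names are assumptions.

```python
# Minimal sketch of a containment-style accuracy for text-centric VQA.
# NOT the exact evaluation script in ./scripts; normalization is an assumption.
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def is_correct(prediction: str, answer: str) -> bool:
    """Count a prediction as correct if the ground-truth answer is contained in it."""
    return normalize(answer) in normalize(prediction)


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Accuracy (in %) over one language; average per-language scores for AVG."""
    assert len(predictions) == len(answers)
    hits = sum(is_correct(p, a) for p, a in zip(predictions, answers))
    return 100.0 * hits / max(len(answers), 1)


# Example: two French predictions against their ground-truth answers -> 50.0
print(accuracy(["La réponse est 42 euros", "inconnu"], ["42 euros", "15 mars"]))
```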
## Leaderboard
| Models | Open-Source | AR | DE | FR | IT | JA | KO | RU | TH | VI | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL2.5 78B 🥇 | ✅ | 15.9 | 39.0 | **45.6** | **42.9** | 21.1 | 33.9 | 12.2 | **23.8** | 41.5 | **31.9** |
| Helpy-V Reasoning 🥈 | ❌ | 16.6 | 34.7 | 40.0 | 39.6 | 20.2 | **43.5** | 13.6 | 13.0 | 46.6 | 31.6 |
| Qwen2-VL 72B 🥉 | ✅ | **20.7** | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | **15.6** | 17.7 | 41.6 | 30.9 |
| GPT-4o | ❌ | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
| BlueLM-V-3B | ❌ | 17.3 | **39.5** | 44.7 | 32.2 | **23.5** | 34.0 | 9.2 | 20.3 | 22.9 | 27.0 |
| Claude3 Opus | ❌ | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
| Qwen2-VL 7B | ✅ | 15.5 | 32.1 | 41.6 | 38.9 | 17.8 | 30.6 | 13.0 | 10.8 | 30.0 | 25.6 |
| GPT-4o mini | ❌ | 16.9 | 33.0 | 41.2 | 32.1 | 18.5 | 27.4 | 11.5 | 19.9 | 29.1 | 25.5 |
| Gemini Ultra | ❌ | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
| InternVL2 76B | ✅ | 9.5 | 31.3 | 35.7 | 35.2 | 11.1 | 14.3 | 11.9 | 10.0 | 26.9 | 22.0 |
| GPT-4V | ❌ | 11.5 | 31.5 | 40.4 | 32.3 | 11.5 | 16.7 | 10.3 | 15.0 | 28.9 | 22.0 |
| QwenVL Max | ❌ | 7.7 | 31.4 | 37.6 | 30.2 | 18.6 | 25.4 | 10.4 | 4.8 | 23.5 | 21.1 |
| Claude3 Sonnet | ❌ | 10.5 | 28.9 | 35.6 | 31.8 | 13.9 | 22.2 | 11.0 | 15.2 | 20.8 | 21.1 |
| QwenVL Plus | ❌ | 4.8 | 28.8 | 33.7 | 27.1 | 12.8 | 19.9 | 9.4 | 5.6 | 18.1 | 17.8 |
| MiniCPM-V2.5 | ✅ | 6.1 | 29.6 | 35.7 | 26.0 | 12.1 | 13.1 | 5.7 | 12.6 | 15.3 | 17.3 |
| InternVL-V1.5 | ✅ | 3.4 | 27.1 | 31.4 | 27.1 | 9.9 | 9.0 | 4.9 | 8.7 | 12.4 | 14.9 |
| GLM4V | ❌ | 0.3 | 30.0 | 34.1 | 30.1 | 3.4 | 5.7 | 3.0 | 3.5 | 12.3 | 13.6 |
| TextSquare | ❌ | 3.7 | 27.0 | 30.8 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 | 13.6 |
| Mini-Gemini-HD-34B | ✅ | 2.2 | 25.0 | 29.2 | 25.5 | 6.1 | 8.6 | 4.1 | 4.3 | 11.8 | 13.0 |
| Xcomposer2-4KHD | ✅ | 2.0 | 20.6 | 23.2 | 21.6 | 5.6 | 7.7 | 4.1 | 6.1 | 10.1 | 11.2 |
| Llava-Next-34B | ✅ | 3.3 | 24.0 | 28.0 | 22.3 | 3.6 | 6.1 | 2.6 | 0.4 | 9.8 | 11.1 |
| TextMonkey | ✅ | 2.0 | 18.1 | 19.9 | 22.1 | 4.6 | 7.2 | 3.2 | 0.9 | 11.1 | 9.9 |
| MiniCPM-V2.0 | ✅ | 1.3 | 12.7 | 14.9 | 17.0 | 3.7 | 5.6 | 2.2 | 2.2 | 6.8 | 7.4 |
| mPLUG-DocOwl 1.5 | ✅ | 1.0 | 13.9 | 14.9 | 18.2 | 2.9 | 5.0 | 2.0 | 0.9 | 6.4 | 7.2 |
| YI-VL-34B | ✅ | 1.7 | 13.5 | 15.7 | 12.1 | 4.8 | 5.2 | 0.8 | 3.5 | 4.1 | 6.8 |
| DeepSeek-VL | ✅ | 0.6 | 14.2 | 15.3 | 15.2 | 2.9 | 3.8 | 1.6 | 0.9 | 5.2 | 6.6 |