# MMLongCite
**Repository Path**: ByteDance/MMLongCite
## Basic Information
- **Project Name**: MMLongCite
- **License**: Apache-2.0
- **Default Branch**: main
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-04
- **Last Updated**: 2026-06-02
## README
# MMLONGCITE: A Benchmark for Evaluating Faithfulness of Long-Context Vision-Language Models
## 📢 News
- **[May, 2026]** Major update to MMLongCite: we extend the benchmark up to 128K tokens and refine the task taxonomy for more comprehensive faithfulness evaluation.
- **[October, 2025]** Code and data of MMLongCite are now publicly available.
## 🚀 Quick Navigation
- [🔍 Benchmark Overview](#-benchmark-overview)
- [⚙️ Preparation](#️-preparation)
  - [Environment Setup](#environment)
  - [Data Preparation](#data-prepare)
- [🤖️ Inference & Evaluation](#️-inference--evaluation)
  - [Inference](#inference)
  - [Evaluation](#evaluation)
- [📝 Citation](#-citation)
- [🏷️ License](#-license)
## 🔍 Benchmark Overview
MMLongCite is a benchmark for evaluating the **faithfulness** of long-context vision-language models (LCVLMs) through multimodal citation generation.
The benchmark contains 2,280 examples across 8 tasks, covering **image-only, image-text interleaved, and video-only** contexts. Context lengths range from **8K to 128K tokens**.
We also introduce MMLongCite-HR, a **high-resolution** setting that evaluates fine-grained visual grounding in dense stitched-image inputs. MMLongCite-HR provides two modes: the easy mode stitches 4 images into a single large image with an average resolution of 1K–2K, while the hard mode stitches 16 images into a single large image with an average resolution of 2K–4K.
Figure 1: Task format in MMLongCite.
Figure 2: Statistics of tasks in MMLongCite.
## ⚙️ Preparation
### Environment
From the project root, activate your conda environment and install the dependencies:
```
conda activate /your/env_name
pip install -r requirements.txt
```
### Data Prepare
You can download the MMLongCite data from [🤗 Hugging Face](https://huggingface.co/datasets/Jonaszky123/MMLongCite). Once downloaded, place the data in the root directory of the repository.
The folder structure is organized as follows:
```
project/
├── data/                  # Downloaded from Hugging Face
│   ├── mmlongcite/
│   └── mmlongcite-hr/
│       ├── easy/
│       └── hard/
├── images/                # Downloaded from Hugging Face
│   ├── mmlongcite/
│   └── mmlongcite-hr/
│       ├── easy/
│       └── hard/
├── scripts/
│   ├── infer.sh
│   └── eval.sh
├── src/                   # Source code
├── results/               # Benchmark inference outputs
└── readme.md              # Documentation
```
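Before running inference, it can help to confirm that the downloaded folders are in place. The following is a minimal sketch, not part of the repository: the helper `missing_dirs` and the `EXPECTED_DIRS` list are hypothetical names, and the nesting of `easy/` and `hard/` under `mmlongcite-hr/` is assumed from the tree above.

```python
from pathlib import Path

# Directories taken from the folder structure above. The easy/hard
# sub-folders are assumed to sit under mmlongcite-hr/.
EXPECTED_DIRS = [
    "data/mmlongcite",
    "data/mmlongcite-hr/easy",
    "data/mmlongcite-hr/hard",
    "images/mmlongcite",
    "images/mmlongcite-hr/easy",
    "images/mmlongcite-hr/hard",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("Data layout looks complete.")
```

Run it from the project root; an empty result means every expected folder was found.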
All data in MMLongCite follows the format below:
- `id`: A unique identifier for the data sample.
- `context`: A list containing all the contextual information (e.g., images, text) needed to answer the question.
- `question`: A list containing the specific question to be answered, which may include text and multiple-choice options.
- `ground_truth`: The correct answer for the question.
- `meta`: A dictionary containing additional information for each case, including:
  - `text_length`: The length of text content within the context.
  - `mm_length`: The length of multi-modal content within the context.
  - `evidence_ids`: A list of position identifiers indicating where the supporting evidence is located within the long context.
Here is an example:
```
{
  "id": 7,
  "context": [
    {
      "type": "image",
      "image": "image/mmlongcite/longdocurl/4120884_59.png"
    },
    ...
  ],
  "question": [
    {
      "type": "text",
      "text": "Which para title discusses non-GAAP financial measures in the document?"
    }
  ],
  "ground_truth": "Non-US. GAAP Financial Measures",
  "meta": {
    "text_length": 0,
    "mm_length": 11760,
    "evidence_ids": [
      7,
      12
    ]
  }
}
```
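The field list above can be turned into a quick sanity check for a loaded sample. This is a minimal sketch under stated assumptions: `validate_sample` and `count_context_types` are hypothetical helpers (not provided by the repository), and the inline `sample` is an abbreviated copy of the example above with a single context item.

```python
# Top-level and meta fields as listed in the data format description.
REQUIRED_KEYS = {"id", "context", "question", "ground_truth", "meta"}
REQUIRED_META_KEYS = {"text_length", "mm_length", "evidence_ids"}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(sample)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    missing_meta = REQUIRED_META_KEYS - set(sample.get("meta", {}))
    if missing_meta:
        problems.append(f"missing meta keys: {sorted(missing_meta)}")
    # Every context/question entry is expected to declare its type.
    for field in ("context", "question"):
        for i, item in enumerate(sample.get(field, [])):
            if "type" not in item:
                problems.append(f"{field}[{i}] has no 'type'")
    return problems


def count_context_types(sample):
    """Count context entries per type, e.g. how many images a sample holds."""
    counts = {}
    for item in sample.get("context", []):
        kind = item.get("type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Abbreviated version of the example above (one context item only).
sample = {
    "id": 7,
    "context": [
        {"type": "image", "image": "image/mmlongcite/longdocurl/4120884_59.png"},
    ],
    "question": [
        {
            "type": "text",
            "text": "Which para title discusses non-GAAP financial measures in the document?",
        }
    ],
    "ground_truth": "Non-US. GAAP Financial Measures",
    "meta": {"text_length": 0, "mm_length": 11760, "evidence_ids": [7, 12]},
}

print(validate_sample(sample))
print(count_context_types(sample))
```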
## 🤖️ Inference & Evaluation
### Inference
We provide a vLLM-based inference script in `src/infer_vllm.py`.
Run inference on the main MMLongCite benchmark:
```
python src/infer_vllm.py \
    --model <model> \
    --dataset longdocurl mmlongbench-doc slidevqa 2wikimultihopqa mm-niah visual-haystack video-mme longvideobench
```
Run inference on MMLongCite-HR:
```
python src/infer_vllm.py \
    --model <model> \
    --dataset longdocurl-hr-easy longdocurl-hr-hard \
        2wikimultihopqa-hr-easy 2wikimultihopqa-hr-hard \
        visual-haystack-hr-easy visual-haystack-hr-hard \
        longvideobench-hr-easy longvideobench-hr-hard
```
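The MMLongCite-HR dataset names in the command above follow a `<task>-hr-<mode>` pattern. As a small sketch (the helper `hr_dataset_names` is hypothetical and not part of the repository), the full list can be generated rather than typed by hand:

```python
# Base tasks and modes as they appear in the MMLongCite-HR command above.
HR_BASE_TASKS = ["longdocurl", "2wikimultihopqa", "visual-haystack", "longvideobench"]
HR_MODES = ["easy", "hard"]


def hr_dataset_names(tasks=HR_BASE_TASKS, modes=HR_MODES):
    """Build dataset names of the form '<task>-hr-<mode>'."""
    return [f"{task}-hr-{mode}" for task in tasks for mode in modes]


# Space-separated list suitable for passing to --dataset.
print(" ".join(hr_dataset_names()))
```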
For models with thinking mode enabled:
```
python src/infer_vllm.py \
    --model <model> \
    --dataset longdocurl mmlongbench-doc slidevqa 2wikimultihopqa mm-niah visual-haystack video-mme longvideobench \
    --thinking
```
Prediction files are saved to:
```
results/<model>/<dataset>.json
results/<model>/<dataset>_thinking.json
```
You can also refer to the example script:
```
bash script/infer.sh
```
### Evaluation
MMLongCite evaluates both citation quality and answer correctness. We use GPT-5.2 as the judge model, and the evaluation scripts support passing in multiple API keys to accelerate evaluation.
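To illustrate how several API keys can speed up judging, here is a minimal sketch of one possible scheme: keys are assigned to cases round-robin and requests run concurrently. This is an assumption about the general idea, not a description of how `src/eval_cite.py` or `src/eval_correct.py` are implemented; `fake_judge` and `judge_all` are hypothetical names, and `fake_judge` stands in for a real judge-model request.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def fake_judge(case, api_key):
    """Placeholder for a real judge-model request (hypothetical)."""
    return {"id": case["id"], "key_used": api_key}


def judge_all(cases, api_keys, judge=fake_judge):
    """Assign keys round-robin and judge cases concurrently.

    The pool uses as many worker threads as there are keys.
    """
    if not api_keys:
        raise ValueError("at least one API key is required")
    # zip stops when `cases` is exhausted, so cycle() never runs forever.
    assignments = list(zip(cases, cycle(api_keys)))
    with ThreadPoolExecutor(max_workers=len(api_keys)) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(lambda pair: judge(*pair), assignments))


cases = [{"id": i} for i in range(5)]
results = judge_all(cases, ["key-a", "key-b"])
print([r["key_used"] for r in results])
```

With two keys, consecutive cases alternate between them, so each key carries roughly half of the requests.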
#### Citation Evaluation
```
python src/eval_cite.py \
    --model <model> \
    --task <task> \
    --api_keys <api_keys> \
    --api_base_url <api_base_url>
```
This produces:
```
results/<model>/<task>_citation_result.json
results/<model>/<task>_citation_score.json
```
#### Correctness Evaluation
```
python src/eval_correct.py \
    --model <model> \
    --task <task> \
    --api_keys <api_keys> \
    --api_base_url <api_base_url>
```
This produces:
```
results/<model>/<task>_correctness_result.json
results/<model>/<task>_correctness_score.json
```
Among the outputs, `citation_result.json` and `correctness_result.json` store per-case metrics for every example, while `citation_score.json` and `correctness_score.json` store the overall aggregated metrics for each task. Example evaluation commands are provided in:
```bash
bash script/eval.sh
```
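The relationship between the per-case result files and the aggregated score files can be sketched as follows. This is an illustration only: the record layout, the metric names `citation_f1` and `correctness`, and the use of an unweighted mean are all assumptions, since the actual file schema and aggregation rule are defined by the evaluation scripts.

```python
from statistics import mean


def aggregate(per_case, metric_names):
    """Average each named metric over all per-case records (unweighted mean)."""
    return {name: mean(case[name] for case in per_case) for name in metric_names}


# Hypothetical per-case records; real files may use different fields.
per_case = [
    {"id": 1, "citation_f1": 0.5, "correctness": 1.0},
    {"id": 2, "citation_f1": 1.0, "correctness": 0.0},
]

scores = aggregate(per_case, ["citation_f1", "correctness"])
print(scores)
```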
## 📝 Citation
If you find our work helpful, please cite our paper:
```
@article{zhou2025mmlongcite,
title={MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models},
author={Zhou, Keyan and Tang, Zecheng and Ming, Lingfeng and Zhou, Guanghao and Chen, Qiguang and Qiao, Dan and Yang, Zheming and Qin, Libo and Qiu, Minghui and Li, Juntao and others},
journal={arXiv preprint arXiv:2510.13276},
year={2025}
}
```
## 🏷️ License
All code in this repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).