# UnifiedReward
## UnifiedReward Team Works
### Benchmarks
> [**UniREditBench: A Unified Reasoning-based Image Editing Benchmark**](https://maplebb.github.io/UniREditBench/): We propose **UniREditBench**, a unified reasoning-based image editing benchmark, and further construct **UniREdit-Data-100K**, a large-scale synthetic dataset with high-quality CoT annotations, and develop **UniREdit-Bagel** by fine-tuning Bagel on this dataset.
>
> [Dataset: UniREdit-Data-100K](https://huggingface.co/datasets/maplebb/UniREdit-Data-100K) · [Model: UniREdit-Bagel](https://huggingface.co/maplebb/UniREdit-Bagel)
> [**UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation**](https://codegoat24.github.io/UniGenBench): We propose **UniGenBench++**, a unified semantic benchmark for T2I generation. It supports both **short and long prompts in Chinese and English**, featuring a **streamlined evaluation pipeline** and a robust **offline evaluation model**.
>
> [Dataset: UniGenBench-Eval-Images](https://huggingface.co/datasets/CodeGoat24/UniGenBench-Eval-Images) · [Model: UniGenBench-EvalModel-qwen-72b-v1](https://huggingface.co/CodeGoat24/UniGenBench-EvalModel-qwen-72b-v1)
> Leaderboards: [English](https://huggingface.co/spaces/CodeGoat24/UniGenBench_Leaderboard) · [Chinese](https://huggingface.co/spaces/CodeGoat24/UniGenBench_Leaderboard_Chinese) · [English (long prompts)](https://huggingface.co/spaces/CodeGoat24/UniGenBench_Leaderboard_English_Long) · [Chinese (long prompts)](https://huggingface.co/spaces/CodeGoat24/UniGenBench_Leaderboard_Chinese_Long)
### Models
> [**Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning**](https://codegoat24.github.io/UnifiedReward/Pref-GRPO): We propose **Pref-GRPO**, the first **preference reward-based GRPO method** for stable T2I reinforcement learning, and **UniGenBench**, a **unified T2I generation benchmark** for fine-grained semantic consistency evaluation.
>
> [Leaderboard: UniGenBench](https://huggingface.co/spaces/CodeGoat24/UniGenBench_Leaderboard)
> [**NeurIPS 2025**] [**Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning**](https://codegoat24.github.io/UnifiedReward/think): We propose **UnifiedReward-Think**, the first unified multimodal CoT reward model.
>
> [Models: UnifiedReward-Think-qwen3vl-8b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen3vl-8b) · [UnifiedReward-Think-qwen-7b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen-7b) · [UnifiedReward-Think-7b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-7b)
> [**Unified Reward Model for Multimodal Understanding and Generation**](https://codegoat24.github.io/UnifiedReward/): We release **UnifiedReward**, **the first unified reward model for multimodal understanding and generation assessment**, enabling both pairwise ranking and pointwise scoring.
### ✨ **Awesome Works using UnifiedReward**
- 😊 Meta, [Transition Matching: Scalable and Flexible Generative Modeling](https://arxiv.org/pdf/2506.23589).
- 😊 NVIDIA, Stanford, Tsinghua, [DiffusionNFT: Online Diffusion Reinforcement with Forward Process](https://arxiv.org/pdf/2509.16117). [![[code]](https://img.shields.io/github/stars/NVlabs/DiffusionNFT)](https://github.com/NVlabs/DiffusionNFT)
- 😊 Apple, Fudan, [UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning](https://arxiv.org/pdf/2511.14760).
- 😊 University of California, USTC, PKU, BIGAI, [MILR: Improving Multimodal Image Generation via Test-time Latent Reasoning](https://arxiv.org/pdf/2509.22761).
- 😊 Kuaishou, Tsinghua, CUHK, [Flow-GRPO: Training Flow Matching Models via Online RL](https://github.com/yifan123/flow_grpo). [![[code]](https://img.shields.io/github/stars/yifan123/flow_grpo)](https://github.com/yifan123/flow_grpo)
- 😊 Tencent Hunyuan, [MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE](https://arxiv.org/pdf/2507.21802). [![[code]](https://img.shields.io/github/stars/Tencent-Hunyuan/MixGRPO)](https://github.com/Tencent-Hunyuan/MixGRPO)
- 😊 Kling Team, CUHK MMLab, NJU, [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://arxiv.org/pdf/2510.10518). [![[code]](https://img.shields.io/github/stars/qunzhongwang/vr-thinker)](https://github.com/qunzhongwang/vr-thinker)
- 😊 CUHK MMLab, [Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO](https://arxiv.org/pdf/2505.17017). [![[code]](https://img.shields.io/github/stars/ZiyuGuo99/Image-Generation-CoT)](https://github.com/ZiyuGuo99/Image-Generation-CoT)
| Method | HPS | ImageReward | UnifiedReward |
|------------|-----------|-----------|-----------|
| Janus-Pro + DPO | 77.3 | 77.7 | **80.0** |
| Janus-Pro + GRPO | 79.2 | 79.3 | **81.0** |
| Janus-Pro + Best-of-4 | 82.1 | 82.4 | **84.5** |
- 😊 Tencent Hunyuan X, [X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again](https://arxiv.org/pdf/2507.22058). [![[code]](https://img.shields.io/github/stars/X-Omni-Team/X-Omni)](https://github.com/X-Omni-Team/X-Omni)
## 🔥 News
[2025/11/17] 🔥🔥🔥 We release **UnifiedReward-Think-qwen3vl**-[[2b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen3vl-2b)/[4b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen3vl-4b)/[8b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen3vl-8b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen3vl-32b)]. The inference code is provided [here](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think/inference_qwen/UnifiedReward-Think-qwen3vl-inference).
[2025/11/11] 🔥🔥🔥 We release **UnifiedReward-2.0-qwen3vl**-[[2b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen3vl-2b)/[4b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen3vl-4b)/[8b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen3vl-8b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen3vl-32b)] and **UnifiedReward-Edit-qwen3vl**-[[2b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen3vl-2b)/[4b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen3vl-4b)/[8b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen3vl-8b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen3vl-32b)]!!!
[2025/10/23] 🔥 We release **UnifiedReward-Edit**-qwen-[[3b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-3b)/[7b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-7b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-32b)/[72b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-72b)], a unified reward model for **both Text-to-Image and Image-to-Image generation**, trained on approximately 700K unified image generation and editing reward samples!!
For image-editing reward tasks, our models support:
>1. Pairwise Rank — directly judge which of two edited images is better.
>
>2. Pairwise Score — assign a separate score to each image in a pair.
>
>3. Pointwise Score — rate a single image on two axes: instruction-following and overall image quality.
🚀 The image-editing reward inference code is available in the [`UnifiedReward-Edit/`](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Edit) directory (a minimal request sketch follows below), while the T2I inference code is unchanged from previous models. The editing training data is preprocessed from [EditScore](https://huggingface.co/datasets/EditScore/EditScore-Reward-Data) and [EditReward](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) and will be released soon. We sincerely appreciate all contributors!!
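As a minimal illustration of the pairwise-rank mode, the sketch below assembles a chat-style message containing the source image and two edited candidates. The file paths and prompt wording here are illustrative assumptions, not the repo's exact prompts; see the scripts in `UnifiedReward-Edit/` for the reference versions.
```python
# Hypothetical pairwise-rank request for image editing.
# Paths and prompt wording are placeholders; adapt freely, since the model
# is not constrained to a fixed prompt style.
source, edit_a, edit_b = "source.jpg", "edit_a.jpg", "edit_b.jpg"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": source},  # original image
        {"type": "image", "image": edit_a},  # edited candidate 1
        {"type": "image", "image": edit_b},  # edited candidate 2
        {"type": "text", "text": (
            "Editing instruction: make the sky look like a sunset.\n"
            "Image 1 is the original; Images 2 and 3 are two edited results. "
            "Which edited image follows the instruction better while "
            "preserving the rest of the scene? Answer 'Image 2' or 'Image 3'."
        )},
    ],
}]
# Feed `messages` to the usual Qwen-VL chat pipeline
# (see the inference sketch in the Inference section below).
```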
[2025/9/25] 🔥 We release **UnifiedReward-2.0**-qwen-[[3b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-3b)/[7b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-7b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-32b)/[72b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-72b)].
This version introduces several new capabilities:
>1. **Pairwise scoring** for image and video generation assessment on the **_Alignment_**, **_Coherence_**, and **_Style_** dimensions.
>
>2. **Pointwise scoring** for image and video generation assessment on the **_Alignment_**, **_Coherence/Physics_**, and **_Style_** dimensions.

The corresponding inference code is available in the [`inference_qwen/UnifiedReward-2.0-inference`](https://github.com/CodeGoat24/UnifiedReward/tree/main/inference_qwen/UnifiedReward-2.0-inference) directory. The newly added training data has been released [here](https://huggingface.co/datasets/CodeGoat24/UnifiedReward-2.0-T2X-score-data) 😊.
😊 We are actively gathering feedback from the community to improve our models. **We welcome your input and encourage you to stay updated through our repository**!!
## Unified Reward Model for Multimodal Understanding and Generation
[UnifiedReward-2.0 models](https://huggingface.co/collections/CodeGoat24/unifiedreward-20-models-68b7c99ab70ff81184c70270) · [UnifiedReward-Edit models](https://huggingface.co/collections/CodeGoat24/unifiedreward-edit-models) · [UnifiedReward-1.0 models](https://huggingface.co/collections/CodeGoat24/unifiedreward-10-models-67c3008148c3a380d15ac63a) · [Training data](https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede)
😊 We appreciate the [mradermacher](https://huggingface.co/mradermacher) team for providing the [GGUF](https://huggingface.co/collections/CodeGoat24/unifiedreward-models-gguf-683fe14b5e2b8422049f45ca) version of our models, and the [Tencent Hunyuan](https://hunyuan.tencent.com/) team for providing the evaluation results on several T2I models using [UnifiedReward-qwen-7b](https://huggingface.co/CodeGoat24/UnifiedReward-qwen-7b)!! The evaluation was conducted on 400 prompts sourced from [here](https://artificialanalysis.ai/text-to-image/arena?tab=arena).
**Evaluation results on several T2I models:**
| Model | Alignment | Coherence | Style |
|---------------------|------------------|-----------------------|------------------|
| Flux-pro-ultra | 3.6453 | 3.8193 | 3.4971 |
| Imagen-4.0 | 3.6792 | 3.8049 | 3.4756 |
| Recraft-v3 | 3.6611 | 3.8409 | **3.5158** |
| OpenAI-GPT-image-1 | 3.6890 | **3.8448** | 3.4960 |
| Imagen-3.0 | 3.6733 | 3.8027 | 3.4674 |
| Seedream-3.0 | **3.6927** | 3.8218 | 3.4887 |
## 🔥🔥🔥 [NeurIPS 2025] **UnifiedReward-Think**
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

[UnifiedReward-Think-qwen3vl-8b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen3vl-8b) · [UnifiedReward-Think-qwen-7b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-qwen-7b) · [UnifiedReward-Think-7b](https://huggingface.co/CodeGoat24/UnifiedReward-Think-7b)
We release **UnifiedReward-Think** -- **the first unified multimodal CoT reward model**, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.
Please refer to the [README.md](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think) for training and inference details.
## 🏁 Compared with Current Reward Models
| Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding | CoT Reasoning |
| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| [PickScore](https://github.com/yuvalkirstain/PickScore) | Point | √ | | | | |
| [HPS](https://github.com/tgxs002/HPSv2) | Point | √ | | | | |
| [ImageReward](https://github.com/THUDM/ImageReward) | Point | √ | | | | |
| [LLaVA-Critic](https://huggingface.co/lmms-lab/llava-critic-7b) | Pair/Point | | √ | | | |
| [IXC-2.5-Reward](https://github.com/InternLM/InternLM-XComposer) | Pair/Point | | √ | | √ | |
| [VideoScore](https://github.com/TIGER-AI-Lab/VideoScore) | Point | | | √ | | |
| [LiFT](https://github.com/CodeGoat24/LiFT) | Point | | | √ | | |
| [VisionReward](https://github.com/THUDM/VisionReward) | Point | √ | | √ | | |
| [VideoReward](https://github.com/KwaiVGI/VideoAlign) | Point | | | √ | | |
| **UnifiedReward** (Ours) | Pair/Point | √ | √ | √ | √ | |
| **UnifiedReward-Think** (Ours) | Pair/Point | √ | √ | √ | √ | √ |
## 🔧 Environment Set Up
1. Clone this repository and navigate to the UnifiedReward folder:
```bash
git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward
```
2. Install the inference package:
```bash
conda create -n unifiedreward python=3.10 -y
conda activate unifiedreward
pip install --upgrade pip
pip install -e ".[train]"
pip install flash_attn==2.5.8 --no-build-isolation
```
## 🚀 Inference
For Qwen2.5-VL based UnifiedReward models, first install the inference dependencies:
```bash
pip install git+https://github.com/huggingface/transformers accelerate "qwen-vl-utils[decord]==0.0.8"
```
We provide reference pair-rank and point-score inference code for each task in the `./inference` and `./inference_qwen` directories:
```
inference
├── image_generation
│   ├── pair_rank_image_generation.py
│   └── point_score_image_generation.py
├── video_understanding
│   ├── pair_rank_video_understanding.py
│   └── point_score_video_understanding.py
└── ...
```
Note that our model is not constrained to a fixed input prompt style; you can flexibly adjust the inputs to your requirements.
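For reference, here is a minimal point-score sketch for a Qwen2.5-VL based checkpoint using the standard `transformers`/`qwen-vl-utils` chat pipeline. The model ID is taken from the releases above; the image path and prompt wording are illustrative assumptions, and the scripts in `./inference_qwen` remain the authoritative versions.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "CodeGoat24/UnifiedReward-2.0-qwen-7b"  # any Qwen2.5-VL based checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative prompt; the model is not tied to one fixed prompt style.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/generated_image.jpg"},
        {"type": "text", "text": (
            "You are given a text prompt and one generated image.\n"
            "Prompt: a red bicycle leaning against a brick wall.\n"
            "Rate how well the image matches the prompt on a 1-5 scale "
            "and briefly justify the score."
        )},
    ],
}]

# Standard Qwen-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```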
### 1. vLLM Inference
We provide vLLM inference code for UnifiedReward-qwen in the `vllm_qwen` directory.
1. Install vLLM
```bash
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
```
2. Deploy vLLM Server
```bash
bash vllm_qwen/vllm_server.sh
```
3. Inference Request to vLLM Server
```bash
python vllm_qwen/vllm_inference.py
```
### 2. SGLang Inference
We provide SGLang inference code for UnifiedReward-llava in the `sglang_llava` directory.
1. Install SGLang
```bash
pip install "sglang[all]"
```
2. Deploy SGLang Server
```bash
bash sglang_llava/sglang_server.sh
```
3. Inference Request to SGLang Server
```bash
python sglang_llava/sglang_inference.py
```
## 💻 Training UnifiedReward
### 1. Training based on Qwen2.5-VL-Instruct (Recommended)
We use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to train the SFT model.
1. Clone the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) repository and install the dependencies.
```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
Follow this [README](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md) (see the [multimodal image dataset demo](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_demo.json)) to prepare our released [datasets](https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede); a registration sketch is shown below.
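A minimal sketch, assuming LLaMA-Factory's ShareGPT-style multimodal format: a prepared dataset would be registered in `data/dataset_info.json` roughly as follows (the entry name and file name here are placeholders; follow the README above for the exact conventions).
```json
{
  "unifiedreward_sft": {
    "file_name": "unifiedreward_sft.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    }
  }
}
```
The `dataset` field in the training YAML below would then reference this entry name.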
2. Run the following command to train the SFT model.
```bash
llamafactory-cli train examples/train_full/qwen2_5vl_full_sft.yaml
```
### 2. Training based on LLaVA-Onevision
#### 2.1 Unified Preference Training Dataset Preparation
Please download our constructed unified preference dataset from [Huggingface](https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede) and put it in `./dataset/`.
```
dataset
├── EvalMuse
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── HPD
├── LiFT-HRA
├── LLaVA-Critic
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── OIP
├── ShareGPTVideo
│   ├── pairwise
│   ├── pointwise
│   └── ...
├── VideoDPO
├── VideoFeedback
└── train_data.yaml
```
#### 2.2 Training based on LLaVA-Onevision
```bash
bash train.sh
```
## ✨ Direct Preference Optimization
### 🎨 Image and Video Understanding DPO
#### 1. Construct Preference Data
The input data for preference construction should follow this structure:
```json
[
    {
        "prompt": "",
        "image": ""
    },
    ...
]
```
Then run:
```bash
# image understanding
cd preference_data_construction/image_understanding
python infer+sift.py  # fill in 'image_folder' and 'data_path' in this file

# video understanding
cd preference_data_construction/video_understanding
python infer+sift.py  # fill in 'image_folder' and 'data_path' in this file
```
#### 2. Training
The training data in `data.json` should follow this structure:
```json
[
    {
        "id": "",
        "image": "",
        "prompt": "",
        "chosen": "",
        "rejected": ""
    },
    ...
]
```
Then start training:
```bash
# image understanding
bash dpo_image_understand_ov7b.sh
# video understanding
bash dpo_video_understand_llava_video_7b.sh
```
### 🖼️ Image Generation DPO
#### 0. Prepare Environments
```bash
cd DiffusionDPO
conda create -n diffdpo python=3.10 -y
conda activate diffdpo
pip install -r requirements.txt
```
#### 1. Construct Preference Data
**Image generation.** The input data for preference construction should follow this structure:
```json
[
    {
        "prompt": ""
    },
    ...
]
```
Then run:
```bash
python data_generation.py  # fill in 'data_path' in this file
```
**Preference pair data construction**
```bash
python sift_dpo_data.py
```
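Conceptually, this sifting step turns reward scores into DPO pairs: for each prompt, the candidate images are scored with UnifiedReward, and the highest- and lowest-scoring ones become the chosen/rejected pair. Below is a minimal sketch of that selection logic; the function names and input format are assumptions for illustration, and `sift_dpo_data.py` is the actual implementation.
```python
from typing import Callable

def build_dpo_pairs(
    samples: list[dict],                    # [{"prompt": str, "images": [str, ...]}, ...]
    score_fn: Callable[[str, str], float],  # (prompt, image_path) -> reward score
) -> list[dict]:
    """Pick the best/worst image per prompt as a (chosen, rejected) pair."""
    pairs = []
    for sample in samples:
        # Sort candidates by their UnifiedReward score, ascending.
        scored = sorted(
            sample["images"],
            key=lambda img: score_fn(sample["prompt"], img),
        )
        if len(scored) < 2:
            continue  # need at least two candidates to form a pair
        pairs.append({
            "caption": sample["prompt"],
            "jpg_0": scored[-1],  # highest reward -> chosen
            "jpg_1": scored[0],   # lowest reward -> rejected
            "label_0": 1,         # jpg_0 is the preferred image
        })
    return pairs
```
The output matches the `data.json` training format shown in the next step.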
#### 2. Training
The training data in `data.json` should follow this structure:
```json
[
    {
        "id": "",
        "caption": "",
        "jpg_0": "",   // chosen image path
        "jpg_1": "",   // rejected image path
        "label_0": 1   // 1 means jpg_0 is the preferred image
    },
    ...
]
```
Then start training:
```bash
bash launchers/turbo_dpo.sh
```
### 🎬 Video Generation DPO
#### 0. Prepare Environments
```bash
cd VideoDPO
conda create -n videodpo python=3.10 -y
conda activate videodpo
pip install -r requirements.txt
```
Run the following commands to download the VideoCrafter2 checkpoint:
```bash
mkdir -p checkpoints/vc2
wget -P checkpoints/vc2 https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt
```
Please download our constructed T2V-Turbo model and its reference model from [Huggingface](https://huggingface.co/CodeGoat24/T2V-Turbo) and put them in `./checkpoints/t2v-turbo`.
#### 1. Construct Preference Data
**Video generation.** The input data for preference construction should follow this structure:
```json
[
    {
        "prompt": ""
    },
    ...
]
```
Then run:
```bash
bash data_generation.sh  # fill in '--prompts_file' in this script
```
**Preference pair data construction**
```bash
python sift_dpo_data.py
```
#### 2. Training
The training data in `data.json` should follow this structure:
```json
[
    {
        "id": "",
        "caption": "",
        "chosen": "",    // chosen video path
        "rejected": ""   // rejected video path
    },
    ...
]
```
Then start training:
```bash
bash run.sh
```
## 🚀 Evaluation
We provide evaluation code for several benchmarks in the `./benchmark_evaluation` directory.
### Reward Model
We provide evaluation code for [GenAI-Bench-Video](https://github.com/TIGER-AI-Lab/GenAI-Bench), [GenAI-Bench-Image](https://github.com/TIGER-AI-Lab/GenAI-Bench), [VideoGen-RewardBench](https://huggingface.co/datasets/KwaiVGI/VideoGen-RewardBench) and [VL-RewardBench](https://huggingface.co/datasets/MMInstruction/VL-RewardBench) benchmarks.
### Video Understanding
We provide evaluation code for the [MSRVTT](https://github.com/xudejing/video-question-answering), [MSVD](https://github.com/xudejing/video-question-answering), and [TGIF](https://github.com/YunseokJANG/tgif-qa) benchmarks, and use the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) toolkit with 64 input frames to evaluate the LongVideoBench, MLVU, and Video-MME benchmarks.
### Image Understanding
We use the [LMMs-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit to evaluate the LLaVABench, WildVision, LLaVABench-Wilder, LiveBench, and MMHal benchmarks.
### Image Generation
We use the image reward models [PickScore](https://github.com/yuvalkirstain/PickScore), [HPS](https://github.com/tgxs002/HPSv2), and [ImageReward](https://github.com/THUDM/ImageReward) for quality assessment.
### Video Generation
[VBench](https://github.com/Vchitect/VBench) is used for video generation assessment.
## 📧 Contact
If you have any comments or questions, please open a new issue or feel free to contact [Yibin Wang](https://codegoat24.github.io).
## 🤗 Acknowledgments
In this work, the reward model and the image/video understanding DPO code are based on [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT), while the image and video generation DPO code is based on [DiffusionDPO](https://github.com/SalesforceAIResearch/DiffusionDPO) and [VideoDPO](https://github.com/CIntellifusion/VideoDPO).
We also utilize [LMMs-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) toolkits for evaluation.
Thanks to all the contributors!
## ⭐ Citation
```bibtex
@article{unifiedreward-think,
title={Unified multimodal chain-of-thought reward model through reinforcement fine-tuning},
author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2505.03318},
year={2025}
}
```
```bibtex
@article{unifiedreward,
title={Unified reward model for multimodal understanding and generation},
author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2503.05236},
year={2025}
}
```