# GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
[Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)1\*, [Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)2\*, [Yuqing Wang](https://scholar.google.com/citations?user=QC7nNe0AAAAJ&hl=zh-CN&oi=ao)1\*, [Kun Wang]()3, [Linjiang Huang](https://leonhlj.github.io/)4, [Xingyu Zeng]️(), [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)2, [Xihui Liu](https://xh-liu.github.io/)1 :envelope:

1HKU MMLab, 2CUHK MMLab, 3SenseTime, 4Beihang University

\*Equal contribution, :envelope: Corresponding authors
 

  Paper •   Introduction •   Framework •   Key Features •   License •   Citation
## Introduction

Visual generation models have made remarkable progress but still struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. This limitation often stems from a direct mapping from text embeddings to visual features without explicit reasoning about the compositional structure of the scene.

We present **GoT-R1**, a framework that significantly enhances semantic-spatial reasoning in visual generation by applying reinforcement learning. Building upon the Generation Chain-of-Thought (GoT) approach, GoT-R1 enables models to autonomously discover effective reasoning strategies that go beyond predefined templates. This is achieved through a carefully designed dual-stage, multi-dimensional reward framework that leverages Multimodal Large Language Models (MLLMs) to evaluate both the intermediate reasoning process and the final visual output. Our reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified manner. Experimental results demonstrate significant improvements on benchmarks such as T2I-CompBench, particularly in compositional tasks requiring precise spatial relationships and attribute binding. GoT-R1 advances the state of the art by successfully transferring sophisticated reasoning capabilities to the visual generation domain.

GoT-R1 pioneers advancements in reasoning-driven visual generation by:

- **Enhanced Semantic-Spatial Reasoning**: Utilizes reinforcement learning to improve the model's ability to understand and plan complex scenes with accurate object attributes and spatial arrangements.
- **Autonomous Reasoning Chain Discovery**: Moves beyond fixed templates by allowing the model to autonomously explore and learn more effective reasoning chains.
- **Comprehensive MLLM-based Rewards**: Implements a novel dual-stage, multi-dimensional reward system for effective supervision across the entire generation pipeline.
## Released Models

| Model         | Link                                                        |
|---------------|-------------------------------------------------------------|
| **GoT-R1-1B** | [🤗 HuggingFace](https://huggingface.co/gogoduan/GoT-R1-1B) |
| **GoT-R1-7B** | [🤗 HuggingFace](https://huggingface.co/gogoduan/GoT-R1-7B) |

## Framework Overview

GoT-R1 builds upon the Generation Chain-of-Thought (GoT) framework by introducing reinforcement learning (RL) to refine the model's semantic-spatial reasoning capabilities. The base model is a unified MLLM architecture (e.g., Janus-Pro) that autoregressively generates a textual reasoning chain followed by image tokens. The RL process involves:

1. Sampling multiple reasoning chains (GoT) and corresponding images for a given prompt.
2. Evaluating these samples using our multi-dimensional MLLM-based reward model.
3. Updating the model parameters using Group Relative Policy Optimization (GRPO) to encourage high-reward reasoning and generation strategies.
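The three-step loop above hinges on GRPO's group-relative advantage: each sampled (reasoning chain, image) pair is scored against the other samples drawn for the *same* prompt, so no separate value network is needed. The sketch below illustrates only this normalization step; the reward values are a toy example, not outputs of the repository's actual reward model.

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage: normalize each sample's reward against the
    mean and standard deviation of all samples drawn for the same prompt,
    so the policy is pushed toward beating its own group average."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Toy example: total rewards for four samples generated from one prompt.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = grpo_advantages(rewards)
# Samples above the group mean receive a positive advantage (reinforced);
# samples below the mean receive a negative one (discouraged).
```

These advantages then weight the policy-gradient update over the sampled reasoning and image tokens.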
## Key Features

### MLLM-based Dual-stage Multi-dimensional Reward

A core innovation of GoT-R1 is its comprehensive reward framework, designed to address the unique challenges of applying RL to visual generation. This system evaluates both the intermediate reasoning process and the final image:

- **Prompt-to-Reasoning Semantic Reward ($R_{sem}$)**: Assesses whether the reasoning chain accurately captures all semantic elements (objects, attributes) from the prompt without contradiction, considering completeness, faithfulness, consistency, and clarity.
- **Prompt-to-Reasoning Spatial Reward ($R_{spa}$)**: Evaluates the correctness of the spatial arrangements planned in the reasoning chain relative to the prompt. To strengthen the MLLM's spatial evaluation, textual coordinates are rendered as bounding boxes on a blank canvas for visual assessment.
- **Reasoning-to-Image Reward ($R_{RI}$)**: Measures how faithfully the generated image reflects the planned reasoning, checking that objects appear at their specified locations via the IoU between planned and grounded bounding boxes.
- **Prompt-to-Image Reward ($R_{PI}$)**: Assesses the overall quality and compositional accuracy of the final generated image against the initial prompt.

These rewards are combined multiplicatively (with the two reasoning rewards summed), so a low score at any stage suppresses the total and holistic optimization is enforced:

$R_{total} = R_{PI} \cdot (R_{sem} + R_{spa}) \cdot R_{RI}$
## Usage

### Dependencies

- Python >= 3.8 (we recommend [Anaconda](https://www.anaconda.com/download/#linux))
- [PyTorch >= 2.0.1](https://pytorch.org/)
- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)

### Installation

Clone the repo and install the dependent packages:

```bash
git clone git@github.com:gogoduan/GoT-R1.git
cd GoT-R1
pip install -r requirements.txt
```

This installs PyTorch 2.0.1 built against CUDA 11.7. If you are using sm_90 GPUs such as the NVIDIA H100, install the CUDA 11.8 build instead.

### Model Weights

Download the released checkpoints and place them under `ckpts`. The expected directory structure is:

```
GoT-R1
├── ckpts
│   ├── GoT-R1-1B
│   ├── GoT-R1-7B
├── ...
```

### Inference

```bash
python infer.py --ckpt_path <path-to-checkpoint>
```

where `<path-to-checkpoint>` points to one of the downloaded models, e.g. `ckpts/GoT-R1-7B`.

## License

This code is released under the MIT License.

## Citation

If you find this work helpful, please consider citing our paper:

```
@article{duan2025got,
  title={GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning},
  author={Duan, Chengqi and Fang, Rongyao and Wang, Yuqing and Wang, Kun and Huang, Linjiang and Zeng, Xingyu and Li, Hongsheng and Liu, Xihui},
  journal={arXiv preprint arXiv:2505.17022},
  year={2025}
}
```

## Contact

If you have any questions, please raise an issue or contact us at [duancq24@connect.hku.hk](mailto:duancq24@connect.hku.hk).