# ReWatch-R1

**Repository Path**: alibaba/ReWatch-R1

## Basic Information

- **Project Name**: ReWatch-R1
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-21
- **Last Updated**: 2026-01-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ReWatch-R1

[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2509.23652)
[![Project Page](https://img.shields.io/badge/GitHub-Project%20Page-blue)](https://rewatch-r1.github.io/)
[![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://www.modelscope.cn/models/zcccccz/ReWatch-R1)
[![Dataset](https://img.shields.io/badge/HuggingFace-Dataset-yellow)](https://www.modelscope.cn/datasets/zcccccz/ReWatch)

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

This is the official code used to train ReWatch-R1. Note that this repository contains only the reinforcement learning part.

## Using ReWatch-R1 for Inference

Use our model for video reasoning! Please use `transformers==4.56.0` and `qwen_vl_utils`. \
Please download our model [ReWatch-R1](https://www.modelscope.cn/models/zcccccz/ReWatch-R1). \
We recommend using the video parameters from the paper (up to 192 frames, with a resolution of 128\*28\*28 pixels per frame). \
For best results, you must provide the video duration in the prompt (for example, 00:00-10:00), and timestamps should be in MM\:SS format.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor
model_path = "ReWatch-R1"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
    padding_side="left",
    truncation_side="right",
)

# Input video, its duration in seconds, and the question
video_path = "videos/example.mp4"
video_duration = 600
question = "What happened from [05:00] to [05:10]?"

# Video sampling parameters from the paper: up to 192 frames, 128*28*28 pixels per frame
total_pixels = 12288 * 28 * 28
min_pixels = 128 * 28 * 28
max_pixels = 128 * 28 * 28
fps = 2.0
max_frames = 192
video_config = {
    "type": "video",
    "video": video_path,
    "total_pixels": total_pixels,
    "min_pixels": min_pixels,
    "max_pixels": max_pixels,
    "fps": fps,
    "max_frames": max_frames,
}

react_prompt = """You are a video understanding expert. You are given a video and a question. You need to answer the question based on the video content. Please answer the question step by step. When you need more video details, you will re-watch the relevant clips and use and to mark the actions, and use and to mark the visual details you observe. When you have enough information to determine the final answer, you will wrap the final answer in and .

**Video Information and Question:**
- **Video Duration**: {video_duration}
- **Question**: {question}"""


def seconds_to_timestamp(seconds):
    """Convert a number of seconds to a timestamp string (MM:SS)."""
    minutes = seconds // 60
    seconds = seconds % 60
    return f"{minutes:02d}:{seconds:02d}"


# Fill the video duration and question into the prompt template
duration_str = f"00:00-{seconds_to_timestamp(video_duration)}"
instruction = react_prompt.format(video_duration=duration_str, question=question)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        video_config,
        {"type": "text", "text": instruction},
    ]},
]

# Apply the chat template and preprocess the video frames
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    max_length=16384,
    truncation=True,
    do_sample_frames=False,
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Generate and decode only the newly generated tokens
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096, use_cache=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Quick Start for RL Training

Please follow the steps below to start your video RL training!

### Prepare the data and model

First, download our cold-started model [ReWatch-R1-SFT](https://www.modelscope.cn/models/zcccccz/ReWatch-R1-SFT). \
Second, download our [QA](https://www.modelscope.cn/datasets/zcccccz/ReWatch) data and [Caption](https://www.modelscope.cn/datasets/zcccccz/ReWatch) data. \
Then prepare the QA data in the following format: the training set is a JSONL file, and each line is a JSON object in one of the two formats below (a minimal generation sketch follows the examples).

- multiple-choice format

```json
{
    "problem_id": "1Yc9DM8j378.mp4_temporal_localization_multiple_choice",
    "question_type": "temporal_localization",
    "multiple_choice": true,
    "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?\nA: A beaded necklace\nB: A large gold medallion\nC: A leather wristband\nD: A silver chain",
    "data_type": "video",
    "videos": "1Yc9DM8j378.mp4",
    "answer": "D",
    "answer_str": "A silver chain",
    "duration": 972,
    "duration_str": "00:00-16:12"
}
```

- open-ended format

```json
{
    "problem_id": "1Yc9DM8j378.mp4_temporal_localization_open_end",
    "question_type": "temporal_localization",
    "multiple_choice": false,
    "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?",
    "data_type": "video",
    "videos": "1Yc9DM8j378.mp4",
    "answer": "A silver chain",
    "answer_str": "A silver chain",
    "duration": 972,
    "duration_str": "00:00-16:12"
}
```
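These records can also be produced programmatically. Below is a minimal sketch of one way to assemble and write such entries, assuming you already have the question, answer, and video duration in seconds for each clip; the helper names (`build_qa_record`, `write_jsonl`, `seconds_to_duration_str`) and the output path `data/rewatch_qa.jsonl` are illustrative, not part of this repository.

```python
import json


def seconds_to_duration_str(seconds: int) -> str:
    """Format a duration in seconds as the "00:00-MM:SS" string used for duration_str."""
    return f"00:00-{seconds // 60:02d}:{seconds % 60:02d}"


def build_qa_record(video, question_type, problem, answer, answer_str, duration, multiple_choice):
    """Assemble one training example with the fields shown in the two examples above."""
    suffix = "multiple_choice" if multiple_choice else "open_end"
    return {
        "problem_id": f"{video}_{question_type}_{suffix}",
        "question_type": question_type,
        "multiple_choice": multiple_choice,
        "problem": problem,
        "data_type": "video",
        "videos": video,
        "answer": answer,          # option letter for multiple choice, full answer text otherwise
        "answer_str": answer_str,  # the answer spelled out in natural language
        "duration": duration,      # duration in seconds
        "duration_str": seconds_to_duration_str(duration),
    }


def write_jsonl(records, path):
    """Write one JSON object per line, as expected for the training set."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example: rebuild the open-ended record shown above from its raw fields.
record = build_qa_record(
    video="1Yc9DM8j378.mp4",
    question_type="temporal_localization",
    problem="At 01:40, what specific piece of jewelry is Man 2 described as wearing, "
            "a detail consistent with Abram's later appearance in the bar?",
    answer="A silver chain",
    answer_str="A silver chain",
    duration=972,
    multiple_choice=False,
)
write_jsonl([record], "data/rewatch_qa.jsonl")
```

Note that `duration_str` spans the whole video ("00:00-16:12" for a 972-second clip), matching the duration format the model expects in its prompt.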
### Modify the training configuration

This is the most crucial step. Please modify the configuration file at `configs/config.yaml` by updating the data path and model path to your local paths. Please modify all the places marked "TODO:".

If you want to customize your data loading logic, you can further modify the `RLHFDataset` class in the `verl/utils/dataset.py` file.

### Training

Then run the following script to start the RL training!

#### Single node

```bash
bash scripts/train_single_node.sh
```

#### Multi-node

```bash
bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES
```

where `TRAIN_SCRIPT` is the single-node training script and `NNODES` is the number of nodes required. For example,

```bash
bash scripts/srun_multi_nodes.sh scripts/train_single_node.sh 2
```

## Acknowledgement

- [Long-RL](https://github.com/NVlabs/Long-RL): the codebase we built upon. Thanks for their wonderful work.

## Citation

```bibtex
@misc{zhang2025rewatchr1boostingcomplexvideo,
      title={ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis},
      author={Congzhi Zhang and Zhibin Wang and Yinchao Ma and Jiawei Peng and Yihan Wang and Qiang Zhou and Jun Song and Bo Zheng},
      year={2025},
      eprint={2509.23652},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.23652},
}
```