# X-R1

![x-r1-logo](./README.assets/X-R1-log.png)

X-R1 aims to build an easy-to-use, low-cost training framework based on end-to-end reinforcement learning to accelerate the development of scaling post-training.

Inspired by [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) and [open-r1](https://github.com/huggingface/open-r1), we provide a minimal-cost recipe for training a 0.5B R1-Zero "Aha Moment" 💡 from a base model.

## Features

- 🔥 Training with LoRA
- 4x3090/4090 GPUs, ~1 hour of training, 💰 cost < 7 dollars; "Aha Moment" 💡 appears within 10 min (37 steps)
- RL training at 0.5B model scale
- Support for BIGGER models: 1.5B/7B/32B...
- 0.75k/1.5k/7.5k datasets supplied for fast training loops
- GRPO online sampling data logged to file

## News

- 2025.02.18: Support LoRA + ZeRO-3, medical data and LLM-as-a-reward; add MATH-500 benchmark evaluation results.
- 2025.02.16: Support LoRA.
- 2025.02.15: Release Chinese training.
- 2025.02.13: Release X-R1-3B, which follows the format better; Colab inference.
- 2025.02.12: Release X-R1-1.5B config/wandb/model/log.
- 2025.02.12: Release X-R1 first version.

## Results

### Overview

Running scripts:

```bash
bash ./scripts/run_x_r1_zero.sh
```

We share training details (config/wandb/model/log) as well as evaluation results:

📈 [wandb details](https://api.wandb.ai/links/xiaodonggua/eb471rlw) | 🔥 [Colab Inference](https://colab.research.google.com/drive/1TxjJ-M9J2lLW3zcKr7oeER3snXe0oWo4#scrollTo=VnkmSMGwZOhI) | 🤗 [Models](https://huggingface.co/xiaodongguaAIGC)

We have confirmed the effectiveness of the X-R1 RL-Zero training method on `0.5B/1.5B/3B-Base` models. Even without SFT, reinforcement learning **incentivizes** the model's reasoning abilities and format-following capabilities, and the experimental results of X-R1 are very encouraging.
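For intuition about what "incentivizing format-following" means in practice, R1-Zero-style recipes typically combine simple rule-based rewards: a format reward that checks the completion template, and an accuracy reward that compares the extracted answer against the ground truth. The snippet below is a minimal illustrative sketch assuming a `<think>...</think><answer>...</answer>` completion format; the function names and exact matching rules are our assumptions, not X-R1's actual reward code.

```python
import re

# Hypothetical R1-Zero-style rule-based rewards; illustrative only,
# not X-R1's exact implementation.
THINK_ANSWER = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<answer> template, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted <answer> matches the reference answer, else 0.0."""
    m = THINK_ANSWER.match(completion.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

if __name__ == "__main__":
    sample = "<think>2 + 2 = 4</think><answer>4</answer>"
    print(format_reward(sample), accuracy_reward(sample, "4"))  # 1.0 1.0
```

In GRPO, per-completion scores like these are what the group-relative advantages are computed from, which is why such simple rules are enough to shape both reasoning and output format.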
![X-R1-base-result-curves](./README.assets/X-R1-base-result-curves.png)

Training config:

| Model | 0.5B | 1.5B | 3B | 7B |
| --- | --- | --- | --- | --- |
| TargetModel | [X-R1-0.5B](https://huggingface.co/xiaodongguaAIGC/X-R1-0.5B) | [X-R1-1.5B](https://huggingface.co/xiaodongguaAIGC/X-R1-1.5B) | [X-R1-3B](https://huggingface.co/xiaodongguaAIGC/X-R1-3B) | |
| Log | [[link]](https://drive.google.com/file/d/1m-w0B2L9o-bwGDgaOtWFLR0C0MAEBTFQ/view?usp=sharing) | [[link]](https://drive.google.com/file/d/11tBShY206Pu_SxWE0M-mG2_Cdf9mFNig/view?usp=sharing) | [[link]](https://drive.google.com/file/d/1t4WzsK0aMrULYKjKsKH29LsWQMeTDjTb/view?usp=sharing) | |
| GPU | 4x3090 | 4x3090 | 4x3090 | |
| Base | Qwen/Qwen2.5-0.5B | Qwen/Qwen2.5-1.5B | Qwen/Qwen2.5-3B | |
| Dataset | X-R1-750 | X-R1-750 | X-R1-750 | |
| Config: recipes | X_R1_zero_0dot5B_config.yaml | X_R1_zero_1dot5B_config.yaml | X_R1_zero_3B_config.yaml | |
| num_generations | 16 | 8 | 4 | |
| max_completion_length | 1024 | 1024 | 1024 | |
| num_train_epochs | 3 | 3 | 3 | |
| Times | 1:14:10 | 1:59:06 | 2:23:06 | |

### Example: 0.5B R1-Zero

0.5B model, 4x3090. If you have 4 GPUs, set `--num_processes=3`: one GPU deploys vLLM as the online inference engine for faster GRPO sampling, while the other three run training.

Example: 4x4090, 3 epochs, training time ~1h20min:

```shell
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_config.yaml \
> ./output/x_r1_0dot5B_sampling.log 2>&1
```

Tip: use `--config recipes/X_R1_zero_3B_config.yaml` for better learning of reasoning and format.

#### Aha Moment

*"**Wait**, that doesn't match either of our options. It seems like I made a **mistake** in my **assumptions**. **Let's go back** to the original equations."*

![aha_moment](./README.assets/aha_moment_0.5B.png)

#### Benchmark Evaluation

We use vLLM as the backend for benchmark evaluation. The script outputs an accuracy metric, a format metric, and a JSON results file.

```bash
CUDA_VISIBLE_DEVICES=0,1 python ./src/x_r1/benchmark.py \
    --model_name='xiaodongguaAIGC/X-R1-0.5B' \
    --dataset_name='HuggingFaceH4/MATH-500' \
    --output_name='./output/result_benchmark_math500' \
    --max_output_tokens=1024 \
    --num_gpus=2
```

### Example: Chinese Math Reasoning

X-R1 supports Chinese math reasoning, and it is easy to produce a Chinese "Aha Moment", as follows:

```bash
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/examples/mathcn_zero_3B_config.yaml \
> ./output/mathcn_3B_sampling.log 2>&1
```

#### Reward Curve

X-R1 trains the 3B base model on 7.5k Chinese math problems using 4x3090 GPUs in ~16h.

![X-R1-math-cn-curve](./README.assets/X-R1-math-cn-curve.png)

#### Chinese Aha Moment

[X-R1-3B-CN](https://huggingface.co/xiaodongguaAIGC/X-R1-0.5B-CN) training [log](https://drive.google.com/file/d/1dPex_uiZ-4Lj2Jv8G8SWw6z0OsNSqLLM/view?usp=sharing) — we track the "Aha Moment":

![X-R1-Math-cn-AhaMoment-1](./README.assets/X-R1-Math-cn-AhaMoment-1.png)

![X-R1-Math-cn-AhaMoment-2](./README.assets/X-R1-Math-cn-AhaMoment-2.png)

### Example: GRPO + LoRA

1. Multi-GPU run:

   ```bash
   ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/zero3.yaml --num_processes=3 src/x_r1/grpo.py --config recipes/examples/X_R1_zero_7B_peft_usevllm_config.yaml > ./output/test_7b_lora_sampling.log 2>&1
   ```

2. Single-GPU (3090) 7B LoRA training run:

   ```bash
   ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/zero3.yaml --num_processes=1 src/x_r1/grpo.py --config recipes/examples/X_R1_zero_7B_peft_novllm_config.yaml > ./output/test_7b_lora_sampling.log 2>&1
   ```

### Example: GRPO without KL

Set the KL term to `beta: 0.0` and drop the `ref_model`, for a ~20% performance improvement:

```bash
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_1dot5B_noKL_config.yaml \
> ./output/test_1dot5B_sampling.log 2>&1
```

## Installation

### conda & pip

Required: CUDA >= 12.4

```bash
conda create -n xr1 python=3.11
conda activate xr1
```

and

```bash
pip install -r requirements.txt
pip install flash-attn
```

### Quick Start

To test the environment:

```bash
mkdir output
```

[Option] Single GPU with LoRA:

```shell
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero1.yaml \
--num_processes=1 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_peft_config.yaml \
> ./output/x_r1_test_sampling.log 2>&1
```

[Option] Multi-GPU:

```shell
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/accelerate_configs/zero3.yaml \
--num_processes=1 \
src/x_r1/grpo.py \
--config recipes/x_r1_test_sampling.yaml \
> ./output/test.log 2>&1
```

Then check the log file: `./output/test.log`

## Q & A

### How to set the correct batch_size and num_generations

Suppose we have 4 GPUs (1 vLLM + 3 training) and the config is:

```yaml
per_device_train_batch_size: 1
num_generations: 4
```

Running with `--num_processes=3` raises:

```text
ValueError: The global train batch size (3 x 1) must be evenly divisible by the number of generations per prompt (4). Given the current train batch size, the valid values for the number of generations are: [3].
```

The constraint is: (`per_device_train_batch_size` * `num_processes`) % `num_generations` == 0

So we should set:

```yaml
# example 1
num_processes: 3
per_device_train_batch_size: 1
num_generations: 3 # 1 * 3 % 3 == 0

# example 2
num_processes: 3
per_device_train_batch_size: 4
num_generations: 6 # 4 * 3 % 6 == 0
```

If you have 8 GPUs (1 vLLM + 7 training):

```yaml
num_processes: 7
per_device_train_batch_size: 4
num_generations: 14 # 4 * 7 % 14 == 0
```
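Before launching, you can sanity-check a config against this constraint with a few lines of Python. This is a standalone sketch; the helper name is ours, not part of X-R1.

```python
# Standalone sanity check for the GRPO batch constraint described above;
# the helper name is illustrative, not part of X-R1.
def valid_num_generations(per_device_train_batch_size: int, num_processes: int) -> list[int]:
    """Return the num_generations values that evenly divide the global train batch."""
    global_batch = per_device_train_batch_size * num_processes
    return [g for g in range(2, global_batch + 1) if global_batch % g == 0]

print(valid_num_generations(1, 3))  # [3] -> matches the ValueError hint above
print(valid_num_generations(4, 3))  # [2, 3, 4, 6, 12]
print(valid_num_generations(4, 7))  # [2, 4, 7, 14, 28]
```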
## Todo

- Support QLoRA GRPO training
- Release 7B config/results
- Add more rule-based rewards
- Support more base models
- Add benchmark evaluation results

## About

If you have any suggestions, please contact: dhcode95@gmail.com

## Acknowledgements

[Open-R1](https://github.com/huggingface/open-r1), [TRL](https://github.com/huggingface/trl)

## Citation

```bib
@misc{deng2025xr1,
  author = {hang deng},
  title = {X-R1},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/dhcode-cpp/X-R1}},
  year = {2025},
}
```