This repository hosts the code and datasets for the Open RS project, accompanying the paper Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t. The project explores enhancing reasoning capabilities in small large language models (LLMs) using reinforcement learning (RL) under resource-constrained conditions.
We focus on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, trained on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. By adapting the Group Relative Policy Optimization (GRPO) algorithm and leveraging a curated, compact mathematical reasoning dataset, we conducted three experiments to assess performance and behavior. Key findings include:
Substantial gains in mathematical reasoning, with our best model surpassing o1-preview on AIME24. These results showcase RL-based fine-tuning as a cost-effective approach for small LLMs, making reasoning capabilities accessible in resource-limited settings. We open-source our code, models, and datasets to support further research.
Install uv for managing virtual environments:
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
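Optionally, verify that uv is on your PATH:
uv --version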
Set up a virtual environment with Python 3.11:
uv venv openr1 --python 3.11
source openr1/bin/activate
uv pip install --upgrade pip
export UV_LINK_MODE=copy
Install vLLM and FlashAttention:
uv pip install vllm==0.7.2
uv pip install setuptools
uv pip install flash-attn --no-build-isolation
Note: This installs PyTorch v2.5.1, which is required for vLLM compatibility. Using a different version may cause issues.
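To confirm the pinned version was installed (the expected output is 2.5.1):
python -c "import torch; print(torch.__version__)"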
Install additional dependencies based on your use case:
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
Log in to Hugging Face and Weights & Biases:
huggingface-cli login
wandb login
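For non-interactive setups (e.g., CI), both CLIs accept tokens directly; HF_TOKEN and WANDB_API_KEY here are placeholder variables you must set yourself:
huggingface-cli login --token "$HF_TOKEN"
wandb login "$WANDB_API_KEY"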
Ensure Git LFS is installed for model/dataset management:
git-lfs --version
If not installed:
sudo apt-get install git-lfs
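After installing the package, initialize Git LFS for your user account:
git lfs install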
Train models using a YAML config with 4 GPUs (set num_processes=3, since one GPU is reserved for vLLM generation):
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file recipes/accelerate_configs/zero2.yaml \
--num_processes=3 \
src/open_r1/grpo.py \
--config recipes/grpo.yaml
For Experiment 3, add the cosine_max_len parameter:
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file recipes/accelerate_configs/zero2.yaml \
--num_processes=3 \
src/open_r1/grpo.py \
--config recipes/grpo.yaml \
--cosine_max_len 3584
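Other values from recipes/grpo.yaml can likely be overridden the same way; --learning_rate below is only an illustration, so check the recipe for the actual keys and defaults:
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=3 \
    src/open_r1/grpo.py \
    --config recipes/grpo.yaml \
    --learning_rate 1.0e-06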
Evaluate models using lighteval with the custom tasks defined in src/open_r1/evaluate.py. For single-GPU setups:
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL
# Example: AIME 2024
TASK=aime24
lighteval vllm "$MODEL_ARGS" "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir "$OUTPUT_DIR"
Important: Set max_model_length=32768 to match max_new_tokens, or lighteval will fail.
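Other benchmarks follow the same pattern by changing TASK; math_500 below is an assumed task name based on similar open-r1 setups, so confirm it against src/open_r1/evaluate.py:
TASK=math_500
lighteval vllm "$MODEL_ARGS" "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"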
For multi-GPU evaluation with data parallelism:
NUM_GPUS=4
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL
lighteval vllm "$MODEL_ARGS" "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir "$OUTPUT_DIR"
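If a larger model does not fit on a single GPU, tensor parallelism can replace data parallelism; tensor_parallel_size is a standard vLLM argument, though you should confirm your lighteval version forwards it:
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"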
Alternatively, use the evaluation script:
sh eval.sh
Modify tasks in eval.sh
(line 8) as needed.
Our best model reaches 46.7% on AIME24, surpassing o1-preview (44.6%). Our approach uses 7,000 samples (42,000 total outputs) and costs ~$42 on 4x A40 GPUs in 24 hours, compared to:
7B models: Qwen2.5-7B-SimpleRL ($1,633), Eurus-2-7B-PRIME ($1,088)
1.5B models: DeepScaleR-1.5B-Preview ($3,629), Still-3-1.5B-Preview ($2,268)
Thanks to the Hugging Face team for their open-r1 project.
Coming soon.