# RAGEN
**Repository Path**: gapyanpeng/RAGEN
## Basic Information
- **Project Name**: RAGEN
- **Description**: https://github.com/ZihanWang314/RAGEN.git
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## README
# RAGEN: A General-Purpose Reasoning Agent Training Framework
RAGEN is the first reproduction of the DeepSeek-R1(-Zero) methods for training agentic models.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
## Framework
Figure: Rollout and Update Pipeline
### Rollout Phase
During rollout, there are two types of tokens:
* **Environment tokens** (shown in blue): Generated by the simulator/env, including states $s$ and rewards $r$
* **LLM-generated tokens** (shown in red): Including both thinking tokens $t$ and action tokens $a$
The input is a sequence $s_0, A_0, r_0, \dots, s_t$, and the output is $A_t$, which contains both thinking $t_t$ and answer $a_t$; only $a_t$ is sent to the simulator. While the LLM could in principle keep generating the rest of the trajectory given the current state and the history so far, we force a truncation after the first generated answer.
The process flow is as follows (a minimal code sketch follows the list):
* Given $s_0, A_0, r_0, s_1, \dots, s_t$, the LLM tries to generate $A_t, s_{t+1}, \dots, s_k$
* A forced truncation is performed to keep only $A_t$, which contains reasoning (`...`) and answer (`...`)
* $a_t$ is extracted from $A_t$ and fed into the simulator to obtain $r_t$ and $s_{t+1}$
* $A_t$, $r_t$, and $s_{t+1}$ are appended to the existing trajectory to form the new input
* After $k$ rounds of rollout, we obtain the sequence $s_0, A_0, r_0, s_1, \dots, s_k$ to train the model
* Rollouts are generated in batches
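As a concrete illustration, here is a minimal sketch of this multi-turn rollout loop in Python. The function names (`llm_generate`, `env_step`) and the `<answer>...</answer>` markers are illustrative assumptions, not RAGEN's actual API or prompt format.
```python
# Illustrative sketch of the rollout loop described above (not RAGEN's actual code).
import re

def rollout(llm_generate, env_step, s0, k):
    """llm_generate and env_step are assumed callables provided by the caller."""
    trajectory = [s0]  # running context: s_0, A_0, r_0, s_1, ...
    for _ in range(k):
        # The LLM may keep generating past its answer (e.g. predicted future states),
        # so we force-truncate right after the first complete answer.
        generated = llm_generate("".join(trajectory))
        cut = re.search(r".*?</answer>", generated, flags=re.DOTALL)
        A_t = cut.group(0) if cut else generated

        # Extract only the action a_t from A_t and send it to the simulator.
        ans = re.search(r"<answer>(.*?)</answer>", A_t, flags=re.DOTALL)
        a_t = ans.group(1).strip() if ans else ""
        r_t, s_next = env_step(a_t)

        # Append A_t, r_t, and s_{t+1} to form the next round's input.
        trajectory += [A_t, str(r_t), s_next]
    return trajectory  # s_0, A_0, r_0, s_1, ..., s_k, used for training
```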
### Update Phase
During the update phase (a reward-parsing sketch follows the list):
* Compute and backpropagate the loss for the LLM-generated tokens only
* Reward calculation: parse $r_0, \dots, r_{k-1}$ from the trajectory tokens using regex-based rules
* Final reward computation: $r = \sum_{i=0}^{k-1} r_i$ for each generated rollout
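As a rough illustration of the reward bookkeeping, the per-turn rewards can be parsed out of the trajectory text with a regular expression and summed per rollout. The `<reward>...</reward>` marker below is a hypothetical format chosen for illustration; RAGEN's actual trajectory layout and regex rules may differ.
```python
# Hypothetical sketch: recover per-turn rewards from a trajectory string and sum them.
import re

def trajectory_return(trajectory_text: str) -> float:
    """Sum the per-turn rewards r_0, ..., r_{k-1} parsed from one rollout's text."""
    rewards = [float(x) for x in re.findall(r"<reward>(-?\d+(?:\.\d+)?)</reward>", trajectory_text)]
    return sum(rewards)  # r = r_0 + ... + r_{k-1}

# Example: two turns with rewards 0.9 and 10.0 -> total reward 10.9
print(trajectory_return("... <reward>0.9</reward> ... <reward>10.0</reward> ..."))
```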
### Benefits
* **Unified Multi-round Processing**: Keeps batch sizes stable by handling all rollout rounds within a single trajectory rather than creating new instances per round
* **World Modeling**: Potentially enables world modeling (state and reward prediction), which helps the LLM agent plan
## Performance
We run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B, on the [Gym-Sokoban](https://github.com/mpSchrader/gym-sokoban) task.
About the Sokoban task (from the official repo): Sokoban is Japanese for "warehouse keeper" and is a traditional video game. The game is a transportation puzzle in which the player has to push all boxes in the room onto the storage locations/targets. The possibility of making irreversible mistakes makes these puzzles especially challenging for reinforcement learning algorithms, which mostly lack the ability to think ahead.
NOTE: See the [Visualization](https://github.com/ZihanWang314/ragen/#visualization) section for details. The maximum reward of this environment is **10.9**. The action space is 0-4 (0: Stand, 1: Up, 2: Down, 3: Left, 4: Right).
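For reference, this action space can be kept as a simple lookup table, e.g. to validate a parsed answer before stepping the environment. The snippet below is a sketch based only on the mapping described in this README, not RAGEN's actual code:
```python
# Action ids exactly as described above; not necessarily gym-sokoban's native action space.
ACTIONS = {0: "Stand", 1: "Up", 2: "Down", 3: "Left", 4: "Right"}
MAX_EPISODE_REWARD = 10.9  # maximum reward attainable in this environment setup

def parse_action(answer: str) -> int:
    """Map a model answer such as '1' or 'Up' to a valid action id (hypothetical helper)."""
    answer = answer.strip()
    if answer.isdigit() and int(answer) in ACTIONS:
        return int(answer)
    by_name = {name.lower(): idx for idx, name in ACTIONS.items()}
    return by_name.get(answer.lower(), 0)  # fall back to Stand (0) on invalid output

print(parse_action("Up"), parse_action("3"), parse_action("jump"))  # 1 3 0
```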
The loss curves have not converged yet (our compute is currently limited), but we already see some trends:
- Instruction-finetuned models are not significantly ahead of pretrained-only models, although they do start out better.
- 3B models perform better than 0.5B models, but the advantage is also not that obvious at around 40 steps.
- Interestingly, the R1-distilled 1.5B model currently does worse than the 0.5B models.
We plan to release a complete wandb report for these experiment runs. You can also try it yourself, and your runs may even be faster than ours (for the reasons above).
## Environment Setup
```bash
conda create -n ragen python=3.9 -y
conda activate ragen
git clone git@github.com:ZihanWang314/ragen.git
cd ragen
# setup install
pip install -e . # includes verl-ragen (by us) and verl-core (by the verl team)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# Optional: to install flash-attn, you may need to install the CUDA toolkit first if you don't have it
conda install -c "nvidia/label/cuda-12.1.0" cuda-toolkit -y
export CUDA_HOME=$CONDA_PREFIX # e.g. /opt/conda/envs/ragen
pip3 install flash-attn --no-build-isolation
pip install -r requirements.txt # other packages
```
## Train Models
### Create data
On the [Gym-Sokoban](https://github.com/mpSchrader/gym-sokoban) task, we create 10k first-round observations for training and run for at most 1 epoch.
```bash
# sokoban env settings. will determine game difficulty
# it's normal to see some SOKOBAN errors, but the data will be created and it's fine
export DIM_X=6
export DIM_Y=6
export NUM_BOXES=1
export MAX_STEPS=5
export SEARCH_DEPTH=30
python scripts/dataset_curation.py \
    --output data/sokoban \
    --seed 10000 \
    --train_size 10000 \
    --test_size 10 \
    --prefix qwen-instruct # we find this prefix also works for base models
```
### Export variables and train
```bash
export DATA_DIR=data/sokoban
export DIM_X=6
export DIM_Y=6
export NUM_BOXES=1
export MAX_STEPS=5
export SEARCH_DEPTH=30
# export CUDA_VISIBLE_DEVICES=0
# export BASE_MODEL=Qwen/Qwen2.5-0.5B
# export EXPERIMENT_NAME=test-qwen2.5-0.5b
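# BASE_MODEL can also point to a locally saved actor checkpoint, e.g.: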
export CUDA_VISIBLE_DEVICES=0
export BASE_MODEL=checkpoints/Agent-R1/test-qwen2.5-0.5b-instruct-1mbsz/actor/global_step_100
export EXPERIMENT_NAME=test-qwen2.5-0.5b-imagetest
export MICRO_BATCH_SIZE=1
export TRAIN_BATCH_SIZE=128 # 256
export PPO_BATCH_SIZE=64 # 128
export MAX_START_LENGTH=400 # the first round prompt max length
export MAX_RESPONSE_LENGTH=100
export MAX_OBS_LENGTH=120
export MAX_TURNS=5
export NUM_UPDATE_PER_ROLL=1 # roll out a batch, then the model performs N updates. Currently not implemented.
export LOG_MODE="['wandb']" # or 'console'
export GCP=True # gradient checkpointing
export N_GPUS=1
export ROLLOUT_TP_SIZE=1
bash ./train.sh # more arguments in this file
# default config file is verl/trainer/config/ppo_trainer.yaml
```
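As a rough sanity check on these length settings, the maximum token budget per trajectory can be estimated by assuming each turn appends one response and one observation to the first-round prompt. This is only an approximation; the exact accounting inside verl/RAGEN may differ.
```python
# Rough upper bound on the token budget per trajectory implied by the settings above
# (an approximation; the exact accounting in verl/RAGEN may differ).
MAX_START_LENGTH = 400     # first-round prompt
MAX_RESPONSE_LENGTH = 100  # LLM output per turn
MAX_OBS_LENGTH = 120       # environment observation appended per turn
MAX_TURNS = 5

max_total_tokens = MAX_START_LENGTH + MAX_TURNS * (MAX_RESPONSE_LENGTH + MAX_OBS_LENGTH)
print(max_total_tokens)  # 1500
```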
## Visualization
1. By setting arguments in `train.sh`, you can visualize the trajectory:
```bash
logging.log_images=True # set to True to log images
logging.log_image_dir=.log.debug/trajectory # set to the directory to save images
logging.log_image_step_size=1 # save image every _ steps
logging.log_n_image_per_batch=8 # save _ images per batch
```
2. You may also need to install fonts so that the figures display correctly:
```bash
sudo apt-get install fonts-noto-cjk
```
3. Example image for one trajectory:
## Cases
Please see the `cases/` directory.
There are only a few cases for now, including [reward hacking](https://github.com/ZihanWang314/agent-r1/blob/main/cases/reward_hacking.txt) and the [suck moment](https://github.com/ZihanWang314/agent-r1/blob/main/cases/suck_moment.txt). We will add more cases soon.
## Feedback
We welcome all sorts of feedback! Whether you have found a bug or have specific questions or suggestions about the project, please raise an issue, so our team members don't end up answering the same questions multiple times and the community can grow more productively and efficiently. Cheers!
## Authors
[Zihan Wang*](https://zihanwang314.github.io/)
[Kangrui Wang](https://jameskrw.github.io/)
[Qineng Wang](https://qinengwang-aiden.github.io/)
[Pingyue Zhang](https://williamzhangsjtu.github.io/)
[Manling Li†](https://limanling.github.io)
*: Project Lead; †: Advising.
Remaining authors are listed in alphabetical order.
## Acknowledgements
We thank [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1) for providing the DeepSeek-R1 model and ideas. We thank the [veRL](https://github.com/volcengine/verl) team for their infrastructure. We thank the [TinyZero](https://github.com/Jiayi-Pan/TinyZero) team for their discoveries that inspired our early exploration. We thank Yiping Lu, Runxin Xu, and Kyunghyun Cho for insightful discussions.
## Citation
```bibtex
@misc{RAGEN,
  author = {Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Manling Li},
  title = {RAGEN: A General-Purpose Reasoning Agent Training Framework},
  year = {2025},
  organization = {GitHub},
  url = {https://github.com/ZihanWang314/ragen},
}
```