# RAGEN

**Repository Path**: gapyanpeng/RAGEN

## Basic Information

- **Project Name**: RAGEN
- **Description**: https://github.com/ZihanWang314/RAGEN.git
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-01-31
- **Last Updated**: 2025-01-31

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# RAGEN: A General-Purpose Reasoning Agent Training Framework
RAGEN is the first reproduction of the DeepSeek-R1(-Zero) methods for training agentic models.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
*Figure: Rollout and Update Pipeline*
### Rollout Phase

During rollout, there are two types of tokens:

* **Environment tokens** (shown in blue): generated by the simulator/env, including states $s$ and rewards $r$.
* **LLM-generated tokens** (shown in red): including both thinking tokens $t$ and action tokens $a$.

The input consists of the sequence $s_0, A_0, r_0, \dots, s_t$, and the output is $A_t$, which contains both the thinking $t_t$ and the answer $a_t$; only $a_t$ is sent to the simulator. While the LLM could potentially generate the entire trajectory given the current state and the trajectory history, we implement a forced truncation after the first generated answer. The process flow is as follows:

* Given $s_0, A_0, r_0, s_1, \dots, s_t$, the LLM tries to generate $A_t, s_{t+1}, \dots, s_k$.
* A forced truncation is performed to keep only $A_t$, which contains the reasoning $t_t$ and the answer $a_t$.
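The forced-truncation step above can be sketched as a small post-processing pass on the raw generation. The `<think>`/`<answer>` tag names below are illustrative assumptions for this sketch, not necessarily the exact markers RAGEN uses:

```python
import re

def truncate_to_first_action(generation: str) -> str:
    """Keep only the first A_t (reasoning + answer); drop any hallucinated
    future states s_{t+1}..s_k the model may have generated after it.

    Assumes illustrative <think>...</think><answer>...</answer> markers.
    """
    match = re.search(r"</answer>", generation)
    if match is None:
        return generation  # no complete answer found; keep as-is
    return generation[: match.end()]

def extract_action(generation: str) -> str:
    """Extract only a_t, the part that would be sent to the simulator."""
    m = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    return m.group(1).strip() if m else ""
```

For example, if the model over-generates `"<think>push box left</think><answer>Left</answer> s_2: ..."`, truncation keeps everything up to `</answer>`, and only `Left` goes to the simulator.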
The loss curves have not converged yet (our compute is currently limited), but we already see some trends:
- Instruct-finetuned models do not hold a significant advantage over pretrained-only models, although they are better at the start.
- 3B models also perform better than 0.5B models, but the advantage is not yet obvious at around 40 steps.
- Interestingly, the R1-distilled 1.5B model currently performs worse than the 0.5B models.

We plan to release complete wandb plots for these experiment runs; you can also try it yourself, and your run may even be faster than ours (for the reasons above).
## Environment Setup
```bash
conda create -n ragen python=3.9 -y
conda activate ragen
git clone git@github.com:ZihanWang314/ragen.git
cd ragen
# setup install
pip install -e . # includes verl-ragen (by us) and verl-core (by the verl team)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# Optional: to install flash-attn, you may need to install cuda-toolkit first if you don't have it
conda install -c "nvidia/label/cuda-12.1.0" cuda-toolkit -y
export CUDA_HOME=$CONDA_PREFIX # e.g. /opt/conda/envs/ragen
pip3 install flash-attn --no-build-isolation
pip install -r requirements.txt # other packages
```
## Train Models
### Create data
On the [Gym-Sokoban](https://github.com/mpSchrader/gym-sokoban) task, we create 10k first-round observations for training and run for <=1 epoch.
```bash
# sokoban env settings. will determine game difficulty
# it's normal to see some SOKOBAN errors, but the data will be created and it's fine
export DIM_X=6
export DIM_Y=6
export NUM_BOXES=1
export MAX_STEPS=5
export SEARCH_DEPTH=30
python scripts/dataset_curation.py \
--output data/sokoban \
--seed 10000 \
--train_size 10000 \
--test_size 10 \
    --prefix qwen-instruct # we find this prompt prefix also works for base models
```
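The exported variables above control game difficulty. A minimal sketch of how such settings could be collected into a Sokoban config, using the same env var names as above (the parsing helper itself is illustrative, not RAGEN's actual code):

```python
import os

def sokoban_config() -> dict:
    """Read Sokoban difficulty settings from the environment.

    Defaults mirror the values exported in the snippet above; the env var
    names match, but this helper is a sketch, not the repository's code.
    """
    return {
        "dim_room": (int(os.environ.get("DIM_X", 6)),
                     int(os.environ.get("DIM_Y", 6))),
        "num_boxes": int(os.environ.get("NUM_BOXES", 1)),
        "max_steps": int(os.environ.get("MAX_STEPS", 5)),
        "search_depth": int(os.environ.get("SEARCH_DEPTH", 30)),
    }
```

Larger rooms, more boxes, and deeper search depth produce harder (and slower-to-generate) puzzles.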
### Export variables and train
```bash
export DATA_DIR=data/sokoban
export DIM_X=6
export DIM_Y=6
export NUM_BOXES=1
export MAX_STEPS=5
export SEARCH_DEPTH=30
export CUDA_VISIBLE_DEVICES=0
export BASE_MODEL=Qwen/Qwen2.5-0.5B
export EXPERIMENT_NAME=test-qwen2.5-0.5b
# or resume from a previously saved actor checkpoint, e.g.:
# export BASE_MODEL=checkpoints/Agent-R1/test-qwen2.5-0.5b-instruct-1mbsz/actor/global_step_100
# export EXPERIMENT_NAME=test-qwen2.5-0.5b-imagetest
export MICRO_BATCH_SIZE=1
export TRAIN_BATCH_SIZE=128 # 256
export PPO_BATCH_SIZE=64 # 128
export MAX_START_LENGTH=400 # max length of the first-round prompt
export MAX_RESPONSE_LENGTH=100
export MAX_OBS_LENGTH=120
export MAX_TURNS=5
export NUM_UPDATE_PER_ROLL=1 # roll out a batch, then update the model N times. Currently not implemented.
export LOG_MODE="['wandb']" # or 'console'
export GCP=True # gradient checkpointing
export N_GPUS=1
export ROLLOUT_TP_SIZE=1
bash ./train.sh # more arguments in this file
# default config file is verl/trainer/config/ppo_trainer.yaml
```
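Under common PPO setups, the batch-size knobs above relate through gradient accumulation. A rough sketch of that arithmetic, assuming the usual semantics (verl's actual config handling may differ):

```python
def grad_accum_steps(train_batch_size: int, micro_batch_size: int, n_gpus: int) -> int:
    """Number of micro-batches accumulated per optimizer step.

    Assumes train_batch_size is a global batch split across n_gpus,
    each processing micro_batch_size samples per forward/backward pass.
    """
    per_pass = micro_batch_size * n_gpus
    assert train_batch_size % per_pass == 0, "batch sizes must divide evenly"
    return train_batch_size // per_pass
```

With the settings above (TRAIN_BATCH_SIZE=128, MICRO_BATCH_SIZE=1, N_GPUS=1), each training step accumulates 128 micro-batches, which keeps memory low at the cost of step latency.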
## Visualization
1. By setting arguments in `train.sh`, you can visualize the trajectory:
```bash
logging.log_images=True # set to True to log images
logging.log_image_dir=.log.debug/trajectory # set to the directory to save images
logging.log_image_step_size=1 # save image every _ steps
logging.log_n_image_per_batch=8 # save _ images per batch
```
2. You may also need to install fonts for the figures to display correctly:
```bash
sudo apt-get install fonts-noto-cjk
```
3. Example image for one trajectory:
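The image-logging cadence implied by the arguments in step 1 can be sketched as follows, assuming the straightforward interpretation of `log_image_step_size` and `log_n_image_per_batch` (a sketch, not the trainer's actual logic):

```python
def should_log_image(step: int, log_image_step_size: int) -> bool:
    """Log at step 0 and every `log_image_step_size` steps thereafter."""
    return step % log_image_step_size == 0

def trajectories_to_log(batch_size: int, log_n_image_per_batch: int) -> range:
    """Indices of the first N trajectories in the batch to render."""
    return range(min(batch_size, log_n_image_per_batch))
```

With the defaults above (`log_image_step_size=1`, `log_n_image_per_batch=8`), every step renders the first 8 trajectories of the batch into `logging.log_image_dir`.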