# RL2: Ray Less Reinforcement Learning
A concise post-training library for large language models.
This is the right library for you if you want to learn reinforcement learning for large language models or quickly test your own algorithms.
We deliver a clear implementation without complicated abstractions.
Despite the simplicity, you can scale up with 3D parallelism (DP/CP/TP) on the FSDP backend and 5D parallelism (DP/CP/PP/TP/EP) on the Megatron backend.
We also support
* Balanced sequence packing for higher throughput
* Multi-turn rollout with [SGLang](https://github.com/sgl-project/sglang) async inference engine
* [GEM](https://github.com/axon-rl/gem.git) (OpenAI Gym like) Agentic Environments
RL2 is a production-ready library: it achieves performance comparable to other popular LLM RL libraries.
Also check our wandb report on [OpenThoughts](https://wandb.ai/chenmientan/OpenThoughts_archive), [SkyworkRM](https://wandb.ai/chenmientan/SkyworkRM_archive), [UltraFeedback](https://wandb.ai/chenmientan/UltraFeedback_archive), [TinyZero](https://wandb.ai/chenmientan/Countdown_archive), [LetterCounting](https://wandb.ai/chenmientan/LetterCounting_archive), and [SearchR1](https://wandb.ai/chenmientan/SearchR1_archive).
## Incoming Features
- [X] Support Megatron backend to increase GPU utilization for Mixture-of-Experts models
- [ ] Support Low-Rank Adaptation to decrease GPU memory consumption
- [X] Initialize models on the meta device to decrease RAM consumption
- [X] Support partial rollout to decrease GPU idle time
- [X] Use SGLang Router to forward requests for load balancing across inference engines
- [X] Integrate GEM to scale environments
## Getting Started
### Installation
**From PyPI:**
```bash
pip install rl-square
```
**Using Pre-built Docker Images:**
```bash
# FSDP backend only
docker pull rlsquare/rl2:latest
# Megatron backend (includes FSDP)
docker pull rlsquare/rl2-megatron:latest
```
### Data Preparation [[Examples]](https://huggingface.co/Chenmien/datasets)
Hugging Face datasets and various file types, *i.e.*, JSON, JSONL, CSV, Parquet, and Arrow, are accepted.
All trainers support both the raw-text and the messages format.
The former is more flexible but may be model-specific.
#### SFT
```json
[
{
"prompt": "The capital of China is",
"response": "Beijing."
}
]
```
```json
[
{
"messages": [
{"role": "user", "content": "What is the capital of China?"},
{"role": "assistant", "content": "Beijing."}
]
}
]
```
Multi-turn is only supported by the latter format.
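As a rough sketch, the two formats can be told apart programmatically when loading mixed data. `detect_sft_format` below is a hypothetical helper for illustration, not part of RL2:

```python
def detect_sft_format(record: dict) -> str:
    """Classify an SFT record as "raw" or "messages" (hypothetical helper)."""
    if "prompt" in record and "response" in record:
        return "raw"
    if "messages" in record:
        # messages carry explicit roles, enabling multi-turn data
        roles = {m["role"] for m in record["messages"]}
        assert roles <= {"system", "user", "assistant"}
        return "messages"
    raise ValueError(f"unrecognized SFT record: {record}")

raw = {"prompt": "The capital of China is", "response": "Beijing."}
chat = {"messages": [
    {"role": "user", "content": "What is the capital of China?"},
    {"role": "assistant", "content": "Beijing."},
]}
print(detect_sft_format(raw))   # raw
print(detect_sft_format(chat))  # messages
```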
#### RM and DPO
```json
[
{
"prompt": "The capital of China is",
"chosen": "Beijing.",
"rejected": "Shanghai."
}
]
```
```json
[
{
"chosen": [
{"role": "user", "content": "What is the capital of China?"},
{"role": "assistant", "content": "Beijing."}
],
"rejected": [
{"role": "user", "content": "What is the capital of China?"},
{"role": "assistant", "content": "Shanghai."}
]
}
]
```
#### PPO
```json
[
{
"prompt": "The capital of China is",
"extra_info": {
"answer": "Beijing"
}
}
]
```
```json
[
{
"messages": [
{"role": "user", "content": "What is the capital of China?"}
],
"extra_info": {
"answer": "Beijing"
}
}
]
```
### Environments [[Examples]](./envs)
In PPO, the language model interacts with the environment through a user-defined function `step` in the following format.
```python
from typing import Dict

async def step(
    state: str, action: str, extra_info: Dict
) -> Dict:
    # parse_action_type, parse_query, search_result, parse_pred, and
    # is_equivalent are user-defined helpers.
    action_type = parse_action_type(action)
    env_response = {
        "next_state": None,
        "reward": 0.0,
        "score": 0.0,
        "done": False,
        "extra_info": extra_info
    }
    if action_type == "search":
        query = parse_query(action)
        passage = await search_result(query)
        env_response["next_state"] = state + action + passage
    elif action_type == "answer":
        pred = parse_pred(action)
        reward = float(is_equivalent(pred, extra_info["answer"]))
        env_response["reward"] = reward
        env_response["score"] = reward
        env_response["done"] = True
    return env_response
```
* `state` and `action` are the input and output of language model in the last turn and `next_state` is the input of language model in the next turn.
When `state + action` is a prefix of `next_state`, the two turns will be processed in a single sequence.
* `reward` is used to compute advantages (and subsequently update the model), while `score` is used to log model performance.
Different values may be used when needed.
* `done` indicates whether to proceed to the next turn.
* `extra_info` carries everything not covered above, *e.g.*, the reference answer.
The function should be included in a Python script where the path is specified by `actor.rollout.env_path`.
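For a fully runnable sketch, here is a minimal answer-only environment that grades the text inside `\boxed{...}` against `extra_info["answer"]`. The `\boxed{}` parsing convention and the grading logic are our illustrative assumptions, not anything RL2 prescribes:

```python
import asyncio
import re
from typing import Dict

async def step(state: str, action: str, extra_info: Dict) -> Dict:
    # Single-turn environment: reward 1.0 when the \boxed{...} answer
    # matches the reference, else 0.0; every action ends the episode.
    env_response = {
        "next_state": None,
        "reward": 0.0,
        "score": 0.0,
        "done": True,
        "extra_info": extra_info,
    }
    match = re.search(r"\\boxed\{(.+?)\}", action)
    if match is not None:
        reward = float(match.group(1).strip() == extra_info["answer"])
        env_response["reward"] = reward
        env_response["score"] = reward
    return env_response

resp = asyncio.run(step(
    "What is 2 + 2?", r"The answer is \boxed{4}.", {"answer": "4"}
))
print(resp["reward"], resp["done"])  # 1.0 True
```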
### Launch [[Examples]](./examples)
Use `torchrun` to launch the trainer. For example, on a single node
```bash
torchrun \
--nproc_per_node= \
-m RL2.trainer.ppo \
```
For multiple nodes
```bash
torchrun \
--nnodes= \
--node_rank= \
--nproc_per_node= \
--master_addr= \
--master_port= \
-m RL2.trainer.ppo \
```
## Hyper-Parameters
### Training Engine Partition
By default, *i.e.*, `ddp_size=1, tp_size=1`, your model will be partitioned via ZeRO stage 3.
`ddp_size` specifies the number of model parameter copies.
A larger `ddp_size` leads to higher memory consumption but lower communication cost.
For large models, you may specify `tp_size > 1` to enable tensor parallelism.
The product of `ddp_size` and `tp_size` should be a factor of the total number of GPUs.
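The divisibility constraint above can be sketched as follows; `check_partition` is a hypothetical helper (RL2 performs its own validation), and the returned shard size assumes parameters are ZeRO-3-sharded across the remaining GPUs:

```python
def check_partition(world_size: int, ddp_size: int, tp_size: int) -> int:
    """Validate a partition and return the implied ZeRO-3 shard group size."""
    assert world_size % (ddp_size * tp_size) == 0, \
        "ddp_size * tp_size must be a factor of the number of GPUs"
    return world_size // (ddp_size * tp_size)

# 8 GPUs, defaults: pure ZeRO stage 3 sharding across all 8 devices
print(check_partition(8, ddp_size=1, tp_size=1))  # 8
# 8 GPUs with tensor parallelism over 2 GPUs per parameter copy
print(check_partition(8, ddp_size=1, tp_size=2))  # 4
```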
### Sequence Length
For SFT, RM, and DPO, `max_length` is used to truncate sequences.
In RM and DPO, the chosen and rejected sequences will be packed together, so the actual sequence length can be up to twice `max_length`.
For PPO, `max_new_tokens` is used to terminate generations.
The length of any sequence cannot exceed `cp_size * tp_size * max_length_per_device`.
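As a quick sanity check, the cap above can be computed directly (the helper name is ours, not RL2's):

```python
def max_sequence_length(cp_size: int, tp_size: int,
                        max_length_per_device: int) -> int:
    # Longest sequence a PPO step can process under the rule above.
    return cp_size * tp_size * max_length_per_device

print(max_sequence_length(cp_size=2, tp_size=2, max_length_per_device=4096))  # 16384
```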
### Algorithm
The default algorithm is [Dr. GRPO](https://arxiv.org/abs/2503.20783), where the loss is averaged at the token level and the advantage is not divided by the standard deviation.
* To use OpenAI PPO, set `kl.type=reward`, `kl.reward_estimator=k1`, and `adv.estimator=gae`
* To use DeepSeek GRPO, set `actor.avg_level=sequence`, `kl.type=loss`, `kl.loss_estimator=k3`, and `adv.norm_var=true`
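The difference between the two advantage estimators can be sketched in a few lines. This is an illustration of group-relative advantages, not RL2's implementation, and the use of the population standard deviation is our assumption:

```python
from statistics import mean, pstdev

def group_advantages(rewards, norm_var=False):
    """Group-relative advantages for one prompt's rollouts.
    Dr. GRPO (default): subtract the group mean only.
    DeepSeek GRPO (adv.norm_var=true): also divide by the group std."""
    baseline = mean(rewards)
    adv = [r - baseline for r in rewards]
    if norm_var:
        std = pstdev(rewards) or 1.0  # guard against zero variance
        adv = [a / std for a in adv]
    return adv

rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))                 # [0.5, -0.5, -0.5, 0.5]
print(group_advantages(rewards, norm_var=True))  # [1.0, -1.0, -1.0, 1.0]
```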
## Acknowledgement
This project is built upon many remarkable projects, including but not limited to
* [DeepSpeedChat](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat) for the proposal of hybrid engine
* [RingFlashAttention](https://github.com/zhuzilin/ring-flash-attention) for the support of ZigZag context parallelism
* [SGLang](https://github.com/sgl-project/sglang) for the support of async inference engine
We also thank [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [veRL](https://github.com/volcengine/verl), and [slime](https://github.com/THUDM/slime) for their pioneering work.
## Citation
If you find this library useful, please cite it as follows
```latex
@misc{Tan2025RL2,
    author = {Chenmien Tan and Simon Yu and Lanbo Lin and Ze Zhang and Yuanwu Xu and Chenhao Jiang and Tianyuan Yang and Sicong Xie and Guannan Zhang},
    title = {RL2: Ray Less Reinforcement Learning},
    note = {GitHub repository},
    howpublished = {\url{https://github.com/ChenmienTan/RL2}},
    year = {2025}
}
```
## Job Opportunities
We are [Accio](https://www.accio.com/), the world's first AI sourcing agent.
We are always looking for talent in Hangzhou.
Send us an [email](mailto:sicong.xsc@alibaba-inc.com) if you are interested in internship or full-time positions in post-training/agents.