# openr
**Repository Path**: liao_1995/openr
## Basic Information
- **Project Name**: openr
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: 15-small-bugs-about-string-post-processing-in-rmremotecaller
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-10-18
- **Last Updated**: 2024-10-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
---
[![Contributors][contributors-shield]][contributors-url]
[Paper (arXiv:2410.09671)](https://arxiv.org/pdf/2410.09671)
[![Issues][issues-shield]][issues-url]
[![Forks][forks-shield]][forks-url]
[![Stargazers][stars-shield]][stars-url]
[Hugging Face](https://huggingface.co/openreasoner)
[X (Twitter)](https://x.com/openreasoner)
[Community](#community)
Table of Contents
- News and Updates
- Features
- Plots
- Datasets and Models
- Getting Started
- Usage
- Contact
- License
- Response Examples
- Community
- Reference
## News and Updates
- **[15/10/2024]** Our report is on [**Arxiv**](https://arxiv.org/abs/2410.09671)!
- **[12/10/2024]** ***OpenR*** has been released! 🚀
## Features
- ✅ Process-supervision Data Generation
- ✅ Online Policy Training
- ✅ Generative and Discriminative PRM Training
- ✅ Multiple Search Strategies
- ✅ Test-time Computation and Scaling Law
## Plots
## Provided Datasets and Models
- [MATH-APS](https://huggingface.co/datasets/mengfang/MATH-APS) (Our Dataset)
- [MATH-psa](https://huggingface.co/openreasoner/Math-psa) (Our Process Reward Model)
## Getting Started
### Installation
```bash
conda create -n open_reasoner python=3.10
conda activate open_reasoner
pip install -r requirements.txt
pip3 install "fschat[model_worker,webui]"
pip install -U pydantic
cd envs/MATH/latex2sympy
pip install -e .
cd -
```
### Download Base Models
Before running the project, please ensure that all required base models are downloaded. The models used in this project include:
- `Qwen2.5-Math-1.5B-Instruct`, `Qwen2.5-Math-7B-Instruct`
- `Qwen2.5-Math-RM-72B`
- `peiyi9979/mistral-7b-sft`
- `peiyi9979/math-shepherd-mistral-7b-prm`
To download these models, please refer to the [Hugging Face model downloading tutorial](https://huggingface.co/docs/hub/models-downloading) for step-by-step guidance on downloading models from the Hugging Face Hub.
Please make sure that all models are saved in their directories according to the project setup before proceeding.
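As a sketch, models can be fetched with `huggingface-cli` (shipped with `huggingface_hub`); the `$MODEL_BASE` target directory here is an assumption chosen to match the Quickstart variables below, not a repository default:

```shell
# Assumes: pip install -U "huggingface_hub[cli]"
# $MODEL_BASE is a hypothetical target directory; point it wherever you store models.
export MODEL_BASE=$HOME/models
huggingface-cli download Qwen/Qwen2.5-Math-1.5B-Instruct \
  --local-dir $MODEL_BASE/Qwen2.5-Math-1.5B-Instruct
huggingface-cli download peiyi9979/math-shepherd-mistral-7b-prm \
  --local-dir $MODEL_BASE/math-shepherd-mistral-7b-prm
```

Repeat for the other base models listed above, keeping the directory names consistent with the service scripts.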
### Quickstart
Before running inference, please modify the following variables in the scripts under `reason/llm_service/` to set the appropriate base models for your usage:
- `$MODEL_BASE`: Set this to the directory where your models are stored.
- `$POLICY_MODEL_NAME`: Set this to the name of the policy model you wish to use.
- `$VALUE_MODEL_NAME`: Set this to the name of the value model you wish to use.
- `$NUM_LM_WORKER`: Set this to the number of language model (LM) workers to start.
- `$NUM_RM_WORKER`: Set this to the number of reward model (RM) workers to start.
With these variables set, you can start the model services and run inference with the techniques below.
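For instance, the top of a service script might be configured like this (the values are illustrative, not the repository defaults):

```shell
# Illustrative values only -- substitute your own model names and paths.
MODEL_BASE=$HOME/models                        # directory holding the downloaded models
POLICY_MODEL_NAME=Qwen2.5-Math-1.5B-Instruct   # policy (generation) model
VALUE_MODEL_NAME=math-shepherd-mistral-7b-prm  # value / reward model
NUM_LM_WORKER=1                                # number of LM workers to start
NUM_RM_WORKER=1                                # number of RM workers to start
```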
#### Start LM & RM Services
For example, to start the LM and RM services for the Math Shepherd model, run the following command:
```bash
sh reason/llm_service/create_service_math_shepherd.sh
```
## Usage
#### Run Inference
⚠️ Make sure the inputs (`--LM`, `--RM`) in the evaluation script match the variables (`$POLICY_MODEL_NAME`, `$VALUE_MODEL_NAME`) used by the running workers!
```bash
export PYTHONPATH=$(pwd)
sh scripts/eval/cot_greedy.sh
# Method: cot. Average result: ({'majority_vote': 0.734, 'total_completion_tokens': 559.13},)
sh scripts/eval/cot_rerank.sh
# Method: best_of_n. Average result: ({'majority_vote': 0.782,
# 'prm_min_max': 0.772,
# 'prm_min_vote': 0.792,
# 'prm_last_max': 0.776,
# 'prm_last_vote': 0.792,
# 'total_completion_tokens': 4431.268},)
sh scripts/eval/beam_search.sh
# Method: beam_search. Average result: ({'majority_vote': 0.74, 'total_completion_tokens': 2350.492},)
```
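The `majority_vote` metric reported above is simple self-consistency voting: the most frequent final answer across sampled completions wins. A minimal shell sketch over five hypothetical sampled answers (not the repository's implementation):

```shell
# Self-consistency majority vote: count answer occurrences,
# take the most frequent one, and print just the answer.
printf '%s\n' 42 41 42 42 7 \
  | sort | uniq -c | sort -rn \
  | head -n 1 | awk '{print $2}'
# -> 42
```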
#### Run Training
⚠️ Before training, please modify the `$dataset_path`, `$model_name_or_path` and `$prm_name_or_path` in `train/mat/scripts/train_llm.sh`.
```bash
cd train/mat/scripts
bash train_llm.sh
```
#### Run PRM Learning
```bash
cd prm/code
# single GPU
python finetune_qwen_single_gpu.py --model_path $YOUR_MODEL_PATH \
    --train_data_path $TRAIN_DATA_PATH \
    --test_data_path $TEST_DATA_PATH

# multi GPU
torchrun --nproc_per_node=2 finetune_qwen.py --model_path $YOUR_MODEL_PATH \
    --data_path $YOUR_DATA_FOLDER_PATH \
    --datasets both
```
## Future Plan
- Add More Comprehensive Evaluations on RL Training and Search Strategies
- Scaling the Prover-Verifier Model Size
- Support Self-improvement Training
## Contact
The OpenR community is maintained by:
- **Openreasoner Team** (openreasoner@gmail.com)
## License
OpenR is released under the MIT License.
## Citation
If you do find our resources helpful, please cite our paper:
```bibtex
@article{openr2024,
  title  = {OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models},
  author = {Wang, Jun and Fang, Meng and Wan, Ziyu and Wen, Muning and Zhu, Jiachen and Liu, Anjie and Gong, Ziqin and Song, Yan and Chen, Lei and Ni, Lionel M. and Yang, Linyi and Wen, Ying and Zhang, Weinan},
  journal = {arXiv preprint arXiv:2410.09671},
  url    = {https://arxiv.org/pdf/2410.09671},
  year   = {2024}
}
```
## Response Examples
### Comparing PRM, Math-psa (Ours) V.S. Math-Shepherd
### Justifying RL Training
### Exploring Test-time Computation
## Community
**WeChat**:
## Reference
### Inference-time Computing
[1] [Alphazero-like tree-search can guide large language model decoding and training.](https://arxiv.org/pdf/2309.17179)
[2] [Reasoning with language model is planning with world model.](https://arxiv.org/pdf/2305.14992)
[3] [Scaling LLM test-time compute optimally can be more effective than scaling model parameters](https://arxiv.org/pdf/2408.03314)
[4] [Think before you speak: Training language models with pause tokens](https://arxiv.org/pdf/2310.02226)
### From Outcome Supervision to Process Supervision
[1] [Training verifiers to solve math word problems](https://arxiv.org/pdf/2110.14168)
[2] [Solving math word problems with process-and outcome-based feedback](https://arxiv.org/pdf/2211.14275)
[3] [Let’s verify step by step](https://arxiv.org/pdf/2305.20050)
[4] [Making large language models better reasoners with step-aware verifier](https://arxiv.org/pdf/2206.02336)
[5] [OVM, outcome-supervised value models for planning in mathematical reasoning](https://aclanthology.org/2024.findings-naacl.55.pdf)
[6] [Generative verifiers: Reward modeling as next-token prediction](https://arxiv.org/pdf/2408.15240)
### Data Acquisition
[1] [Star: Bootstrapping reasoning with reasoning](https://proceedings.neurips.cc/paper_files/paper/2022/file/639a9a172c044fbb64175b5fad42e9a5-Paper-Conference.pdf)
[2] [Quiet-star: Language models can teach themselves to think before speaking](https://arxiv.org/pdf/2403.09629)
[3] [Improve mathematical reasoning in language models by automated process supervision](https://arxiv.org/pdf/2406.06592)
[4] [Shepherd: A critic for language model generation](https://arxiv.org/abs/2308.04592)
[5] [Math-shepherd: Verify and reinforce llms step-by-step without human annotations](https://aclanthology.org/2024.acl-long.510.pdf)
[contributors-shield]: https://img.shields.io/github/contributors/openreasoner/openr.svg?style=for-the-badge
[contributors-url]: https://github.com/openreasoner/openr/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/openreasoner/openr.svg?style=for-the-badge
[forks-url]: https://github.com/openreasoner/openr/network/members
[stars-shield]: https://img.shields.io/github/stars/openreasoner/openr.svg?style=for-the-badge
[stars-url]: https://github.com/openreasoner/openr/stargazers
[issues-shield]: https://img.shields.io/github/issues/openreasoner/openr.svg?style=for-the-badge
[issues-url]: https://github.com/openreasoner/openr/issues
[license-shield]: https://img.shields.io/github/license/openreasoner/openr.svg?style=for-the-badge
[license-url]: https://github.com/openreasoner/openr/blob/main/LICENSE.txt