# ProcessBench

📄 [**[paper]**](https://huggingface.co/papers/2412.06559) 🤗 [**[data]**](https://huggingface.co/datasets/Qwen/ProcessBench)

This is the official repository for the paper "**ProcessBench: Identifying Process Errors in Mathematical Reasoning**".

If you find this work relevant or helpful to your work, please kindly cite us:

```
@article{processbench,
  title={ProcessBench: Identifying Process Errors in Mathematical Reasoning},
  author={Chujie Zheng and Zhenru Zhang and Beichen Zhang and Runji Lin and Keming Lu and Bowen Yu and Dayiheng Liu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2412.06559},
  year={2024}
}
```

## News

* **[12/13/2024]** Released the evaluation [code](./code/run_eval_prm_rlhflow.py) for the RLHFlow PRMs
* **[12/11/2024]** Released the evaluation [**code**](./code) and the [**data**](https://huggingface.co/datasets/Qwen/ProcessBench) on HuggingFace
* **[12/10/2024]** Released the [**paper**](https://huggingface.co/papers/2412.06559) on arXiv

## Data Usage

You can use the following code to preview the ProcessBench data:

```python
import json
from datasets import load_dataset

dataset = load_dataset('Qwen/ProcessBench', split='gsm8k')
print(json.dumps(dataset[0], indent=2))

# Expected output:
"""
{
  "id": "gsm8k-0",
  "generator": "Qwen2-7B-Instruct",
  "problem": "Sue lives in a fun neighborhood...",
  "steps": [
    "To find out how many more pink plastic flamingos were out than...",
    ...
  ],
  "final_answer_correct": false,
  "label": 1
}
"""
```

## Evaluation

You can refer to the [code](./code) folder for the evaluation code and the prompt templates we use in this work.

### Evaluating TRL-based models

TRL v0.13.0 introduced a [PRM trainer](https://huggingface.co/docs/trl/v0.13.0/en/prm_trainer). The resulting PRM returns probabilities for the different tokens and works directly with the [token classification](https://huggingface.co/docs/transformers/tasks/token_classification) pipeline.

To evaluate these models, clone this repository and install the `requirements-trl.txt` dependencies:

```bash
uv pip install -r requirements-trl.txt
```

Then go to the `code` folder and run the following script:

```bash
python run_eval_prm_trl.py \
    --model_name "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2" \
    --output_dir "./outputs" \
    --batch_size 256 \
    --sep "\n\n"
```

Other than the model to evaluate and the token used as the step separator, the only relevant argument is the batch size. Internally, the evaluation runs through a `transformers` pipeline and benefits from larger batch sizes. *For reference, for a 7B model a batch size of 128 should work, taking close to 2 hours to complete the benchmark.* The results are saved in `output_dir`; if the command is rerun, the saved results are reused and only the final metrics are recomputed.
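As an illustration of how one of these PRMs can be queried outside the script, here is a minimal sketch (not the evaluation script itself) that scores a single ProcessBench sample with the token-classification pipeline. The model name and separator simply mirror the example command above, and how the predicted labels map to correct versus erroneous steps depends on the specific checkpoint:

```python
from datasets import load_dataset
from transformers import pipeline

# Example values mirroring the command above; adjust them for your checkpoint.
model_name = "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2"
separator = "\n\n"  # must match the separator used when training the PRM

prm = pipeline("token-classification", model=model_name)

sample = load_dataset("Qwen/ProcessBench", split="gsm8k")[0]
# Join the problem and its solution steps with the training separator so the
# PRM can emit a prediction at each separator position.
text = separator.join([sample["problem"]] + sample["steps"]) + separator

# Each prediction carries a label ("entity") and a probability ("score");
# typically only the predictions at the separator positions are of interest,
# and the label-to-step-correctness mapping is checkpoint-specific.
for pred in prm(text):
    print(pred["entity"], f"{pred['score']:.4f}")
```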
The help for the script can be found here:

```bash
usage: run_eval_prm_trl.py [-h] [--config {gsm8k,math,olympiadbench,omnimath,all}] --model_name MODEL_NAME
                           [--output_dir OUTPUT_DIR] [--sep SEP] [--batch_size BATCH_SIZE]
                           [--max_elements MAX_ELEMENTS]

options:
  -h, --help            show this help message and exit
  --config {gsm8k,math,olympiadbench,omnimath,all}
                        The configuration to run from the dataset, by default will use 'all'.
  --model_name MODEL_NAME
  --output_dir OUTPUT_DIR
                        The path to save the results to.
  --sep SEP             Separator of the model, ensure it corresponds to the same one used during training.
  --batch_size BATCH_SIZE
                        The number of examples to run in a single batch. Each question has multiple steps,
                        and a batch can contain multiple from different questions to speed up the process.
  --max_elements MAX_ELEMENTS
                        Number of elements to run. Helpful for testing, by default will run the full dataset.
```

* Analyzing the results:

The following output corresponds to [plaguss/Qwen2.5-Math-7B-Instruct-PRM-0.2](https://huggingface.co/plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2):

```bash
Individual Results:
----------------------------------------------------------------------
gsm8k -> Precision: 22.71 Recall: 93.78 F1 Score: 36.56
math -> Precision: 38.22 Recall: 70.69 F1 Score: 49.61
olympiadbench -> Precision: 27.08 Recall: 53.98 F1 Score: 36.07
omnimath -> Precision: 27.93 Recall: 54.77 F1 Score: 37.00

Weighted Averages:
----------------------------------------------------------------------
Weighted -> Precision: 30.09 Recall: 63.81 F1 Score: 40.38
```

The script prints the individual results for each subset and, finally, the averages weighted by the number of examples in each subset. The weighted F1 score is the value to compare against the numbers reported in the paper.
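The weighted averages are plain averages of the per-subset metrics, weighted by the number of examples in each subset. The following sketch (not part of the evaluation script) reproduces the weighted numbers above from the individual results, using the ProcessBench subset sizes of 400 examples for GSM8K and 1,000 each for MATH, OlympiadBench, and Omni-MATH:

```python
# Per-subset results copied from the output above, together with the number
# of examples in each ProcessBench subset (used as the weights).
per_subset = {
    "gsm8k":         (400,  {"Precision": 22.71, "Recall": 93.78, "F1 Score": 36.56}),
    "math":          (1000, {"Precision": 38.22, "Recall": 70.69, "F1 Score": 49.61}),
    "olympiadbench": (1000, {"Precision": 27.08, "Recall": 53.98, "F1 Score": 36.07}),
    "omnimath":      (1000, {"Precision": 27.93, "Recall": 54.77, "F1 Score": 37.00}),
}

total = sum(n for n, _ in per_subset.values())
for metric in ("Precision", "Recall", "F1 Score"):
    weighted = sum(n * scores[metric] for n, scores in per_subset.values()) / total
    print(f"Weighted {metric}: {weighted:.2f}")
# Prints: Weighted Precision: 30.09, Weighted Recall: 63.81, Weighted F1 Score: 40.38
```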