# RAG-Bench

**Repository Path**: duzgd/RAG-Bench

## Basic Information

- **Project Name**: RAG-Bench
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-26
- **Last Updated**: 2025-05-26

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# RQABench: Retrieval QA Benchmark

[![Documentation Status](https://readthedocs.org/projects/retrieval-qa-benchmark/badge/?version=latest)](https://retrieval-qa-benchmark.readthedocs.io/en/latest/?badge=latest)

Retrieval QA Benchmark (RQABench for short) is an open-source, end-to-end test workbench for Retrieval Augmented Generation (RAG) systems. We intend to build an open benchmark for all developers and researchers to reproduce and design new RAG systems. We also want to create a platform where everyone can share their LEGO blocks and help others build their own retrieval + LLM systems.

Here are some major features of this benchmark:

- **Flexibility**: We maximize flexibility when you design your retrieval system: any transform can be plugged in, as long as it accepts a `QARecord` as input and produces a `QARecord` as output.
- **Reproducibility**: We gather all settings of the evaluation process into a single YAML configuration, which helps you track and reproduce experiments.
- **Traceability**: We collect more than accuracy and scores. We also record the running time of any function you want to watch and the tokens used across the whole RAG system.

## Getting started

### Clone and install

```bash
# Clone to your local machine
git clone https://github.com/myscale/Retrieval-QA-Benchmark
# Install it as an editable package
cd Retrieval-QA-Benchmark && python3 -m pip install -e .
```

### Run it

```python
from retrieval_qa_benchmark.models import *
from retrieval_qa_benchmark.datasets import *
from retrieval_qa_benchmark.transforms import *
from retrieval_qa_benchmark.evaluators import *
from retrieval_qa_benchmark.utils.profiler import PROFILER

# This loads our special YAML configuration with the `!include` keyword
from retrieval_qa_benchmark.utils.config import load

# This is where you construct an evaluator from the configuration
from retrieval_qa_benchmark.utils.factory import EvaluatorFactory

# This prints all loaded modules. You can also use it as a reference to edit your configuration
print(str(REGISTRY))

# Choose a configuration to evaluate
config = load(open("config/mmlu.yaml"))
evaluator = EvaluatorFactory.from_config(config).build()

# The evaluator returns the accuracy as a float and a list of `QAPrediction`
acc, result = evaluator()

# You can set `out_file` to generate a JSONL file, or write the results out yourself
with open("some-file-name-to-store-result.jsonl", "w") as f:
    f.write("\n".join([r.model_dump_json() for r in result]))
```

## Replicate our FAISS / MyScale Benchmark

1. RAG with FAISS
   - Download the index file for Wikipedia [here](https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/RQA/IVFSQ_IP.index) (around 26 GB).
   - Download the dataset from Hugging Face with our code (around 140 GB); it is downloaded automatically the first time you run.
   - Set the index path to the downloaded index (a loading sketch follows below).
2. RAG with MyScale
   - Download the Wikipedia data in Parquet format [here](https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_with_vector.parquet).
   - Insert the data and create a vector index.

You can also directly use our free pod hosting the Wikipedia data, as described [here](https://github.com/myscale/ChatData?tab=readme-ov-file#data-schema).
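Below is a minimal, illustrative sketch (not part of the benchmark code) of what step 1 amounts to once the index file is downloaded: loading the IVFSQ index with FAISS and running a Top-k search. The file path and the random query vector are placeholders; the actual pipeline embeds the question with the same model used to build the index and wires the index path through the YAML configuration.

```python
import faiss
import numpy as np

# Placeholder path: point this at the ~26 GB IVFSQ_IP.index downloaded above.
index = faiss.read_index("/path/to/IVFSQ_IP.index")
index.nprobe = 128  # the setting reported with the FAISS results below

# Placeholder query embedding: the real pipeline embeds the question text
# with the same embedding model used to build the index.
query = np.random.rand(1, index.d).astype("float32")

scores, doc_ids = index.search(query, 5)  # retrieve Top-5 context ids
print(doc_ids[0], scores[0])
```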
## Result with Simple RAG pipeline

### with MyScale
| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
|------|----------|----------------|-----------------|-------------------|-----------------------|-------------------------|---------|
| gpt-3.5-turbo | (no context) | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| | Top-1 | 75.66% (+3.95%) | 78.40% (+8.03%) | 46.00% (+8.00%) | 67.05% (-0.58%) | 73.21% (-1.51%) | 71.50% (+3.45%) |
| | Top-3 | 76.97% (+5.26%) | 81.79% (+11.42%) | 48.00% (+10.00%) | 65.90% (-1.73%) | 73.96% (-0.76%) | 72.98% (+4.93%) |
| | Top-5 | 78.29% (+6.58%) | 79.63% (+9.26%) | 42.00% (+4.00%) | 68.21% (+0.58%) | 74.34% (-0.38%) | 72.39% (+4.34%) |
| | Top-10 | 78.29% (+6.58%) | 79.32% (+8.95%) | 44.00% (+6.00%) | 71.10% (+3.47%) | 75.47% (+0.75%) | 73.27% (+5.22%) |
| llama2-13b-chat-q6_0 | (no context) | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| | Top-1 | 58.55% (+5.26%) | 61.73% (+4.32%) | 45.00% (+12.00%) | 46.24% (+1.73%) | 54.72% (+4.53%) | 55.13% (+4.83%) |
| | Top-3 | 63.16% (+9.87%) | 63.27% (+5.86%) | 49.00% (+16.00%) | 46.82% (+2.31%) | 55.85% (+5.66%) | 57.10% (+6.80%) |
| | Top-5 | 63.82% (+10.53%) | 65.43% (+8.02%) | 51.00% (+18.00%) | 51.45% (+6.94%) | 57.74% (+7.55%) | 59.37% (+9.07%) |
| | Top-10 | 65.13% (+11.84%) | 66.67% (+9.26%) | 46.00% (+13.00%) | 49.71% (+5.20%) | 57.36% (+7.17%) | 59.07% (+8.77%) |
* The benchmark uses MyScale MSTG as the vector index (a hedged Top-k query sketch follows below).
* This benchmark can be reproduced with our GitHub repository [Retrieval-QA-Benchmark](https://github.com/myscale/Retrieval-QA-Benchmark).
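For intuition, here is a hedged sketch of the kind of Top-k query the MSTG index serves during retrieval. It is not the benchmark's actual code: the connection details, the table and column names (`wiki_abstract`, `body`, `vector`), and the 768-dimensional placeholder embedding are assumptions you would replace with your own setup.

```python
import clickhouse_connect

# Placeholder connection details for a MyScale cluster.
client = clickhouse_connect.get_client(
    host="your-myscale-host", port=8443, username="user", password="***"
)

# Placeholder query embedding; the real pipeline embeds the question text.
query_embedding = [0.0] * 768

# Assumed schema: table `wiki_abstract` with a `body` text column and a
# `vector` column backed by an MSTG vector index.
rows = client.query(
    """
    SELECT body, distance(vector, {q:Array(Float32)}) AS dist
    FROM wiki_abstract
    ORDER BY dist
    LIMIT 5
    """,
    parameters={"q": query_embedding},
).result_rows

for body, dist in rows:
    print(f"{dist:.4f}  {body[:80]}")
```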
---

### with FAISS
| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
|------|----------|----------------|-----------------|-------------------|-----------------------|-------------------------|---------|
| gpt-3.5-turbo | (no context) | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| | Top-1 | 75.00% (+3.29%) | 77.16% (+6.79%) | 44.00% (+6.00%) | 66.47% (-1.16%) | 73.58% (-1.14%) | 70.81% (+2.76%) |
| | Top-3 | 75.66% (+3.95%) | 80.25% (+9.88%) | 44.00% (+6.00%) | 65.90% (-1.73%) | 73.21% (-1.51%) | 71.70% (+3.65%) |
| | Top-5 | 78.29% (+6.58%) | 79.32% (+8.95%) | 46.00% (+8.00%) | 65.90% (-1.73%) | 73.58% (-1.14%) | 72.09% (+4.04%) |
| | Top-10 | 78.29% (+6.58%) | 80.86% (+10.49%) | 49.00% (+11.00%) | 69.94% (+2.31%) | 75.85% (+1.13%) | 74.16% (+6.11%) |
| llama2-13b-chat-q6_0 | (no context) | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| | Top-1 | 57.89% (+4.60%) | 61.42% (+4.01%) | 48.00% (+15.00%) | 45.66% (+1.15%) | 55.09% (+4.90%) | 55.22% (+4.92%) |
| | Top-3 | 59.21% (+5.92%) | 65.74% (+8.33%) | 50.00% (+17.00%) | 50.29% (+5.78%) | 56.98% (+6.79%) | 58.28% (+7.98%) |
| | Top-5 | 65.79% (+12.50%) | 64.51% (+7.10%) | 48.00% (+15.00%) | 50.29% (+5.78%) | 58.11% (+7.92%) | 58.97% (+8.67%) |
| | Top-10 | 65.13% (+11.84%) | 66.05% (+8.64%) | 48.00% (+15.00%) | 47.40% (+2.89%) | 56.23% (+6.04%) | 58.38% (+8.08%) |
* The benchmark uses FAISS IVFSQ (nprobe=128) as the vector index.
* This benchmark can be reproduced with our GitHub repository [Retrieval-QA-Benchmark](https://github.com/myscale/Retrieval-QA-Benchmark).
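The parenthesized values in both tables are percentage-point deltas against the same LLM's no-context baseline row. A tiny illustrative check, using a few gpt-3.5-turbo numbers from the FAISS table above:

```python
# Illustrative only: reproduce the "(+x.xx%)" deltas from the FAISS table,
# i.e. percentage points gained over gpt-3.5-turbo's no-context baseline.
baseline = {"mmlu-astronomy": 71.71, "mmlu-prehistory": 70.37, "mmlu-global-facts": 38.00}
top1_faiss = {"mmlu-astronomy": 75.00, "mmlu-prehistory": 77.16, "mmlu-global-facts": 44.00}

for dataset, acc in top1_faiss.items():
    delta = acc - baseline[dataset]
    print(f"{dataset}: {acc:.2f}% ({delta:+.2f}%)")
```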