# mcp-bench

**Repository Path**: mirrors_huggingface/mcp-bench

## Basic Information

- **Project Name**: mcp-bench
- **Description**: MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: trl-internal
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-03
- **Last Updated**: 2025-12-20

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

[![arXiv](https://img.shields.io/badge/arXiv-2508.20453-b31b1b.svg)](https://arxiv.org/abs/2508.20453)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python Version](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
[![MCP Protocol](https://img.shields.io/badge/MCP-Protocol-green)](https://github.com/anthropics/mcp)

![MCP-Bench](./images/mcpbench_intro.png)

## 🍴 Fork Instructions

To run the code in this project, first create a Python virtual environment, e.g. with `uv`. To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/getting-started/installation/).

```shell
uv self update
uv venv mcpbench --python 3.11 && source mcpbench/bin/activate && uv pip install --upgrade pip
```

Next, install vLLM:

```shell
uv pip install vllm==0.10.0
```

Then install the MCP dependencies:

```shell
pushd mcp_servers && bash ./install-no-sudo.sh && popd
```

**Configure MCP Server API Keys**

Some MCP servers require external API keys to function properly. These keys are automatically loaded from `./mcp_servers/api_key`, and you need to fill them in yourself in that file:

```bash
# View configured API keys
cat ./mcp_servers/api_key
```

One of these keys is an `HF_TOKEN`, so Lewis has created a [dummy user `h4-bot`](https://huggingface.co/h4-bot) with limited Hub access. You can find the rest of the keys on the HFC at [/fsx/lewis/git/hf/mcp-bench/mcp_servers/api_key](/fsx/lewis/git/hf/mcp-bench/mcp_servers/api_key).

**Test everything works**

Spin up a vLLM server:

```sh
vllm serve Qwen/Qwen3-4B-Instruct-2507 \
    --port 8000 --host 0.0.0.0 \
    --enable-auto-tool-choice --tool-call-parser hermes
```

> [!NOTE]
> Make sure the tool-call parser is configured correctly; otherwise, you will get garbage results! See the [vLLM docs](https://docs.vllm.ai/en/stable/features/tool_calling.html).

Then run the test benchmark:

```sh
HUGGINGFACE_BASE_URL="http://localhost:8000/v1" python run_benchmark.py --models huggingface --tasks-file tasks/test.json --distraction-count 0 --disable-judge-stability
```

If everything works, you'll see the results stored in `benchmark_results_{timestamp}.json`.
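To take a quick look at the output of a run, a snippet like the one below works. This is only a sketch: it loads the newest results file and prints its top-level structure, without assuming anything about the result schema.

```python
# Sketch: open the most recent benchmark_results_*.json and show its top-level
# structure. No particular result schema is assumed here.
import glob
import json
import os

latest = max(glob.glob("benchmark_results_*.json"), key=os.path.getmtime)
with open(latest) as f:
    results = json.load(f)

print(f"Loaded {latest}")
if isinstance(results, dict):
    print("Top-level keys:", sorted(results.keys()))
else:
    print(f"Top-level entries: {len(results)}")
```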
To run all tasks or a subset of them, run:

```sh
export HUGGINGFACE_BASE_URL="http://localhost:8000/v1"

## run all tasks
python run_benchmark.py --models huggingface

## single server tasks
python run_benchmark.py --models huggingface \
    --tasks-file tasks/mcpbench_tasks_single_runner_format.json

## two server tasks
python run_benchmark.py --models huggingface \
    --tasks-file tasks/mcpbench_tasks_multi_2server_runner_format.json

## three server tasks
python run_benchmark.py --models huggingface \
    --tasks-file tasks/mcpbench_tasks_multi_3server_runner_format.json
```

## Overview

MCP-Bench is a comprehensive evaluation framework designed to assess the tool-use capabilities of Large Language Models (LLMs) through the Model Context Protocol (MCP). The benchmark provides an end-to-end pipeline for evaluating how effectively different LLMs can discover, select, and use tools to solve real-world tasks.

## Leaderboard

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | gpt-5 | 0.749 |
| 2 | o3 | 0.715 |
| 3 | gpt-oss-120b | 0.692 |
| 4 | gemini-2.5-pro | 0.690 |
| 5 | claude-sonnet-4 | 0.681 |
| 6 | qwen3-235b-a22b-2507 | 0.678 |
| 7 | glm-4.5 | 0.668 |
| 8 | gpt-oss-20b | 0.654 |
| 9 | kimi-k2 | 0.629 |
| 10 | qwen3-30b-a3b-instruct-2507 | 0.627 |
| 11 | gemini-2.5-flash-lite | 0.598 |
| 12 | gpt-4o | 0.595 |
| 13 | gemma-3-27b-it | 0.582 |
| 14 | llama-3-3-70b-instruct | 0.558 |
| 15 | gpt-4o-mini | 0.557 |
| 16 | mistral-small-2503 | 0.530 |
| 17 | llama-3-1-70b-instruct | 0.510 |
| 18 | nova-micro-v1 | 0.508 |
| 19 | llama-3-2-90b-vision-instruct | 0.495 |
| 20 | llama-3-1-8b-instruct | 0.428 |

*Overall Score is the average performance across all evaluation dimensions, including rule-based schema understanding, LLM-judged task completion, tool usage, and planning effectiveness. Scores are averaged across single-server and multi-server settings.*

## Quick Start

### Installation

1. **Clone the repository**

```bash
git clone https://github.com/accenture/mcp-bench.git
cd mcp-bench
```

2. **Install dependencies**

```bash
conda create -n mcpbench python=3.10
conda activate mcpbench
cd mcp_servers
# Install MCP server dependencies
bash ./install.sh
cd ..
```

3. **Set up environment variables**

```bash
# Create .env file with API keys
# Default setup uses both OpenRouter and Azure OpenAI
# For Azure OpenAI, you also need to set your API version in benchmark_config.yaml (line 205)
# For an OpenRouter-only setup, see the "Optional: Using only OpenRouter API" section below
cat > .env << EOF
export OPENROUTER_API_KEY="your_openrouterkey_here"
export AZURE_OPENAI_API_KEY="your_azureopenai_apikey_here"
export AZURE_OPENAI_ENDPOINT="your_azureopenai_endpoint_here"
EOF
```

4. **Configure MCP Server API Keys**

Some MCP servers require external API keys to function properly. These keys are automatically loaded from `./mcp_servers/api_key`, and you need to fill them in yourself in that file:

```bash
# View configured API keys
cat ./mcp_servers/api_key
```

The required API keys are listed below; all of them are free and easy to obtain (you can get all of them within about 10 minutes). A quick completeness check is sketched after the list.

- `NPS_API_KEY`: National Park Service API key (for nationalparks server) - [Get API key](https://www.nps.gov/subjects/developer/get-started.htm)
- `NASA_API_KEY`: NASA Open Data API key (for nasa-mcp server) - [Get API key](https://api.nasa.gov/)
- `HF_TOKEN`: Hugging Face token (for huggingface-mcp-server) - [Get token](https://huggingface.co/docs/hub/security-tokens)
- `GOOGLE_MAPS_API_KEY`: Google Maps API key (for mcp-google-map server) - [Get API key](https://developers.google.com/maps)
- `NCI_API_KEY`: National Cancer Institute API key (for biomcp server) - [Get API key](https://clinicaltrialsapi.cancer.gov/)
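To confirm that all five keys above are actually filled in, something like the sketch below can help. It assumes a simple `KEY=value`, one-entry-per-line layout in `./mcp_servers/api_key`; adjust the parsing if the file uses a different format.

```python
# Sketch: report which of the documented MCP server keys are missing or empty.
# Assumes ./mcp_servers/api_key uses a plain KEY=value layout (one per line).
from pathlib import Path

REQUIRED_KEYS = ["NPS_API_KEY", "NASA_API_KEY", "HF_TOKEN", "GOOGLE_MAPS_API_KEY", "NCI_API_KEY"]

entries = {}
for line in Path("mcp_servers/api_key").read_text().splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip().strip('"')

missing = [key for key in REQUIRED_KEYS if not entries.get(key)]
print("Missing or empty keys:", ", ".join(missing) if missing else "none")
```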
### Basic Usage

```bash
# 1. Verify all MCP servers can be connected
## You should see "28/28 servers connected"
## and "All successfully connected servers returned tools!" after running this
python ./utils/collect_mcp_info.py

# 2. List available models
source .env
python run_benchmark.py --list-models

# 3. Run benchmark (gpt-oss-20b as an example)
## run all tasks
source .env
python run_benchmark.py --models gpt-oss-20b

## single server tasks
source .env
python run_benchmark.py --models gpt-oss-20b \
    --tasks-file tasks/mcpbench_tasks_single_runner_format.json

## two server tasks
source .env
python run_benchmark.py --models gpt-oss-20b \
    --tasks-file tasks/mcpbench_tasks_multi_2server_runner_format.json

## three server tasks
source .env
python run_benchmark.py --models gpt-oss-20b \
    --tasks-file tasks/mcpbench_tasks_multi_3server_runner_format.json
```
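To sweep the single-, two-, and three-server suites in one go, you can wrap the commands above in a small script. The following is only a convenience sketch: it reuses the exact CLI invocations shown above with `gpt-oss-20b` as the example model, and assumes the environment from `.env` has already been exported (e.g. via `source .env`).

```python
# Sketch: run the three task suites back to back with one model, reusing the
# CLI invocations shown above. Assumes `source .env` has already been run in
# the current shell so the subprocesses inherit the API keys.
import subprocess

TASK_FILES = [
    "tasks/mcpbench_tasks_single_runner_format.json",
    "tasks/mcpbench_tasks_multi_2server_runner_format.json",
    "tasks/mcpbench_tasks_multi_3server_runner_format.json",
]

for task_file in TASK_FILES:
    subprocess.run(
        ["python", "run_benchmark.py", "--models", "gpt-oss-20b", "--tasks-file", task_file],
        check=True,  # stop if a suite fails
    )
```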
### Optional: Add other model providers

To add new models from OpenRouter:

1. **Find your model on OpenRouter**
   - Visit [OpenRouter Models](https://openrouter.ai/models) to browse available models
   - Copy the model ID (e.g., `anthropic/claude-sonnet-4` or `meta-llama/llama-3.3-70b-instruct`)

2. **Add the model configuration**
   - Edit `llm/factory.py` and add your model in the OpenRouter section (around line 152)
   - Follow this pattern:

```python
configs["your-model-name"] = ModelConfig(
    name="your-model-name",
    provider_type="openrouter",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1",
    model_name="provider/model-id"  # The exact model ID from OpenRouter
)
```

3. **Verify the model is available**

```bash
source .env
python run_benchmark.py --list-models
# Your new model should appear in the list
```

4. **Run benchmark with your model**

```bash
source .env
python run_benchmark.py --models your-model-name
```

### Optional: Using only OpenRouter API

If you only want to use OpenRouter without Azure:

1. **Set up the .env file with only OpenRouter:**

```bash
cat > .env << EOF
OPENROUTER_API_KEY=your_openrouterkey_here
EOF
```

2. **Modify the code to access Azure models through OpenRouter:**

Edit `llm/factory.py`, comment out the Azure section (lines 69-101), and then add the Azure models through OpenRouter instead:

```python
# Comment out or remove the Azure section (lines 69-109)
# if os.getenv("AZURE_OPENAI_API_KEY") and os.getenv("AZURE_OPENAI_ENDPOINT"):
#     configs["o4-mini"] = ModelConfig(...)
#     ...

# Add Azure models through OpenRouter (in the OpenRouter section around line 106)
if os.getenv("OPENROUTER_API_KEY"):
    # Add OpenAI models via OpenRouter
    configs["gpt-4o"] = ModelConfig(
        name="gpt-4o",
        provider_type="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
        model_name="openai/gpt-4o"
    )
    configs["gpt-4o-mini"] = ModelConfig(
        name="gpt-4o-mini",
        provider_type="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
        model_name="openai/gpt-4o-mini"
    )
    configs["o3"] = ModelConfig(
        name="o3",
        provider_type="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
        model_name="openai/o3"
    )
    configs["o4-mini"] = ModelConfig(
        name="o4-mini",
        provider_type="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
        model_name="openai/o4-mini"
    )
    configs["gpt-5"] = ModelConfig(
        name="gpt-5",
        provider_type="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
        model_name="openai/gpt-5"
    )
    # Keep existing OpenRouter models...
```

This way, all models are accessed through OpenRouter's unified API.
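Since the entries above differ only in the display name and OpenRouter model ID, a loop keeps `llm/factory.py` a bit shorter. This is just a sketch of the same pattern: it assumes `configs` and `ModelConfig` are in scope exactly as in the snippet above, and the model IDs are the same examples.

```python
# Sketch: register several OpenRouter-hosted models with one loop instead of
# repeating the ModelConfig boilerplate. Drop this into the same place in
# llm/factory.py; `configs` and `ModelConfig` come from the surrounding code.
import os

OPENROUTER_MODEL_IDS = {
    "gpt-4o": "openai/gpt-4o",
    "gpt-4o-mini": "openai/gpt-4o-mini",
    "o3": "openai/o3",
    "o4-mini": "openai/o4-mini",
    "gpt-5": "openai/gpt-5",
}

if os.getenv("OPENROUTER_API_KEY"):
    for name, model_id in OPENROUTER_MODEL_IDS.items():
        configs[name] = ModelConfig(
            name=name,
            provider_type="openrouter",
            api_key=os.getenv("OPENROUTER_API_KEY"),
            base_url="https://openrouter.ai/api/v1",
            model_name=model_id,
        )
```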
## MCP Servers

MCP-Bench includes 28 diverse MCP servers:

- [BioMCP](https://github.com/genomoncology/biomcp) - Biomedical research data, clinical trials, and health information
- [Bibliomantic](https://github.com/d4nshields/bibliomantic-mcp-server) - I Ching divination, hexagrams, and mystical guidance
- [Call for Papers](https://github.com/iremert/call-for-papers-mcp) - Academic conference submissions and call announcements
- [Car Price Evaluator](https://github.com/yusaaztrk/car-price-mcp-main) - Vehicle valuation and automotive market analysis
- [Context7](https://github.com/upstash/context7) - Project context management and documentation services
- [DEX Paprika](https://github.com/coinpaprika/dexpaprika-mcp) - Cryptocurrency DeFi analytics and decentralized exchange data
- [FruityVice](https://github.com/CelalKhalilov/fruityvice-mcp) - Comprehensive fruit nutrition information and dietary data
- [Game Trends](https://github.com/halismertkir/game-trends-mcp) - Gaming industry statistics and trend analysis
- [Google Maps](https://github.com/cablate/mcp-google-map) - Location services, geocoding, and mapping functionality
- [Huge Icons](https://github.com/hugeicons/mcp-server) - Icon search, management, and design resources
- [Hugging Face](https://github.com/shreyaskarnik/huggingface-mcp-server) - Machine learning models, datasets, and AI capabilities
- [Math MCP](https://github.com/EthanHenrickson/math-mcp) - Mathematical calculations and computational operations
- [Medical Calculator](https://github.com/vitaldb/medcalc) - Clinical calculation tools and medical formulas
- [Metropolitan Museum](https://github.com/mikechao/metmuseum-mcp) - Art collection database and museum information
- [Movie Recommender](https://github.com/iremert/movie-recommender-mcp) - Film recommendations and movie metadata
- [NASA Data](https://github.com/AnCode666/nasa-mcp) - Space mission data and astronomical information
- [National Parks](https://github.com/KyrieTangSheng/mcp-server-nationalparks) - US National Parks information and visitor services
- [NixOS](https://github.com/utensils/mcp-nixos) - Package management and system configuration tools
- [OKX Exchange](https://github.com/esshka/okx-mcp) - Cryptocurrency trading data and market information
- [OpenAPI Explorer](https://github.com/janwilmake/openapi-mcp-server) - API specification exploration and testing tools
- [OSINT Intelligence](https://github.com/himanshusanecha/mcp-osint-server) - Open source intelligence gathering and analysis
- [Paper Search](https://github.com/openags/paper-search-mcp) - Academic paper search across multiple research databases
- [Reddit](https://github.com/dumyCq/mcp-reddit) - Social media content and community discussions
- [Scientific Computing](https://github.com/Aman-Amith-Shastry/scientific_computation_mcp) - Advanced mathematical computations and data analysis
- [Time MCP](https://github.com/dumyCq/time-mcp) - Date and time utilities and timezone conversions
- [Unit Converter](https://github.com/zazencodes/unit-converter-mcp) - Measurement conversions across different unit systems
- [Weather Data](https://github.com/HarunGuclu/weather_mcp) - Weather forecasts and meteorological information
- [Wikipedia](https://github.com/Rudra-ravi/wikipedia-mcp) - Encyclopedia content search and retrieval

## Project Structure

```
mcp-bench/
├── agent/                          # Task execution agents
│   ├── __init__.py
│   ├── executor.py                 # Multi-round task executor with retry logic
│   └── execution_context.py        # Execution context management
├── benchmark/                      # Evaluation framework
│   ├── __init__.py
│   ├── evaluator.py                # LLM-as-judge evaluation metrics
│   ├── runner.py                   # Benchmark orchestrator
│   ├── results_aggregator.py       # Results aggregation and statistics
│   └── results_formatter.py        # Results formatting and display
├── config/                         # Configuration management
│   ├── __init__.py
│   ├── benchmark_config.yaml       # Benchmark configuration
│   └── config_loader.py            # Configuration loader
├── llm/                            # LLM provider abstractions
│   ├── __init__.py
│   ├── factory.py                  # Model factory for multiple providers
│   └── provider.py                 # Unified provider interface
├── mcp_modules/                    # MCP server management
│   ├── __init__.py
│   ├── connector.py                # Server connection handling
│   ├── server_manager.py           # Multi-server orchestration
│   ├── server_manager_persistent.py # Persistent connection manager
│   └── tool_cache.py               # Tool call caching mechanism
├── synthesis/                      # Task generation
│   ├── __init__.py
│   ├── task_synthesis.py           # Task generation with fuzzy conversion
│   ├── generate_benchmark_tasks.py # Batch task generation script
│   ├── benchmark_generator.py      # Unified benchmark task generator
│   ├── README.md                   # Task synthesis documentation
│   └── split_combinations/         # Server combination splits
│       ├── mcp_2server_combinations.json
│       └── mcp_3server_combinations.json
├── utils/                          # Utilities
│   ├── __init__.py
│   ├── collect_mcp_info.py         # Server discovery and tool collection
│   ├── local_server_config.py      # Local server configuration
│   └── error_handler.py            # Error handling utilities
├── tasks/                          # Benchmark task files
│   ├── mcpbench_tasks_single_runner_format.json
│   ├── mcpbench_tasks_multi_2server_runner_format.json
│   └── mcpbench_tasks_multi_3server_runner_format.json
├── mcp_servers/                    # MCP server implementations (28 servers)
│   ├── api_key                     # API keys configuration file
│   ├── commands.json               # Server command configurations
│   ├── install.sh                  # Installation script for all servers
│   ├── requirements.txt            # Python dependencies
│   └── [28 server directories]
├── cache/                          # Tool call cache directory (auto-created)
├── run_benchmark.py                # Main benchmark runner script
├── README.md                       # Project documentation
├── .gitignore                      # Git ignore configuration
└── .gitmodules                     # Git submodules configuration
```
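If you want to look at the benchmark inputs directly, the task files under `tasks/` are plain JSON and the benchmark configuration is YAML. The sketch below only peeks at them and makes no assumptions about their internal schema (it needs PyYAML for the config file).

```python
# Sketch: peek at a task file and the benchmark config without assuming their
# internal schema. Requires PyYAML for the YAML config.
import json
import yaml

with open("tasks/mcpbench_tasks_single_runner_format.json") as f:
    tasks = json.load(f)
print("Task file type:", type(tasks).__name__,
      "| entries:", len(tasks) if hasattr(tasks, "__len__") else "n/a")

with open("config/benchmark_config.yaml") as f:
    config = yaml.safe_load(f)
print("Config sections:", sorted(config) if isinstance(config, dict) else type(config).__name__)
```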
## Citation

If you use MCP-Bench in your research, please cite:

```bibtex
@article{wang2025mcpbench,
  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
  author={Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene},
  journal={arXiv preprint arXiv:2508.20453},
  year={2025}
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=accenture/mcp-bench&type=Date)](https://star-history.com/#accenture/mcp-bench&Date)

## Acknowledgments

- Built on the [Model Context Protocol](https://github.com/anthropics/mcp) by Anthropic
- Thanks to all of the open-source MCP server implementations used in this benchmark