# ToolCall-15

**Repository Path**: pioneerdk/tool-call-15

## Basic Information

- **Project Name**: ToolCall-15
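- **Description**: ToolCall-15 is a visual benchmark for comparing LLM tool use. It runs 15 predefined scenarios through an OpenAI-compatible chat completions interface, evaluates each result deterministically, and presents the full result matrix in a live dashboard.

The suite is designed for practical evaluation rather than abstract benchmark math: every scenario has a clear expected behavior, a mocked tool environment, and an inspectable pass, partial, or fail outcome.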
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: 60415
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-12
- **Last Updated**: 2026-05-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ToolCall-15

![ToolCall-15 screenshot](./screenshot.png)

ToolCall-15 is a visual benchmark for comparing LLM tool use. It runs 15 fixed scenarios through an OpenAI-compatible chat completions interface, scores each result deterministically, and renders the full matrix in a live dashboard.

The suite is designed for practical evaluation rather than abstract benchmark math: every scenario has a clear expected behavior, a mocked tool environment, and an inspectable pass, partial, or fail outcome.

## What It Measures

ToolCall-15 is organized into 5 categories, with 3 scenarios per category:

- Tool Selection
- Parameter Precision
- Multi-Step Chains
- Restraint and Refusal
- Error Recovery

Each scenario is scored as:

- `2` points for a pass
- `1` point for a partial pass
- `0` points for a fail

Each category is worth `6` points. The final score is the average of the 5 category percentages, rounded to a whole number.
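
The arithmetic above can be sketched as follows. This is an illustration and not the code in `lib/benchmark.ts`: the names `Outcome`, `categoryPercent`, and `finalScore` are hypothetical, and the use of `Math.round` on the unrounded average is an assumption about how the rounding is applied.

```typescript
type Outcome = "pass" | "partial" | "fail";

// Points per outcome, as listed in the README.
const POINTS: Record<Outcome, number> = { pass: 2, partial: 1, fail: 0 };

// Percentage earned in one category (3 scenarios, so 6 points maximum).
function categoryPercent(outcomes: Outcome[]): number {
  const earned = outcomes.reduce((sum, o) => sum + POINTS[o], 0);
  const max = outcomes.length * POINTS.pass;
  return (earned / max) * 100;
}

// Final score: mean of the category percentages, rounded to a whole number.
// Assumption: rounding happens once, at the end.
function finalScore(categories: Outcome[][]): number {
  const total = categories.reduce((sum, c) => sum + categoryPercent(c), 0);
  return Math.round(total / categories.length);
}

const example: Outcome[][] = [
  ["pass", "pass", "pass"],          // 6/6 = 100%
  ["pass", "partial", "fail"],       // 3/6 = 50%
  ["pass", "pass", "partial"],       // 5/6 = 83.33...%
  ["fail", "fail", "fail"],          // 0/6 = 0%
  ["partial", "partial", "partial"], // 3/6 = 50%
];

console.log(finalScore(example)); // 57 (283.33 / 5 = 56.67, rounded)
```

Because every category carries the same weight, a model that fails one category entirely cannot score above 80 regardless of how it performs elsewhere.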

## Methodology

The benchmark spec is documented in [METHODOLOGY.md](./METHODOLOGY.md) and implemented in [lib/benchmark.ts](./lib/benchmark.ts).

### Design goals

- Reproducible: the system prompt, tool schema, mocked tool outputs, and scoring logic are all versioned in the repo.
- Visual: the dashboard makes the outcome of each scenario obvious without external scoring scripts.
- Balanced: the suite spreads scenarios across distinct tool-use failure modes instead of over-indexing on one skill.
- Deterministic: tool results are mocked and the benchmark uses `temperature: 0`.
- Inspectable: every scenario stores a raw trace so failures can be audited.

### Execution model

For every scenario, each model receives:

1. A shared system prompt.
2. A fixed benchmark context message that sets the reference date to `2026-03-20 (Friday)` for relative-time tasks.
3. The scenario user message.
4. The same universal tool set of 12 functions.

The runner then:

1. Calls the model through `/chat/completions`.
2. Executes any requested tool calls against deterministic mock handlers.
3. Appends tool results back into the conversation.
4. Repeats for up to 8 assistant turns.
5. Evaluates the final trace against scenario-specific scoring logic.
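
A minimal sketch of this loop, under stated assumptions, is shown below. The message shape is deliberately flatter than the real chat completions payload, and `callModel`, `MockTools`, and `runScenario` are hypothetical names rather than the API of `lib/orchestrator.ts`. Scoring (step 5) is left out because it is scenario-specific.

```typescript
type ToolCall = { id: string; name: string; arguments: string };
type AssistantMessage = {
  role: "assistant";
  content: string | null;
  tool_calls?: ToolCall[];
};
type Message =
  | { role: "system" | "user"; content: string }
  | AssistantMessage
  | { role: "tool"; tool_call_id: string; content: string };

// Hypothetical stand-ins for the model client and the mock tool handlers.
type CallModel = (messages: Message[]) => Promise<AssistantMessage>;
type MockTools = Record<string, (args: unknown) => unknown>;

const MAX_ASSISTANT_TURNS = 8;

async function runScenario(
  initial: Message[],
  callModel: CallModel,
  tools: MockTools,
): Promise<Message[]> {
  const trace: Message[] = [...initial];

  for (let turn = 0; turn < MAX_ASSISTANT_TURNS; turn++) {
    const reply = await callModel(trace);
    trace.push(reply);

    const calls = reply.tool_calls ?? [];
    if (calls.length === 0) break; // no tool calls: treat as the final answer

    for (const call of calls) {
      let result: unknown;
      try {
        const handler = tools[call.name];
        result = handler
          ? handler(JSON.parse(call.arguments))
          : { error: `unknown tool: ${call.name}` };
      } catch {
        // Malformed JSON arguments are reported back to the model.
        result = { error: "invalid tool arguments" };
      }
      // Append the deterministic tool result so the next turn can see it.
      trace.push({
        role: "tool",
        tool_call_id: call.id,
        content: JSON.stringify(result),
      });
    }
  }

  return trace; // the trace is what scenario-specific scoring would inspect
}
```

Returning the full trace rather than only the last message keeps every intermediate tool call available for scoring and for later inspection.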

Provider errors matching `provider returned error` are retried up to 3 times with backoff. Model requests time out after 30 seconds by default, and the timeout can be overridden with `MODEL_REQUEST_TIMEOUT_SECONDS` in `.env`.
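
The retry and timeout rules could look roughly like the sketch below. Several details are assumptions: the README only says "backoff", so the exponential schedule and 500 ms base delay are illustrative; "up to 3 times" is read as three retries after the first attempt; and `withRetry` and `timeoutMs` are hypothetical helper names.

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry only errors whose message contains "provider returned error".
async function withRetry<T>(
  attempt: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500, // assumed value; the README does not specify delays
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await attempt();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      const retryable = message
        .toLowerCase()
        .includes("provider returned error");
      if (!retryable || retry >= maxRetries) throw err;
      await sleep(baseDelayMs * 2 ** retry); // 500 ms, 1 s, 2 s by default
    }
  }
}

// Resolve the per-request timeout, falling back to the 30-second default
// when MODEL_REQUEST_TIMEOUT_SECONDS is unset or not a positive number.
function timeoutMs(env: Record<string, string | undefined>): number {
  const seconds = Number(env.MODEL_REQUEST_TIMEOUT_SECONDS);
  return (Number.isFinite(seconds) && seconds > 0 ? seconds : 30) * 1000;
}
```

In Node.js 20, one way to apply the timeout is to pass `AbortSignal.timeout(timeoutMs(process.env))` as the `signal` option of `fetch`.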

### Scoring details

- `pass`: the model followed the preferred tool behavior closely enough to earn full credit.
- `partial`: the model was functional but suboptimal or overly conservative.
- `fail`: the model hallucinated, chose the wrong tool, missed required parameters, or broke the intended flow.

The dashboard also distinguishes timeout failures visually so stalled runs are easy to spot.

## Supported Providers

ToolCall-15 accepts models from five OpenAI-compatible providers:

- `openrouter`
- `ollama`
- `llamacpp`
- `mlx`
- `lmstudio`

Model configuration uses comma-separated `provider:model` entries.

Examples:

```env
OPENROUTER_API_KEY=...
OLLAMA_HOST=http://localhost:11434
LLAMACPP_HOST=http://localhost:8080
MLX_HOST=http://localhost:8082
LMSTUDIO_HOST=http://localhost:1234

LLM_MODELS=openrouter:openai/gpt-4.1,ollama:qwen3.5:4b,llamacpp:local-model,lmstudio:qwen3.5-0.8b
LLM_MODELS_2=mlx:mlx-community/Qwen3.5-0.8B-8bit
```

Notes:

- `LLM_MODELS` is the primary table.
- `LLM_MODELS_2` is an optional secondary table for a separate comparison group.
- `OLLAMA_HOST`, `LLAMACPP_HOST`, `MLX_HOST`, and `LMSTUDIO_HOST` should be configured as raw hosts. The app normalizes them to the OpenAI-compatible `/v1` base URL.
- `MODEL_REQUEST_TIMEOUT_SECONDS` controls the per-request timeout for every provider. The default is `30`.
- Every configured `provider:model` must be unique across both env vars.
- Provider support is transport-level. Actual benchmark quality still depends on the specific model's tool-calling behavior.
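
As an illustration of the rules in these notes, the sketch below parses `provider:model` entries, checks uniqueness across both groups, and normalizes a raw host to a `/v1` base URL. It is not the code in `lib/models.ts`; `parseModels`, `assertUnique`, and `toBaseUrl` are hypothetical names. Splitting on the first colon only is an assumption that follows from entries such as `ollama:qwen3.5:4b`, where the model name itself contains a colon.

```typescript
const PROVIDERS = ["openrouter", "ollama", "llamacpp", "mlx", "lmstudio"] as const;
type Provider = (typeof PROVIDERS)[number];
type ModelEntry = { provider: Provider; model: string };

// Parse a comma-separated list of provider:model entries.
function parseModels(value: string | undefined): ModelEntry[] {
  return (value ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const sep = entry.indexOf(":"); // first colon separates provider from model
      const provider = entry.slice(0, sep);
      const model = entry.slice(sep + 1);
      if (sep < 1 || model.length === 0 || !PROVIDERS.includes(provider as Provider)) {
        throw new Error(`Invalid model entry: ${entry}`);
      }
      return { provider: provider as Provider, model };
    });
}

// Every provider:model pair must be unique across all groups.
function assertUnique(groups: ModelEntry[][]): void {
  const seen = new Set<string>();
  for (const group of groups) {
    for (const { provider, model } of group) {
      const key = `${provider}:${model}`;
      if (seen.has(key)) throw new Error(`Duplicate model entry: ${key}`);
      seen.add(key);
    }
  }
}

// Turn a raw host into an OpenAI-compatible base URL ending in /v1.
function toBaseUrl(host: string): string {
  const trimmed = host.replace(/\/+$/, "");
  return trimmed.endsWith("/v1") ? trimmed : `${trimmed}/v1`;
}

const primary = parseModels("openrouter:openai/gpt-4.1,ollama:qwen3.5:4b");
const secondary = parseModels("mlx:mlx-community/Qwen3.5-0.8B-8bit");
assertUnique([primary, secondary]);
console.log(primary[1].model);                    // qwen3.5:4b
console.log(toBaseUrl("http://localhost:11434")); // http://localhost:11434/v1
```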

## Getting Started

### Requirements

- Node.js 20 or newer
- npm
- At least one reachable OpenAI-compatible provider

### Install

```bash
npm install
cp .env.example .env
```

Then edit `.env` with your providers and models.

### Run

```bash
npm run dev
```

Open `http://localhost:3000`.

### Validation

```bash
npm run lint
npm run typecheck
```

## Dashboard Behavior

- The runner advances scenario-by-scenario, not model-by-model. Every displayed model completes the current scenario before the dashboard moves to the next column.
- The run button starts all configured models against all 15 scenarios.
- The config button opens a modal for generation parameters: `temperature`, `top_p`, `top_k`, and `min_p`.
- Benchmark config is stored in `localStorage` so the same browser keeps your latest settings between sessions.
- `Shift+Click` a scenario header to rerun only that scenario across all displayed models.
- Clicking a failed or timed-out cell opens the raw trace for that model and scenario.
- If `LLM_MODELS_2` is empty, the second table stays hidden.

## Repository Structure

- [app/](./app) contains the Next.js app router entry points and styles.
- [components/dashboard.tsx](./components/dashboard.tsx) renders the benchmark UI and live event handling.
- [app/api/run/route.ts](./app/api/run/route.ts) streams benchmark progress over Server-Sent Events.
- [lib/benchmark.ts](./lib/benchmark.ts) defines the benchmark spec, mocked tools, and scoring logic.
- [lib/orchestrator.ts](./lib/orchestrator.ts) runs scenarios and captures traces.
- [lib/llm-client.ts](./lib/llm-client.ts) contains the OpenAI-compatible client adapter.
- [lib/models.ts](./lib/models.ts) parses provider configuration and model groups.

## Limitations

- This is not a general intelligence benchmark. It isolates tool-use behavior under a fixed tool schema.
- The suite uses mocked tools, so it measures orchestration quality rather than live external service quality.
- The benchmark uses one universal system prompt and one deterministic date anchor; prompt-sensitive rankings may change under different instructions.
- Models are compared through OpenAI-compatible endpoints. Provider-specific extras outside that interface are intentionally ignored.

## License

This project is licensed under the MIT License. See [LICENSE](./LICENSE).

## Author

Created by [stevibe](https://x.com/stevibe).