# ScreenSuite

**Repository Path**: mirrors/ScreenSuite

## Basic Information

- **Project Name**: ScreenSuite
- **Description**: ScreenSuite is a comprehensive test suite dedicated to evaluating GUI agents
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: https://www.oschina.net/p/screensuite
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-13
- **Last Updated**: 2026-02-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ScreenSuite

A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our [Computer Agent](https://huggingface.co/spaces/smolagents/computer-agent)) across areas of ability: perception, single-step and multi-step agentic behaviour.

This suite does not aim to compare agent implementations, only the MLLMs that power them: thus we provide only simple agent implementations based on [smolagents](https://github.com/huggingface/smolagents).

# GUI Agent Benchmarks Overview

## Grounding/Perception Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| ScreenSpot | BBox + click accuracy | Web | [HuggingFace](https://huggingface.co/datasets/rootsautomation/ScreenSpot) |
| ScreenSpot v2 | BBox + click accuracy | Web | [HuggingFace](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2) |
| ScreenSpot-Pro | BBox + click accuracy | Web | [HuggingFace](https://huggingface.co/datasets/HongxinLi/ScreenSpot-Pro) |
| Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | [HuggingFace](https://huggingface.co/datasets/visualwebbench/VisualWebBench) |
| WebSRC | Web QA | Web | [HuggingFace](https://huggingface.co/datasets/X-LANCE/WebSRC_v1.0) |
| ScreenQA-short | Mobile QA | Mobile | [HuggingFace](https://huggingface.co/datasets/rootsautomation/RICO-ScreenQA-Short) |
| ScreenQA-complex | Mobile QA | Mobile | [HuggingFace](https://huggingface.co/datasets/rootsautomation/RICO-ScreenQA-Complex) |
| Showdown-Clicks | Click prediction | Web | [HuggingFace](https://huggingface.co/datasets/generalagents/showdown-clicks) |

## Single-Step - Offline Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| Multimodal-Mind2Web | Web navigation | Web | [HuggingFace](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web) |
| AndroidControl | Mobile control | Mobile | [GitHub](https://github.com/google-research/google-research/tree/master/android_control) |

## Multi-Step - Online Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| Mind2Web-Live | URL matching | Web | [HuggingFace](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) |
| GAIA | Exact match | Web | [HuggingFace](https://huggingface.co/datasets/gaia-benchmark/GAIA) |
| BrowseComp | LLM judge | Web | [Link](https://openai.com/index/browsecomp/) |
| AndroidWorld | Task-specific | Mobile | [GitHub](https://github.com/google-research/android_world) |
| MobileMiniWob | Task-specific | Mobile | Included in AndroidWorld [GitHub](https://github.com/Farama-Foundation/miniwob-plusplus) |
| OSWorld | Task-specific | Desktop | [GitHub](https://github.com/xlang-ai/OSWorld) |

![screensuite_benchmark_scores_filtered](https://github.com/user-attachments/assets/2081bede-d5f2-44a3-8605-645b1224d6e5)
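Every benchmark listed above is exposed through the `screensuite` registry. As a quick sanity check before running anything, you can enumerate the registered benchmarks; this is a minimal sketch that relies only on the `get_registry()`, `list_all()`, `name`, and `tags` helpers used in the example script further below, so exact attributes may differ in your installed version.

```python
# Minimal sketch (assumes the registry helpers used in the example script below):
# print the name and tags of every registered benchmark.
from screensuite import get_registry

registry = get_registry()
for benchmark in registry.list_all():
    print(benchmark.name, getattr(benchmark, "tags", None))
```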
## Cloning the Repository

Make sure to clone the repository with the required submodules:

```bash
git clone --recurse-submodules git@github.com:huggingface/screensuite.git
```

or, if you have already cloned the repository (also run this after pulling branches, to update the submodules):

```bash
git submodule update --init --recursive
```

## Requirements

- Docker
- Python >= 3.11
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

> For multi-step agent benchmarks, we need to spawn containerized environments. To do so, you need KVM virtualization enabled.
> To check if your hosting platform supports KVM, run
>
> ```bash
> egrep -c '(vmx|svm)' /proc/cpuinfo
> ```
>
> on Linux. If the return value is greater than zero, the processor should be able to support KVM.
>
> Note: macOS hosts generally do not support KVM.

## Installation

```bash
# Using uv (faster)
uv sync --extra submodules --python 3.11
```

> If you encounter issues with the `evdev` Python package, you can try installing the build-essential package:
>
> ```bash
> sudo apt-get install build-essential
> ```

## Development

```bash
# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure
```

## Running the benchmarks

```python
#!/usr/bin/env python
import json
import os
from datetime import datetime

from dotenv import load_dotenv
from smolagents.models import InferenceClientModel

from screensuite import (
    EvaluationConfig,
    ImageResizeConfig,
    OSWorldEnvironmentConfig,
    get_registry,
)

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)


def run_benchmarks():
    # Get benchmarks to run
    registry = get_registry()
    # benchmarks = registry.list_all()
    benchmarks = registry.get(
        [
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )
    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )
    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # See the smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 1
    parallel_workers = 1
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Configure based on benchmark type
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test,
            # Resizes the images given to the model - defaults to the Qwen2.5-VL resize values
            image_resize_config=ImageResizeConfig(),
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")
        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue


if __name__ == "__main__":
    run_benchmarks()
```
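Each run appends one JSON object per benchmark to `results/results_<run_name>.jsonl`, using the `benchmark_name` and `metrics` keys written by the script above. Below is a minimal sketch for reading such a file back; the file name is a hypothetical example and should be adjusted to the `run_name` you actually used.

```python
# Minimal sketch: summarize a results file written by run_benchmarks() above.
# The JSONL schema ("benchmark_name", "metrics") matches what that script writes.
import json

results_path = "results/results_test_2025-06-13.jsonl"  # hypothetical run_name; adjust to yours
with open(results_path) as f:
    for line in f:
        entry = json.loads(line)
        print(f"{entry['benchmark_name']}: {entry['metrics']}")
```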
"osworld" in benchmark.tags else None, ) print(f"Results: {results._metrics}") # Save results with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f: entry = {"benchmark_name": benchmark.name, "metrics": results._metrics} f.write(json.dumps(entry) + "\n") except Exception as e: print(f"Error in {benchmark.name}: {e}") continue if __name__ == "__main__": run_benchmarks() ``` ## OSWorld Google Tasks To run OSWorld Google tasks, you need to create a Google account and a Google Cloud project. See [OSWorld documentation](https://github.com/xlang-ai/OSWorld/blob/main/ACCOUNT_GUIDELINE.md) for more details. ## License This project is licensed under the terms of the Apache License 2.0. ## Cite ScreenSuite If you use ScreenSuite in your publication, please cite it by using the following BibTeX entry. ```bibtex @Misc{screensuite, title = {ScreenSuite: the most comprehensive benchmarking suite for GUI agents}, author = {Aymeric Roucher and Amir Mahla}, howpublished = {\url{https://github.com/huggingface/screensuite}}, year = {2025} } ```