# ScreenSuite

**Repository Path**: mirrors/ScreenSuite

## Basic Information

- **Project Name**: ScreenSuite
- **Description**: ScreenSuite is a comprehensive test suite dedicated to evaluating GUI agents
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: https://www.oschina.net/p/screensuite
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-13
- **Last Updated**: 2026-02-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ScreenSuite

A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our [Computer Agent](https://huggingface.co/spaces/smolagents/computer-agent)) across areas of ability: perception, single-step and multi-step agentic behaviour.

This suite does not aim to compare agent implementations, only the MLLMs that power them: thus we provide only simple agent implementations based on [smolagents](https://github.com/huggingface/smolagents).

# GUI Agent Benchmarks Overview

## Grounding/Perception Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| ScreenSpot | BBox + click accuracy | Web | [HuggingFace](https://huggingface.co/datasets/rootsautomation/ScreenSpot) |
| ScreenSpot v2 | BBox + click accuracy | Web | [HuggingFace](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2) |
| ScreenSpot-Pro | BBox + click accuracy | Web | [HuggingFace](https://huggingface.co/datasets/HongxinLi/ScreenSpot-Pro) |
| Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | [HuggingFace](https://huggingface.co/datasets/visualwebbench/VisualWebBench) |
| WebSRC | Web QA | Web | [HuggingFace](https://huggingface.co/datasets/X-LANCE/WebSRC_v1.0) |
| ScreenQA-short | Mobile QA | Mobile | [HuggingFace](https://huggingface.co/datasets/rootsautomation/RICO-ScreenQA-Short) |
| ScreenQA-complex | Mobile QA | Mobile | [HuggingFace](https://huggingface.co/datasets/rootsautomation/RICO-ScreenQA-Complex) |
| Showdown-Clicks | Click prediction | Web | [HuggingFace](https://huggingface.co/datasets/generalagents/showdown-clicks) |

## Single-Step - Offline Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| Multimodal-Mind2Web | Web navigation | Web | [HuggingFace](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web) |
| AndroidControl | Mobile control | Mobile | [GitHub](https://github.com/google-research/google-research/tree/master/android_control) |

## Multi-Step - Online Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
| --- | --- | --- | --- |
| Mind2Web-Live | URL matching | Web | [HuggingFace](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) |
| GAIA | Exact match | Web | [HuggingFace](https://huggingface.co/datasets/gaia-benchmark/GAIA) |
| BrowseComp | LLM judge | Web | [Link](https://openai.com/index/browsecomp/) |
| AndroidWorld | Task-specific | Mobile | [GitHub](https://github.com/google-research/android_world) |
| MobileMiniWob | Task-specific | Mobile | Included in AndroidWorld [GitHub](https://github.com/Farama-Foundation/miniwob-plusplus) |
| OSWorld | Task-specific | Desktop | [GitHub](https://github.com/xlang-ai/OSWorld) |

![screensuite_benchmark_scores_filtered](https://github.com/user-attachments/assets/2081bede-d5f2-44a3-8605-645b1224d6e5)
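Every benchmark listed above is exposed through the `screensuite` registry. As a quick sanity check before running anything, you can enumerate the registered benchmarks; this is a minimal sketch that relies only on the `get_registry()`, `list_all()`, `name`, and `tags` helpers used in the example script further below, so exact attributes may differ in your installed version.

```python
# Minimal sketch (assumes the registry helpers used in the example script below):
# print the name and tags of every registered benchmark.
from screensuite import get_registry

registry = get_registry()
for benchmark in registry.list_all():
    print(benchmark.name, getattr(benchmark, "tags", None))
```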
## Cloning the Repository

Make sure to clone the repository with the required submodules:

```bash
git clone --recurse-submodules git@github.com:huggingface/screensuite.git
```

or, if you have already cloned the repository (also run this after pulling branches, to update the submodules):

```bash
git submodule update --init --recursive
```

## Requirements

- Docker
- Python >= 3.11
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

> For multi-step agent benchmarks, we need to spawn containerized environments. To do so, you need KVM virtualization enabled.
> To check if your hosting platform supports KVM, run
>
> ```bash
> egrep -c '(vmx|svm)' /proc/cpuinfo
> ```
>
> on Linux. If the return value is greater than zero, the processor should be able to support KVM.
>
> Note: macOS hosts generally do not support KVM.

## Installation

```bash
# Using uv (faster)
uv sync --extra submodules --python 3.11
```

> If you encounter issues with the `evdev` Python package, you can try installing the build-essential package:
>
> ```bash
> sudo apt-get install build-essential
> ```

## Development

```bash
# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure
```

## Running the benchmarks

```python
#!/usr/bin/env python
import json
import os
from datetime import datetime

from dotenv import load_dotenv
from smolagents.models import InferenceClientModel

from screensuite import (
    EvaluationConfig,
    ImageResizeConfig,
    OSWorldEnvironmentConfig,
    get_registry,
)

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)


def run_benchmarks():
    # Get benchmarks to run
    registry = get_registry()
    # benchmarks = registry.list_all()
    benchmarks = registry.get(
        [
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )
    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )
    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # See the smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 1
    parallel_workers = 1
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Configure based on benchmark type
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test,
            # Resizes the images given to the model - defaults to the Qwen2.5-VL resize values
            image_resize_config=ImageResizeConfig(),
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")
        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue


if __name__ == "__main__":
    run_benchmarks()
```
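Each run appends one JSON object per benchmark to `results/results_<run_name>.jsonl`, using the `benchmark_name` and `metrics` keys written by the script above. Below is a minimal sketch for reading such a file back; the file name is a hypothetical example and should be adjusted to the `run_name` you actually used.

```python
# Minimal sketch: summarize a results file written by run_benchmarks() above.
# The JSONL schema ("benchmark_name", "metrics") matches what that script writes.
import json

results_path = "results/results_test_2025-06-13.jsonl"  # hypothetical run_name; adjust to yours
with open(results_path) as f:
    for line in f:
        entry = json.loads(line)
        print(f"{entry['benchmark_name']}: {entry['metrics']}")
```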
"osworld" in benchmark.tags else None, ) print(f"Results: {results._metrics}") # Save results with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f: entry = {"benchmark_name": benchmark.name, "metrics": results._metrics} f.write(json.dumps(entry) + "\n") except Exception as e: print(f"Error in {benchmark.name}: {e}") continue if __name__ == "__main__": run_benchmarks() ``` ## OSWorld Google Tasks To run OSWorld Google tasks, you need to create a Google account and a Google Cloud project. See [OSWorld documentation](https://github.com/xlang-ai/OSWorld/blob/main/ACCOUNT_GUIDELINE.md) for more details. ## License This project is licensed under the terms of the Apache License 2.0. ## Cite ScreenSuite If you use ScreenSuite in your publication, please cite it by using the following BibTeX entry. ```bibtex @Misc{screensuite, title = {ScreenSuite: the most comprehensive benchmarking suite for GUI agents}, author = {Aymeric Roucher and Amir Mahla}, howpublished = {\url{https://github.com/huggingface/screensuite}}, year = {2025} } ```