UI-TARS: ByteDance's GUI operation software


We also offer a UI-TARS-desktop version, which can operate on your local personal device. To use it, please visit https://github.com/bytedance/UI-TARS-desktop. To use UI-TARS in web automation, you may refer to the open-source project Midscene.js.

⚠️ Important Announcement: GGUF Model Performance

The GGUF model has undergone quantization, but unfortunately, its performance cannot be guaranteed. As a result, we have decided to downgrade it.

💡 Alternative Solution:
You can use Cloud Deployment or, if you have enough GPU resources, Local Deployment [vLLM] instead.

We appreciate your understanding and patience as we work to ensure the best possible experience.

Updates

🚀 01.25: We updated the Cloud Deployment section of the Chinese deployment guide (GUI模型部署教程) with information on the ModelScope platform, which you can now use for deployment.

Overview

UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components (perception, reasoning, grounding, and memory) within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

Core Features

Perception

Comprehensive GUI Understanding: Processes multimodal inputs (text, images, interactions) to build a coherent understanding of interfaces.
Real-Time Interaction: Continuously monitors dynamic GUIs and responds accurately to changes in real-time.

Action

Unified Action Space: Standardized action definitions across platforms (desktop, mobile, and web).
Platform-Specific Actions: Supports additional actions like hotkeys, long press, and platform-specific gestures.

Reasoning

System 1 & System 2 Reasoning: Combines fast, intuitive responses with deliberate, high-level planning for complex tasks.
Task Decomposition & Reflection: Supports multi-step planning, reflection, and error correction for robust task execution.

Memory

Short-Term Memory: Captures task-specific context for situational awareness.
Long-Term Memory: Retains historical interactions and knowledge for improved decision-making.

Capabilities

Cross-Platform Interaction: Supports desktop, mobile, and web environments with a unified action framework.
Multi-Step Task Execution: Trained to handle complex tasks through multi-step trajectories and reasoning.
Learning from Synthetic and Real Data: Combines large-scale annotated and synthetic datasets for improved generalization and robustness.

Performance

Perception Capability Evaluation

Model	VisualWebBench	WebSRC	SQAshort
Qwen2-VL-7B	73.3	81.8	84.9
Qwen-VL-Max	74.1	91.1	78.6
Gemini-1.5-Pro	75.4	88.9	82.2
UIX-Qwen2-7B	75.9	82.9	78.8
Claude-3.5-Sonnet	78.2	90.4	83.1
GPT-4o	78.5	87.7	82.3
UI-TARS-2B	72.9	89.2	86.4
UI-TARS-7B	79.7	93.6	87.7
UI-TARS-72B	82.8	89.3	88.6

Grounding Capability Evaluation

ScreenSpot Pro

Agent Model	Dev-Text	Dev-Icon	Dev-Avg	Creative-Text	Creative-Icon	Creative-Avg	CAD-Text	CAD-Icon	CAD-Avg	Scientific-Text	Scientific-Icon	Scientific-Avg	Office-Text	Office-Icon	Office-Avg	OS-Text	OS-Icon	OS-Avg	Avg-Text	Avg-Icon	Avg
QwenVL-7B	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.7	0.0	0.4	0.0	0.0	0.0	0.0	0.0	0.0	0.1	0.0	0.1
GPT-4o	1.3	0.0	0.7	1.0	0.0	0.6	2.0	0.0	1.5	2.1	0.0	1.2	1.1	0.0	0.9	0.0	0.0	0.0	1.3	0.0	0.8
SeeClick	0.6	0.0	0.3	1.0	0.0	0.6	2.5	0.0	1.9	3.5	0.0	2.0	1.1	0.0	0.9	2.8	0.0	1.5	1.8	0.0	1.1
Qwen2-VL-7B	2.6	0.0	1.3	1.5	0.0	0.9	0.5	0.0	0.4	6.3	0.0	3.5	3.4	1.9	3.0	0.9	0.0	0.5	2.5	0.2	1.6
OS-Atlas-4B	7.1	0.0	3.7	3.0	1.4	2.3	2.0	0.0	1.5	9.0	5.5	7.5	5.1	3.8	4.8	5.6	0.0	3.1	5.0	1.7	3.7
ShowUI-2B	16.9	1.4	9.4	9.1	0.0	5.3	2.5	0.0	1.9	13.2	7.3	10.6	15.3	7.5	13.5	10.3	2.2	6.6	10.8	2.6	7.7
CogAgent-18B	14.9	0.7	8.0	9.6	0.0	5.6	7.1	3.1	6.1	22.2	1.8	13.4	13.0	0.0	10.0	5.6	0.0	3.1	12.0	0.8	7.7
Aria-UI	16.2	0.0	8.4	23.7	2.1	14.7	7.6	1.6	6.1	27.1	6.4	18.1	20.3	1.9	16.1	4.7	0.0	2.6	17.1	2.0	11.3
UGround-7B	26.6	2.1	14.7	27.3	2.8	17.0	14.2	1.6	11.1	31.9	2.7	19.3	31.6	11.3	27.0	17.8	0.0	9.7	25.0	2.8	16.5
Claude Computer Use	22.0	3.9	12.6	25.9	3.4	16.8	14.5	3.7	11.9	33.9	15.8	25.8	30.1	16.3	26.9	11.0	4.5	8.1	23.4	7.1	17.1
OS-Atlas-7B	33.1	1.4	17.7	28.8	2.8	17.9	12.2	4.7	10.3	37.5	7.3	24.4	33.9	5.7	27.4	27.1	4.5	16.8	28.1	4.0	18.9
UGround-V1-7B	-	-	35.5	-	-	27.8	-	-	13.5	-	-	38.8	-	-	48.8	-	-	26.1	-	-	31.1
UI-TARS-2B	47.4	4.1	26.4	42.9	6.3	27.6	17.8	4.7	14.6	56.9	17.3	39.8	50.3	17.0	42.6	21.5	5.6	14.3	39.6	8.4	27.7
UI-TARS-7B	58.4	12.4	36.1	50.0	9.1	32.8	20.8	9.4	18.0	63.9	31.8	50.0	63.3	20.8	53.5	30.8	16.9	24.5	47.8	16.2	35.7
UI-TARS-72B	63.0	17.3	40.8	57.1	15.4	39.6	18.8	12.5	17.2	64.6	20.9	45.7	63.3	26.4	54.8	42.1	15.7	30.1	50.9	17.5	38.1

ScreenSpot

Method	Mobile-Text	Mobile-Icon/Widget	Desktop-Text	Desktop-Icon/Widget	Web-Text	Web-Icon/Widget	Avg
Agent Framework
GPT-4 (SeeClick)	76.6	55.5	68.0	28.6	40.9	23.3	48.8
GPT-4 (OmniParser)	93.9	57.0	91.3	63.6	81.3	51.0	73.0
GPT-4 (UGround-7B)	90.1	70.3	87.1	55.7	85.7	64.6	75.6
GPT-4o (SeeClick)	81.0	59.8	69.6	33.6	43.9	26.2	52.3
GPT-4o (UGround-7B)	93.4	76.9	92.8	67.9	88.7	68.9	81.4
Agent Model
GPT-4	22.6	24.5	20.2	11.8	9.2	8.8	16.2
GPT-4o	20.2	24.9	21.1	23.6	12.2	7.8	18.3
CogAgent	67.0	24.0	74.2	20.0	70.4	28.6	47.4
SeeClick	78.0	52.0	72.2	30.0	55.7	32.5	53.4
Qwen2-VL	75.5	60.7	76.3	54.3	35.2	25.7	55.3
UGround-7B	82.8	60.3	82.5	63.6	80.4	70.4	73.3
Aguvis-G-7B	88.3	78.2	88.1	70.7	85.7	74.8	81.8
OS-Atlas-7B	93.0	72.9	91.8	62.9	90.9	74.3	82.5
Claude Computer Use	-	-	-	-	-	-	83.0
Gemini 2.0 (Project Mariner)	-	-	-	-	-	-	84.0
Aguvis-7B	95.6	77.7	93.8	67.1	88.3	75.2	84.4
Aguvis-72B	94.5	85.2	95.4	77.9	91.3	85.9	89.2
Our Model
UI-TARS-2B	93.0	75.5	90.7	68.6	84.3	74.8	82.3
UI-TARS-7B	94.5	85.2	95.9	85.7	90.0	83.5	89.5
UI-TARS-72B	94.9	82.5	89.7	88.6	88.7	85.0	88.4

ScreenSpot v2

Method	Mobile-Text	Mobile-Icon/Widget	Desktop-Text	Desktop-Icon/Widget	Web-Text	Web-Icon/Widget	Avg
Agent Framework
GPT-4o (SeeClick)	85.2	58.8	79.9	37.1	72.7	30.1	63.6
GPT-4o (OS-Atlas-4B)	95.5	75.8	79.4	49.3	90.2	66.5	79.1
GPT-4o (OS-Atlas-7B)	96.2	83.4	89.7	69.3	94.0	79.8	87.1
Agent Model
SeeClick	78.4	50.7	70.1	29.3	55.2	32.5	55.1
OS-Atlas-4B	87.2	59.7	72.7	46.4	85.9	63.1	71.9
OS-Atlas-7B	95.2	75.8	90.7	63.6	90.6	77.3	84.1
Our Model
UI-TARS-2B	95.2	79.1	90.7	68.6	87.2	78.3	84.7
UI-TARS-7B	96.9	89.1	95.4	85.0	93.6	85.2	91.6
UI-TARS-72B	94.8	86.3	91.2	87.9	91.5	87.7	90.3

Offline Agent Capability Evaluation

Multimodal Mind2Web

Method	Cross-Task Ele.Acc	Cross-Task Op.F1	Cross-Task Step SR	Cross-Website Ele.Acc	Cross-Website Op.F1	Cross-Website Step SR	Cross-Domain Ele.Acc	Cross-Domain Op.F1	Cross-Domain Step SR
Agent Framework
GPT-4o (SeeClick)	32.1	-	-	33.1	-	-	33.5	-	-
GPT-4o (UGround)	47.7	-	-	46.0	-	-	46.6	-	-
GPT-4o (Aria-UI)	57.6	-	-	57.7	-	-	61.4	-	-
GPT-4V (OmniParser)	42.4	87.6	39.4	41.0	84.8	36.5	45.5	85.7	42.0
Agent Model
GPT-4o	5.7	77.2	4.3	5.7	79.0	3.9	5.5	86.4	4.5
GPT-4 (SOM)	29.6	-	20.3	20.1	-	13.9	27.0	-	23.7
GPT-3.5 (Text-only)	19.4	59.2	16.8	14.9	56.5	14.1	25.2	57.9	24.1
GPT-4 (Text-only)	40.8	63.1	32.3	30.2	61.0	27.0	35.4	61.9	29.7
Claude	62.7	84.7	53.5	59.5	79.6	47.7	64.5	85.4	56.4
Aguvis-7B	64.2	89.8	60.4	60.7	88.1	54.6	60.4	89.2	56.6
CogAgent	-	-	62.3	-	-	54.0	-	-	59.4
Aguvis-72B	69.5	90.8	64.0	62.6	88.6	56.5	63.5	88.5	58.2
Our Model
UI-TARS-2B	62.3	90.0	56.3	58.5	87.2	50.8	58.8	89.6	52.3
UI-TARS-7B	73.1	92.2	67.1	68.2	90.9	61.7	66.6	90.9	60.5
UI-TARS-72B	74.7	92.5	68.6	72.4	91.2	63.5	68.9	91.8	62.1

Android Control and GUI Odyssey

Agent Models	AndroidControl-Low Type	AndroidControl-Low Grounding	AndroidControl-Low SR	AndroidControl-High Type	AndroidControl-High Grounding	AndroidControl-High SR	GUIOdyssey Type	GUIOdyssey Grounding	GUIOdyssey SR
Claude	74.3	0.0	19.4	63.7	0.0	12.5	60.9	0.0	3.1
GPT-4o	74.3	0.0	19.4	66.3	0.0	20.8	34.3	0.0	3.3
SeeClick	93.0	73.4	75.0	82.9	62.9	59.1	71.0	52.4	53.9
InternVL-2-4B	90.9	84.1	80.1	84.1	72.7	66.7	82.1	55.5	51.5
Qwen2-VL-7B	91.9	86.5	82.6	83.8	77.7	69.7	83.5	65.9	60.2
Aria-UI	--	87.7	67.3	--	43.2	10.2	--	86.8	36.5
OS-Atlas-4B	91.9	83.8	80.6	84.7	73.8	67.5	83.5	61.4	56.4
OS-Atlas-7B	93.6	88.0	85.2	85.2	78.5	71.2	84.5	67.8	62.0
Aguvis-7B	--	--	80.5	--	--	61.5	--	--	--
Aguvis-72B	--	--	84.4	--	--	66.4	--	--	--
UI-TARS-2B	98.1	87.3	89.3	81.2	78.4	68.9	93.9	86.8	83.4
UI-TARS-7B	98.0	89.3	90.8	83.7	80.5	72.5	94.6	90.1	87.0
UI-TARS-72B	98.1	89.9	91.3	85.2	81.5	74.7	95.4	91.4	88.6

Online Agent Capability Evaluation

Method	OSWorld (Online)	AndroidWorld (Online)
Agent Framework
GPT-4o (UGround)	-	32.8
GPT-4o (Aria-UI)	15.2	44.8
GPT-4o (Aguvis-7B)	14.8	37.1
GPT-4o (Aguvis-72B)	17.0	-
GPT-4o (OS-Atlas-7B)	14.6	-
Agent Model
GPT-4o	5.0	34.5 (SoM)
Gemini-Pro-1.5	5.4	22.8 (SoM)
Aguvis-72B	10.3	26.1
Claude Computer-Use	14.9 (15 steps)	27.9
Claude Computer-Use	22.0 (50 steps)	-
Our Model
UI-TARS-7B-SFT	17.7 (15 steps)	33.0
UI-TARS-7B-DPO	18.7 (15 steps)	-
UI-TARS-72B-SFT	18.8 (15 steps)	46.6
UI-TARS-72B-DPO	22.7 (15 steps)	-
UI-TARS-72B-DPO	24.6 (50 steps)	-

Deployment

Cloud Deployment

We recommend using HuggingFace Inference Endpoints for fast deployment. We provide two docs for reference:

English version: GUI Model Deployment Guide

Chinese version: GUI模型部署教程 (GUI Model Deployment Tutorial)

Local Deployment [Transformers]

Local deployment with Transformers follows the same procedure as Qwen2-VL. Check this tutorial for more details.

Local Deployment [vLLM]

We recommend using vLLM for fast deployment and inference. You need to use vllm>=0.6.1.

pip install -U transformers
VLLM_VERSION=0.6.6
CUDA_VERSION=cu124
pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}

Download the Model

We provide three model sizes on Hugging Face: 2B, 7B, and 72B. For the best performance, we recommend the 7B-DPO or 72B-DPO model, depending on your GPU configuration.

Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service. It is recommended to set the tensor parallel size -tp=1 for 7B models and -tp=4 for 72B models.

python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars \
    --model <path to your model> --limit-mm-per-prompt image=5 -tp <tp>
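
If the server is launched from a Python script rather than a shell, the same command can be assembled as an argument list. This is only a sketch: the helper name `build_vllm_command` and the example path `/path/to/UI-TARS-7B-DPO` are hypothetical, and the flags are copied from the command above.

```python
import sys
from typing import List


def build_vllm_command(
    model_path: str,
    tensor_parallel_size: int,
    served_name: str = "ui-tars",
    max_images: int = 5,
) -> List[str]:
    """Assemble the vLLM OpenAI-compatible server command as an argument list."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--served-model-name", served_name,
        "--model", model_path,
        "--limit-mm-per-prompt", f"image={max_images}",
        "-tp", str(tensor_parallel_size),
    ]


# Hypothetical local path; -tp 1 is the setting recommended above for 7B models.
cmd = build_vllm_command("/path/to/UI-TARS-7B-DPO", tensor_parallel_size=1)
print(" ".join(cmd))
# To actually start the server: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the model path contains spaces.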

You can then call the chat API as shown below, using a GUI prompt (choose the mobile or computer template) and a base64-encoded local image; see the OpenAI API protocol documentation for details. The same service can also be used from UI-TARS-desktop:

import base64
from openai import OpenAI


instruction = "search for today's weather"
screenshot_path = "screenshot.png"
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="empty",
)

## Below is the prompt for computer
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='') #If you want to submit your input, use "\n" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished()
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.


## Note
- Use Chinese in `Thought` part.
- Summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
response = client.chat.completions.create(
    model="ui-tars",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
            ],
        },
    ],
    frequency_penalty=1,
    max_tokens=128,
)
print(response.choices[0].message.content)
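
The printed response is expected to follow the Thought/Action format requested in the prompt. The sketch below shows one way to split such a response into its parts. The function names are illustrative and not part of UI-TARS, and the parser assumes a single action per response, written as `name(key='value', ...)` as in the Action Space listing; real outputs may deviate from this.

```python
import re
from typing import Dict, Tuple


def parse_response(text: str) -> Tuple[str, str]:
    """Split a model response into its Thought and Action parts."""
    # The prompt shows the output wrapped in backticks, so strip them if present.
    text = text.strip().strip("`").strip()
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not follow the Thought/Action format")
    return match.group(1).strip(), match.group(2).strip()


def parse_action(action: str) -> Tuple[str, Dict[str, str]]:
    """Split an action such as click(start_box='...') into a name and its arguments."""
    match = re.fullmatch(r"(\w+)\((.*)\)", action.strip(), re.DOTALL)
    if match is None:
        raise ValueError(f"unrecognized action: {action!r}")
    name, arg_text = match.group(1), match.group(2)
    # A value ends at a quote that is followed by the next key or by the end of
    # the argument list, so apostrophes inside typed content are tolerated.
    pairs = re.findall(r"(\w+)='(.*?)'(?=\s*,\s*\w+=|\s*$)", arg_text, re.DOTALL)
    return name, dict(pairs)


sample = (
    "Thought: Click the search box.\n"
    "Action: click(start_box='<|box_start|>(235,512)<|box_end|>')"
)
thought, action = parse_response(sample)
print(thought)
print(parse_action(action))
```
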

For a single-step grounding task, or for inference on a grounding dataset such as SeeClick, refer to the following script:

import base64
from openai import OpenAI


instruction = "search for today's weather"
screenshot_path = "screenshot.png"
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="empty",
)

## Below is the prompt for single-step grounding
prompt = r"""Output only the coordinate of one point in your response. What element matches the following task: """

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
response = client.chat.completions.create(
    model="ui-tars",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
                {"type": "text", "text": prompt + instruction}
            ],
        },
    ],
    frequency_penalty=1,
    max_tokens=128,
)
print(response.choices[0].message.content)
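
When running this script over a grounding dataset, each response has to be turned into a point and compared with the target element. The sketch below rests on two assumptions that are not stated in this document: that the response contains the point as an `(x,y)` integer pair, and that a prediction counts as correct when the point falls inside the target's bounding box, expressed in the same coordinate space as the point. The names `extract_point` and `is_hit` are illustrative.

```python
import re
from typing import Optional, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def extract_point(response: str) -> Optional[Point]:
    """Return the first "(x,y)" integer pair found in the response, or None."""
    match = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", response)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_hit(point: Optional[Point], box: Box) -> bool:
    """Check whether the predicted point lies inside the target box (edges included)."""
    if point is None:
        return False
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


# Illustrative values only: a predicted point and a made-up target box.
print(is_hit(extract_point("(235,512)"), (200, 480, 300, 540)))  # True
```

Treating an unparseable response as a miss keeps the accuracy computation simple; a stricter pipeline might log such cases separately.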

Prompt Templates

We currently provide two prompt templates for stable operation and performance: one for mobile scenarios and one for personal computer scenarios.

Prompt template for mobile:

## Below is the prompt for mobile
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.

## Note
- Use English in `Thought` part.

- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

Prompt template for computer:

## Below is the prompt for computer
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='') #If you want to submit your input, use "\n" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished()
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.


## Note
- Use Chinese in `Thought` part.
- Summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

Local Deployment [Ollama]

Ollama support is coming soon. Thank you for your patience 😊

Explanation of Inference Results

Coordinate Mapping

The model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0,1]. The absolute coordinates required by the Action can be calculated by:

X absolute = X relative × image width
Y absolute = Y relative × image height

For example, given a screen size of 1920 × 1080 and a model output of (235, 512), X absolute is round(1920*235/1000)=451 and Y absolute is round(1080*512/1000)=553, so the absolute coordinate is (451, 553).
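
The conversion can be written as a small Python helper. This is a minimal sketch: the function name `to_absolute` is illustrative and not part of the UI-TARS codebase, and it assumes the model output uses the 0 to 1000 scale described above.

```python
from typing import Tuple


def to_absolute(
    x_rel: int, y_rel: int, width: int, height: int, scale: int = 1000
) -> Tuple[int, int]:
    """Map model coordinates on a 0..scale grid to absolute pixel coordinates."""
    x_abs = round(width * x_rel / scale)
    y_abs = round(height * y_rel / scale)
    return x_abs, y_abs


# Worked example from the text: a 1920 x 1080 screen and a model output of (235, 512).
print(to_absolute(235, 512, 1920, 1080))  # (451, 553)
```

The width and height should be those of the screenshot that was actually sent to the model, so that the result lines up with the screen the action is executed on.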

Use in desktop and web automation

To try the UI-TARS agent on desktop, refer to UI-TARS-desktop. We recommend using the 7B or 72B DPO model on desktop.

Midscene.js is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See this guide for details on setting up the model.

License

UI-TARS is licensed under the Apache License 2.0.

Acknowledgements

This project builds upon and extends the capabilities of Qwen2-VL, a powerful vision-language model, which serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.

Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.

Citation

If you find our paper and code useful in your research, please consider starring the repository and citing our work:

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}
