# UI-Ins

**Repository Path**: mirrors_alibaba/UI-Ins

## Basic Information

- **Project Name**: UI-Ins
- **Description**: Official implementation of UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-30
- **Last Updated**: 2026-04-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Welcome to Tongyi-MAI!

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

📑 Paper | 🤗 UI-Ins-7B | 🤗 UI-Ins-32B | 🤖 UI-Ins-7B | 🤖 UI-Ins-32B

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a **23.3%** flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a **76%** relative performance improvement. In this paper, we introduce the **Instruction-as-Reasoning paradigm**, which treats instructions as dynamic analytical pathways that offer distinct perspectives and enables the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring **87.3%** on UI-I2E-Bench, **57.0%** on ScreenSpot-Pro, and **84.9%** on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a **74.1%** success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework.

## 📰 News

- [Oct 24, 2025] Released the UI-Ins models, data processing code, training code, and evaluation code.
## 🔧 Setup

- Set up the SFT environment by following the instructions [here](https://github.com/hiyouga/LLaMA-Factory)
- Set up the RL environment by following the instructions [here](https://github.com/volcengine/verl?tab=readme-ov-file)

## Data Processing Pipeline

We provide a high-quality data processing pipeline; details are [here](https://github.com/alibaba/UI-Ins/data_pipeline).

## Training

We provide the [SFT](https://github.com/alibaba/UI-Ins/SFT) and [RL](https://github.com/alibaba/UI-Ins/RL) code for UI-Ins.

## Evaluation

We provide the evaluation code; details are [here](https://github.com/alibaba/UI-Ins/evaluation).

## ⚡️ Quick Use

You can run inference with UI-Ins using the following script:

```python
import re

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# The base Qwen2.5-VL model ID is shown here; point MODEL_PATH at a UI-Ins
# checkpoint (local directory or hub ID) to run UI-Ins itself.
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
IMAGE_PATH = "path/to/your/image.jpg"
INSTRUCTION = "Click the 'Search' button"


def parse_coordinates(raw_string: str) -> tuple[int, int]:
    """Return the first "[x, y]" pair found in the response, or (-1, -1)."""
    matches = re.findall(r'\[(\d+),\s*(\d+)\]', raw_string)
    if matches:
        x, y = matches[0]
        return int(x), int(y)
    return -1, -1


print("Loading model...")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(MODEL_PATH)

image = Image.open(IMAGE_PATH).convert("RGB")
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant."
            },
            {
                "type": "text",
                "text": """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.\n\n## Output Format\nReturn a json object with a reasoning process in  tags, a function name and arguments within  XML tags:\n```\n\n...\n\n\n{"name": "grounding", "arguments": }\n\n```\n represents the following item of the action space:\n## Action Space{"action": "click", "coordinate": [x, y]}\nYour task is to accurately locate a UI element based on the instruction. You should first analyze instruction in  tags and finally output the function in  tags.\n"""
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": INSTRUCTION}
        ]
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

print("Running inference...")
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens (drop the prompt tokens).
response_ids = generated_ids[0, len(inputs["input_ids"][0]):]
raw_response = processor.decode(response_ids, skip_special_tokens=True)

point_x, point_y = parse_coordinates(raw_response)

print("\n" + "="*20 + " RESULT " + "="*20)
print(f"Instruction: {INSTRUCTION}")
print(f"Raw Response: {raw_response}")

if point_x != -1:
    # pixel_values holds flattened patches, so the resized image size is
    # recovered from the patch grid (temporal, height, width) instead.
    _, grid_h, grid_w = inputs["image_grid_thw"][0].tolist()
    patch_size = processor.image_processor.patch_size
    resized_height = grid_h * patch_size
    resized_width = grid_w * patch_size
    norm_x = point_x / resized_width
    norm_y = point_y / resized_height

    print(f"✅ Parsed Point (on resized image): ({point_x}, {point_y})")
    print(f"✅ Normalized Point (0.0 to 1.0): ({norm_x:.4f}, {norm_y:.4f})")
else:
    print("❌ Could not parse coordinates from the response.")
print("="*48)
```
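The script reports the predicted point on the resized image and in normalized form. To act on the original screenshot, the point usually has to be scaled back to that image's pixel grid. The following is a minimal sketch of that step, assuming the parsed coordinates are relative to the processor-resized image, as the script's output label indicates. The helper `to_original_pixels` and its parameter names are hypothetical and are not part of the UI-Ins code.

```python
def to_original_pixels(
    point: tuple[int, int],
    resized_size: tuple[int, int],
    original_size: tuple[int, int],
) -> tuple[int, int]:
    """Map an (x, y) point on the resized image to the original image.

    Both sizes are given as (width, height), matching PIL's Image.size.
    This helper is illustrative and not part of the UI-Ins repository.
    """
    x, y = point
    resized_w, resized_h = resized_size
    orig_w, orig_h = original_size
    if resized_w <= 0 or resized_h <= 0:
        raise ValueError("resized_size must contain positive dimensions")

    # Normalize to [0, 1] on the resized image, then scale to the original.
    px = round(x / resized_w * orig_w)
    py = round(y / resized_h * orig_h)

    # Clamp so a point on the far edge still indexes a valid pixel.
    px = min(max(px, 0), orig_w - 1)
    py = min(max(py, 0), orig_h - 1)
    return px, py


if __name__ == "__main__":
    # A point at the center of a 1288x728 resized image maps to the
    # center of a 2576x1456 screenshot.
    print(to_original_pixels((644, 364), (1288, 728), (2576, 1456)))  # (1288, 728)
```

With the variables from the script, this could be called as `to_original_pixels((point_x, point_y), (resized_width, resized_height), image.size)`, since `Image.size` returns `(width, height)`.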

## 📮 Contact

Feel free to contact `liangyuchen@ruc.edu.cn` if you have any questions.

## License

This repo follows the CC-BY-NC-SA 4.0 license. Please use this repo for non-commercial purposes ONLY.

## Citation

If you use this repository or find it helpful in your research, please cite it as follows:

```bibtex
@article{chen2025ui,
  title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
  author={Chen, Liangyu and Zhou, Hanzhang and Cai, Chenglin and Zhang, Jianan and Tong, Panrong and Kong, Quyu and Zhang, Xu and Liu, Chen and Liu, Yuqi and Wang, Wenxuan and others},
  journal={arXiv preprint arXiv:2510.20286},
  year={2025}
}
```