UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Authors: Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate state-of-the-art GUI agents on our UI-Vision benchmark, which reveals significant gaps across all three tasks. Element Grounding, a core ability for performing actions on GUIs, remains particularly difficult: even the best-performing model, UI-TARS (Qin et al., 2025), achieves only 25.5% accuracy.
Researcher Affiliation Collaboration 1Mila Quebec AI Institute 2Université de Montréal 3ServiceNow Research 4University of Waterloo 5National University of Singapore 6École de Technologie Supérieure 7CIFAR AI Chair 8Polytechnique Montréal.
Pseudocode No The paper describes the data collection and annotation processes in Section 3.1 and the benchmark tasks in Section 3.2, along with more details in Appendix B. However, these descriptions are in narrative text format and do not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are they formatted as structured code-like procedures.
Open Source Code No The paper states in Appendix B.1, 'The processing scripts will be released alongside the code for reproducibility,' which indicates future availability but does not provide immediate, concrete access to the source code for the methodology described in this paper.
Open Datasets Yes We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. ... Built from open-source and permissive data, it ensures accessibility and reproducibility.
Dataset Splits No The paper provides statistics on the total number of query-label pairs for each task (e.g., 'Both basic and functional grounding subtasks have 1,772 query-label pairs'), but it does not specify explicit training, validation, or test splits for the dataset, nor does it refer to predefined standard splits for reproducibility of model training.
Hardware Specification Yes Table 9: Efficiency metrics for the Element Grounding task under the Basic setting. ... # GPUs (H100) ... Table 10: Efficiency metrics for the Layout Grounding task. ... # GPUs (H100) ... Table 11: Efficiency metrics for the Action Prediction task. ... # GPUs (H100)
Software Dependencies No The paper mentions 'Hugging Face implementations via the Transformers library' for latency measurements, but it does not provide specific version numbers for this library or any other key software dependencies required to replicate the experiments.
Experiment Setup No The paper focuses on evaluating state-of-the-art GUI agents on the UI-Vision benchmark, as described in Section 4.1 'Baselines' and Section 4.2 'Evaluation Metrics'. It specifies how these existing models were prompted and evaluated (e.g., 'Each model used its recommended format for prompting'), but it does not provide specific hyperparameters, training configurations, or system-level settings for training a new model.