UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Authors: Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate state-of-the-art GUI agents on our UI-Vision benchmark, which reveals significant gaps across all three tasks. Element Grounding, a core ability for performing actions on GUIs, remains particularly difficult: even the best-performing model, UI-TARS (Qin et al., 2025), achieves only 25.5% accuracy.
Researcher Affiliation Collaboration 1Mila Quebec AI Institute 2Université de Montréal 3ServiceNow Research 4University of Waterloo 5National University of Singapore 6École de Technologie Supérieure 7CIFAR AI Chair 8Polytechnique Montréal.
Pseudocode No The paper describes the data collection and annotation processes in Section 3.1 and the benchmark tasks in Section 3.2, along with more details in Appendix B. However, these descriptions are in narrative text format and do not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are they formatted as structured code-like procedures.
Open Source Code No The paper states in Appendix B.1, 'The processing scripts will be released alongside the code for reproducibility,' which indicates future availability but does not provide immediate, concrete access to the source code for the methodology described in this paper.
Open Datasets Yes We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. ... Built from open-source and permissive data, it ensures accessibility and reproducibility.
Dataset Splits No The paper provides statistics on the total number of query-label pairs for each task (e.g., 'Both basic and functional grounding subtasks have 1,772 query-label pairs'), but it does not specify explicit training, validation, or test splits for the dataset, nor does it refer to predefined standard splits for reproducibility of model training.
Hardware Specification Yes Table 9: Efficiency metrics for the Element Grounding task under the Basic setting. ... # GPUs (H100) ... Table 10: Efficiency metrics for the Layout Grounding task. ... # GPUs (H100) ... Table 11: Efficiency metrics for the Action Prediction task. ... # GPUs (H100)
Software Dependencies No The paper mentions 'Hugging Face implementations via the Transformers library' for latency measurements, but it does not provide specific version numbers for this library or any other key software dependencies required to replicate the experiments.
Experiment Setup No The paper focuses on evaluating state-of-the-art GUI agents on the UI-Vision benchmark, as described in Section 4.1 'Baselines' and Section 4.2 'Evaluation Metrics'. It specifies how these existing models were prompted and evaluated (e.g., 'Each model used its recommended format for prompting'), but it does not provide specific hyperparameters, training configurations, or system-level settings for training a new model.