OS-ATLAS: Foundation Action Model for Generalist GUI Agents

Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models.
Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory, 2 Shanghai Jiao Tong University, 3 The University of Hong Kong, 4 MIT
Pseudocode | No | There are no explicit pseudocode or algorithm blocks labeled in the paper. Table 6 provides a "Unified Action Space Prompt," which describes a structured input format rather than an algorithm.
Open Source Code | No | We will release all data, source code, and model checkpoints to support reproducibility.
Open Datasets | Yes | Leveraging this data toolkit, we curated and open-sourced the largest multi-platform GUI grounding corpus to date, which comprises over 2.3 million distinct screenshots and more than 13 million GUI elements. We also utilize instruction grounding data from two publicly available datasets: Android Control (Li et al., 2024) and Wave-UI (https://huggingface.co/datasets/agentsea/wave-ui).
Dataset Splits | Yes | We only use the test split from these benchmarks for evaluation. We annotated the training sets of four trajectory datasets collected from both web and mobile platforms, namely Mind2Web (Deng et al., 2023b), AMEX (Chai et al., 2024), and AITZ (Zhang et al., 2024d).
Hardware Specification | No | To gain deeper insights into the reasons behind this strong performance, we conducted a series of analyses under the standard setting (without a planner), including those in Section 5.3, using InternVL2-4B due to GPU constraints. Further details regarding the training setups can be found in Appendix E.
Software Dependencies | No | Due to the differences in A11y tree APIs and tools supported by each operating system, we utilize pyatspi to access the A11y tree on Ubuntu, pywinauto on Windows, and Application Services on macOS.
Experiment Setup | Yes | We set the max dynamic patch parameter to 6 to ensure the model captures sufficient pixel information. As a result, the input image, after resizing, is divided into a maximum of 6 tiles of 448×448 pixels, along with a thumbnail of the entire image to capture global context. Through our experiments, we discover that setting the max pixel of image input to 1024×1024 during both training and inference yields excellent results for GUI grounding tasks, while also optimizing the model's training and inference cost.
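The Software Dependencies row names a different accessibility-tree backend per operating system (pyatspi on Ubuntu, pywinauto on Windows, Application Services on macOS). A minimal sketch of that per-platform dispatch, using only the standard library; the `a11y_backend` helper and the `A11Y_BACKENDS` mapping are hypothetical illustrations, not code from the paper:

```python
import platform

# Per-OS accessibility backends as named in the paper's text
# (keys are the values returned by platform.system()).
A11Y_BACKENDS = {
    "Linux": "pyatspi",                 # Ubuntu: AT-SPI bindings
    "Windows": "pywinauto",             # Windows UI Automation wrapper
    "Darwin": "Application Services",   # macOS accessibility framework
}


def a11y_backend(system=None):
    """Return the A11y-tree backend for the given OS name.

    Defaults to the current platform when no name is supplied.
    """
    system = system or platform.system()
    try:
        return A11Y_BACKENDS[system]
    except KeyError:
        raise RuntimeError(f"No A11y backend configured for {system}")
```

The actual accessibility-tree extraction then goes through the selected library's own API, which differs substantially across the three systems.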
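The Experiment Setup row describes InternVL-style dynamic tiling: the resized screenshot is split into at most 6 tiles of 448×448 pixels plus a global thumbnail. A rough sketch of how such a tile grid could be chosen; `pick_tile_grid` and its tie-breaking rule are assumptions for illustration, and InternVL's actual preprocessing may differ in detail:

```python
def pick_tile_grid(width, height, max_tiles=6, tile_size=448):
    """Choose a (cols, rows) grid with cols * rows <= max_tiles whose
    aspect ratio best matches the input image. The image would then be
    resized to (cols * tile_size, rows * tile_size) and cut into tiles,
    with a thumbnail of the whole image appended for global context.
    """
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            # On an aspect-ratio tie, prefer the grid with more tiles
            # (more pixels preserved) -- an assumed tie-break.
            if diff < best_diff or (
                diff == best_diff and cols * rows > best[0] * best[1]
            ):
                best, best_diff = (cols, rows), diff
    return best
```

Under this heuristic a 1920×1080 screenshot (roughly 16:9) maps to a 2×1 grid, i.e. two 448×448 tiles plus the thumbnail, while a square input fills the budget with a 2×2 grid.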