OS-ATLAS: Foundation Action Model for Generalist GUI Agents

Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models.
Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory, 2 Shanghai Jiao Tong University, 3 The University of Hong Kong, 4 MIT
Pseudocode | No | There are no explicit pseudocode or algorithm blocks labeled in the paper. Table 6 provides a "Unified Action Space Prompt," which describes a structured input format rather than an algorithm.
Open Source Code | No | We will release all data, source code, and model checkpoints to support reproducibility.
Open Datasets | Yes | Leveraging this data toolkit, we curated and open-sourced the largest multi-platform GUI grounding corpus to date, which comprises over 2.3 million distinct screenshots and more than 13 million GUI elements. We also utilize instruction grounding data from two publicly available datasets: Android Control (Li et al., 2024) and Wave-UI (https://huggingface.co/datasets/agentsea/wave-ui).
Dataset Splits | Yes | We only use the test split from these benchmarks for evaluation. We annotated the training sets of four trajectory datasets collected from both web and mobile platforms, namely Mind2Web (Deng et al., 2023b), AMEX (Chai et al., 2024), and AITZ (Zhang et al., 2024d).
Hardware Specification | No | To gain deeper insights into the reasons behind this strong performance, we conducted a series of analyses under the standard setting (without a planner), including those in Section 5.3, using InternVL2-4B due to GPU constraints. Further details regarding the training setups can be found in Appendix E.
Software Dependencies | No | Due to the differences in A11y tree APIs and tools supported by each operating system, we utilize pyatspi to access the A11y tree on Ubuntu, pywinauto on Windows, and Application Services on macOS.
Experiment Setup | Yes | We set the max dynamic patch parameter to 6 to ensure the model captures sufficient pixel information. As a result, the input image, after resizing, is divided into a maximum of 6 tiles of 448×448 pixels, along with a thumbnail of the entire image to capture global context. Through our experiments, we discover that setting the max pixel of image input to 1024×1024 during both training and inference yields excellent results for GUI grounding tasks, while also optimizing the model's training and inference cost.
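The Software Dependencies row names a different accessibility-tree backend per operating system (pyatspi on Ubuntu, pywinauto on Windows, Application Services on macOS). A minimal sketch of that per-platform dispatch, using only the standard library; the `a11y_backend` helper and the `A11Y_BACKENDS` mapping are hypothetical illustrations, not code from the paper:

```python
import platform

# Per-OS accessibility backends as named in the paper's text
# (keys are the values returned by platform.system()).
A11Y_BACKENDS = {
    "Linux": "pyatspi",                 # Ubuntu: AT-SPI bindings
    "Windows": "pywinauto",             # Windows UI Automation wrapper
    "Darwin": "Application Services",   # macOS accessibility framework
}


def a11y_backend(system=None):
    """Return the A11y-tree backend for the given OS name.

    Defaults to the current platform when no name is supplied.
    """
    system = system or platform.system()
    try:
        return A11Y_BACKENDS[system]
    except KeyError:
        raise RuntimeError(f"No A11y backend configured for {system}")
```

The actual accessibility-tree extraction then goes through the selected library's own API, which differs substantially across the three systems.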
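The Experiment Setup row describes InternVL-style dynamic tiling: the resized screenshot is split into at most 6 tiles of 448×448 pixels plus a global thumbnail. A rough sketch of how such a tile grid could be chosen; `pick_tile_grid` and its tie-breaking rule are assumptions for illustration, and InternVL's actual preprocessing may differ in detail:

```python
def pick_tile_grid(width, height, max_tiles=6, tile_size=448):
    """Choose a (cols, rows) grid with cols * rows <= max_tiles whose
    aspect ratio best matches the input image. The image would then be
    resized to (cols * tile_size, rows * tile_size) and cut into tiles,
    with a thumbnail of the whole image appended for global context.
    """
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            # On an aspect-ratio tie, prefer the grid with more tiles
            # (more pixels preserved) -- an assumed tie-break.
            if diff < best_diff or (
                diff == best_diff and cols * rows > best[0] * best[1]
            ):
                best, best_diff = (cols, rows), diff
    return best
```

Under this heuristic a 1920×1080 screenshot (roughly 16:9) maps to a 2×1 grid, i.e. two 448×448 tiles plus the thumbnail, while a square input fills the budget with a 2×2 grid.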