Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that AGUVIS achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models."
Researcher Affiliation | Collaboration | "1 University of Hong Kong, 2 Salesforce Research. Correspondence to: Yiheng Xu <EMAIL>, Tao Yu <EMAIL>, Caiming Xiong <EMAIL>."
Pseudocode | No | The paper describes methods and training paradigms but does not present structured pseudocode or algorithm blocks for the AGUVIS framework itself. Appendices A.1 and A.2 list action-space definitions, which are code snippets rather than pseudocode for the algorithm.
Open Source Code | Yes | "We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research."
Open Datasets | Yes | "We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research. ... We construct AGUVIS DATA COLLECTION, a large-scale dataset with multimodal grounding and reasoning annotations. ... Our dataset consists of two splits: a grounding split focusing on element localization and interaction (Table 10), and a planning & reasoning split capturing multi-step task completion (Table 11)."
Dataset Splits | Yes | "Our dataset consists of two splits: a grounding split focusing on element localization and interaction (Table 10), and a planning & reasoning split capturing multi-step task completion (Table 11). ... We present the detailed statistical information of all training datasets utilized in both the grounding and planning & reasoning stages. The statistics are shown in Table 10 and Table 11, respectively."
Hardware Specification | Yes | "We train AGUVIS on a cluster of H100-80G GPUs: AGUVIS-7B uses 8 nodes and completes the grounding training within 5 hours and planning & reasoning training within 1 hour. AGUVIS-72B uses 16 nodes and completes the grounding training within 30 hours and planning & reasoning training within 6 hours."
Software Dependencies | No | The paper names its software stack but specifies no version numbers: "Our codebase is based on PyTorch (Paszke et al., 2019) and Huggingface Transformers (Wolf et al., 2019). During training, we utilize the strategies of DeepSpeed optimization (Rajbhandari et al., 2020), BF16 format and gradient checkpointing to save GPU memory."
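The memory-saving strategies the paper names (DeepSpeed, BF16, gradient checkpointing) are typically wired together through a DeepSpeed configuration. A minimal sketch of such a config, expressed as a Python dict; the ZeRO stage and the "auto" batch fields are assumptions for illustration, not values stated in the paper:

```python
# Hypothetical DeepSpeed configuration illustrating the memory-saving
# strategies the paper names. The ZeRO stage and batch-size fields are
# assumptions; the paper does not specify them.
deepspeed_config = {
    "bf16": {"enabled": True},          # BF16 mixed-precision training
    "zero_optimization": {"stage": 2},  # ZeRO partitioning (stage assumed)
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Gradient checkpointing is enabled on the model side, e.g. via
# model.gradient_checkpointing_enable() in Hugging Face Transformers.
```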
Experiment Setup | Yes | "For AGUVIS based on the Qwen2-VL backbone, we set the maximum pixels for each image to 1280×720. ... The maximum sequence length of tokens is set to 8192 for all models. We use the Adam optimizer (Loshchilov & Hutter, 2019) for both the grounding and planning & reasoning training stages and employ a cosine learning rate scheduler with a warm-up ratio of 3% of steps. We train AGUVIS with a batch size of 128 for 1 epoch in each stage. The peak learning rate is set to 1e-5 for AGUVIS-7B and 5e-6 for AGUVIS-72B."
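The quoted schedule (cosine decay with a 3% linear warm-up to a peak learning rate of 1e-5 or 5e-6) can be sketched as a standalone function. This is a generic cosine-with-warmup implementation under the stated hyperparameters; the linear warm-up shape and the decay-to-zero floor are common defaults assumed here, not details confirmed by the paper:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = 0.03) -> float:
    """Cosine learning-rate schedule with linear warm-up over the first
    `warmup_ratio` fraction of steps. Warm-up shape and a zero decay
    floor are assumptions; the peak LR and 3% ratio come from the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps and the AGUVIS-7B peak of 1e-5, the rate ramps to 1e-5 over the first 30 steps and then decays along a cosine curve.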