Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that AGUVIS achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models."
Researcher Affiliation | Collaboration | "1 University of Hong Kong, 2 Salesforce Research. Correspondence to: Yiheng Xu <EMAIL>, Tao Yu <EMAIL>, Caiming Xiong <EMAIL>."
Pseudocode | No | The paper describes methods and training paradigms but does not present structured pseudocode or algorithm blocks for the AGUVIS framework itself. Appendices A.1 and A.2 list action-space definitions, which are code snippets rather than pseudocode for the algorithm.
Open Source Code | Yes | "We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research."
Open Datasets | Yes | "We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research. ... We construct AGUVIS DATA COLLECTION, a large-scale dataset with multimodal grounding and reasoning annotations. ... Our dataset consists of two splits: a grounding split focusing on element localization and interaction (Table 10), and a planning & reasoning split capturing multi-step task completion (Table 11)."
Dataset Splits | Yes | "Our dataset consists of two splits: a grounding split focusing on element localization and interaction (Table 10), and a planning & reasoning split capturing multi-step task completion (Table 11). ... We present the detailed statistical information of all training datasets utilized in both the grounding and planning & reasoning stages. The statistics are shown in Table 10 and Table 11, respectively."
Hardware Specification | Yes | "We train AGUVIS on a cluster of H100-80G GPUs: AGUVIS-7B uses 8 nodes and completes the grounding training within 5 hours and planning & reasoning training within 1 hour. AGUVIS-72B uses 16 nodes and completes the grounding training within 30 hours and planning & reasoning training within 6 hours."
Software Dependencies | No | The paper names its software stack but specifies no version numbers: "Our codebase is based on PyTorch (Paszke et al., 2019) and Huggingface Transformers (Wolf et al., 2019). During training, we utilize the strategies of DeepSpeed optimization (Rajbhandari et al., 2020), BF16 format and gradient checkpointing to save GPU memory."
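The memory-saving strategies the paper names (DeepSpeed, BF16, gradient checkpointing) are typically wired together through a DeepSpeed configuration. A minimal sketch of such a config, expressed as a Python dict; the ZeRO stage and the "auto" batch fields are assumptions for illustration, not values stated in the paper:

```python
# Hypothetical DeepSpeed configuration illustrating the memory-saving
# strategies the paper names. The ZeRO stage and batch-size fields are
# assumptions; the paper does not specify them.
deepspeed_config = {
    "bf16": {"enabled": True},          # BF16 mixed-precision training
    "zero_optimization": {"stage": 2},  # ZeRO partitioning (stage assumed)
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Gradient checkpointing is enabled on the model side, e.g. via
# model.gradient_checkpointing_enable() in Hugging Face Transformers.
```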
Experiment Setup | Yes | "For AGUVIS based on the Qwen2-VL backbone, we set the maximum pixels for each image to 1280×720. ... The maximum sequence length of tokens is set to 8192 for all models. We use the Adam optimizer (Loshchilov & Hutter, 2019) for both the grounding and planning & reasoning training stages and employ a cosine learning rate scheduler with a warm-up ratio of 3% of steps. We train AGUVIS with a batch size of 128 for 1 epoch in each stage. The peak learning rate is set to 1e-5 for AGUVIS-7B and 5e-6 for AGUVIS-72B."
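The quoted schedule (cosine decay with a 3% linear warm-up to a peak learning rate of 1e-5 or 5e-6) can be sketched as a standalone function. This is a generic cosine-with-warmup implementation under the stated hyperparameters; the linear warm-up shape and the decay-to-zero floor are common defaults assumed here, not details confirmed by the paper:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = 0.03) -> float:
    """Cosine learning-rate schedule with linear warm-up over the first
    `warmup_ratio` fraction of steps. Warm-up shape and a zero decay
    floor are assumptions; the peak LR and 3% ratio come from the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps and the AGUVIS-7B peak of 1e-5, the rate ramps to 1e-5 over the first 30 steps and then decays along a cosine curve.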