AgentStudio: A Toolkit for Building General Virtual Agents
Authors: Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that even state-of-the-art vision-language models (VLMs) like GPT-4o experience significant performance declines in GUI and compositional tasks. For tasks that could be completed using APIs, providing screenshot observations can lead to poorer performance compared to text-only observations, as models might be misled into using GUI actions instead of APIs. Notably, most existing models struggle with professional applications such as image editing software, whereas humans can solve 72.2% of those tasks. Evaluation results from the three datasets reveal that current general-purpose VLMs struggle to accurately predict the exact coordinates of GUI elements in screenshots. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Skywork AI 3ETH Zurich 4National University of Singapore 5Singapore Management University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides examples of Python code for interaction with the environment in the appendix, but these are concrete code snippets, not abstract pseudocode. |
| Open Source Code | Yes | All resources, such as code and datasets, are publicly available at our project page. The benchmark tasks and datasets are also hosted on Hugging Face. |
| Open Datasets | Yes | All resources, such as code and datasets, are publicly available at our project page. The benchmark tasks and datasets are also hosted on Hugging Face. [...] Using the AgentStudio environment and tools, we introduce an online task-completion benchmark and three datasets to evaluate fundamental agent abilities in real-world settings. [...] The three datasets, GroundUI, IDMBench, and CriticBench, focus on UI grounding, labeling actions in videos, and success detection, respectively. |
| Dataset Splits | No | The paper mentions subsets used for evaluation, such as GroundUI-1K consisting of '400, 300, and 300 samples for web, desktop, and mobile devices, respectively,' and refers to collecting data from 'test sets of existing grounding datasets.' However, it does not explicitly provide training, validation, and test dataset splits (e.g., percentages or exact counts for each split) for reproducing the training of models. |
| Hardware Specification | No | The paper states, 'Our implementation uses Docker for a lightweight environment' and that 'results for online benchmark tasks are obtained within an Ubuntu Docker environment,' but it does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions that results are obtained within an 'Ubuntu Docker environment' and provides Python code examples using 'Jupyter Notebook'. However, it does not provide specific version numbers for key software components or libraries (e.g., Python 3.x, specific library versions) that would be needed for reproducible setup. |
| Experiment Setup | Yes | The paper provides specific details regarding the experimental setup, such as: 'For non-GUI tasks, we limit the maximum number of execution steps to 1. For GUI tasks, we limit the maximum number of execution steps to 30. For non-GUI tasks, we limit the maximum execution time to 30 seconds. For GUI tasks, we limit the maximum execution time to 60 seconds. If the model performs repeated actions, i.e., executes the same action consecutively three times, the task is considered a failure.' It also specifies 'greedy decoding, with the temperature set to 0' for result generation. |
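The Experiment Setup row above can be read as a concrete evaluation-loop policy: per-task step caps (1 for non-GUI, 30 for GUI), wall-clock limits (30 s and 60 s, respectively), and a failure condition when the same action repeats three times consecutively. The following is a minimal sketch of such a loop; the names (`run_task`, `agent.act`, `env.step`) are hypothetical stand-ins, not the AgentStudio API.

```python
# Hedged sketch of the evaluation limits quoted in the Experiment Setup row.
# All identifiers here are illustrative assumptions, not AgentStudio code.
import time

MAX_STEPS = {"gui": 30, "non_gui": 1}      # max execution steps per task type
MAX_SECONDS = {"gui": 60, "non_gui": 30}   # max wall-clock time per task type
REPEAT_LIMIT = 3                           # same action 3x in a row => failure


def run_task(agent, env, task_type="gui"):
    """Run one task under the step/time/repeat limits described in the paper.

    `agent.act` is assumed to query the model with greedy decoding
    (temperature 0); `env.step` returns (done, success).
    """
    deadline = time.time() + MAX_SECONDS[task_type]
    recent = []
    for _ in range(MAX_STEPS[task_type]):
        if time.time() > deadline:
            return "failure: time limit"
        action = agent.act(env.observe())
        recent.append(action)
        # Executing the same action three times consecutively counts as failure.
        if recent[-REPEAT_LIMIT:] == [action] * REPEAT_LIMIT:
            return "failure: repeated action"
        done, success = env.step(action)
        if done:
            return "success" if success else "failure"
    return "failure: step limit"
```

Under this reading, the step and time limits bound runaway episodes, while the repeated-action check catches models stuck in a loop without waiting out the full budget.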