AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Authors: Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We experiment with baseline agents to test ANDROIDWORLD and provide initial results on the benchmark. Our best agent can complete 30.6% of ANDROIDWORLD's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world.
Researcher Affiliation Industry Christopher Rawles1, Sarah Clinckemaillie2, Yifan Chang2, Jonathan Waltz2, Gabrielle Lau2, Marybeth Fair2, Alice Li1, William Bishop1, Wei Li1, Folawiyo Campbell-Ajala1, Daniel Toyama1, Robert Berry1, Divya Tyamagundlu2, Timothy Lillicrap1, and Oriana Riva1 (1Google DeepMind, 2Google)
Pseudocode Yes Listing 1: Pseudo-code representation of the action space. and Below we show an example of the task evaluation for a Send Sms task, which involves sending and validating a text message. The pseudocode illustrates the task initialization, success check, and parameter generation methods.
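The task-evaluation pattern quoted above (initialization, success check, and parameter generation) can be sketched as follows. This is an illustrative mock, not the actual AndroidWorld API: the class and method names are assumptions, and the in-memory message log stands in for querying real device state such as the SMS database.

```python
import random
import string


class SendSmsTask:
    """Illustrative sketch of a parameterized task with the three pieces
    the review quotes: initialization, a success check, and random
    parameter generation. Names here are hypothetical."""

    template = "Send a text message to {number} saying: {message}"

    def __init__(self, params):
        self.params = params
        self.goal = self.template.format(**params)
        self._sent_messages = []  # stand-in for the device's SMS state

    def initialize_task(self):
        # In AndroidWorld this would reset relevant app state on the device.
        self._sent_messages.clear()

    def record_sent_message(self, number, message):
        # Stand-in for the agent actually sending an SMS on the device.
        self._sent_messages.append((number, message))

    def is_successful(self):
        # AndroidWorld validates against real device state; here we check
        # the stand-in log for the exact requested number and message.
        return (self.params["number"], self.params["message"]) in self._sent_messages

    @staticmethod
    def generate_random_params(rng):
        # Randomized parameters make each task instance distinct.
        number = "555-" + "".join(rng.choice(string.digits) for _ in range(4))
        message = rng.choice(["See you soon!", "Running late.", "Call me back."])
        return {"number": number, "message": message}
```

Passing a seeded `random.Random` into `generate_random_params` is what would make a generated task suite reproducible across runs.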
Open Source Code Yes ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world.
Open Datasets Yes ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world. and In addition to the 116 Android tasks, we extend ANDROIDWORLD with web tasks by integrating the MiniWoB++ (Shi et al., 2017; Liu et al., 2018a) benchmark into it.
Dataset Splits No Unlike existing interactive environments, which provide a static test set, ANDROIDWORLD dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. The paper describes dynamic task generation with random parameters and seeds, rather than fixed dataset splits for training, validation, and testing.
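The dynamic construction described in this row means there is no fixed train/val/test split: a seed deterministically regenerates a task instance, while different seeds yield different natural-language instantiations. A minimal sketch of that idea, with an assumed template and name pool (not the paper's actual task definitions):

```python
import random

# Hypothetical task template and parameter pool for illustration only.
TEMPLATE = "Create a new contact named {name} with phone number {number}."
NAMES = ["Alex Kim", "Jordan Lee", "Sam Rivera"]


def instantiate_task(seed: int) -> str:
    """Regenerate a concrete natural-language task from a seed.

    The same seed always yields the same instance, so evaluation is
    reproducible even though the task suite is generated rather than
    stored as a static split."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    number = "".join(rng.choice("0123456789") for _ in range(7))
    return TEMPLATE.format(name=name, number=number)
```

Because instantiation is a pure function of the seed, publishing the generator plus the seeds used is equivalent to publishing a static test set.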
Hardware Specification No The Android OS is fixed, consisting of a Pixel 6 emulator running Android 13. and To assess ANDROIDWORLD's robustness to OS variations, we tested on a Pixel 5 (Android 12) alongside our primary setup (Pixel 6, Android 13). The paper mentions the Android emulator and device types but does not specify the underlying hardware (CPU, GPU, etc.) used to run the experiments or the agents.
Software Dependencies No It connects agents to the Android OS by leveraging the Python library AndroidEnv (Toyama et al., 2021) to connect to the freely available Android Emulator. The paper mentions the Python library AndroidEnv but does not provide a specific version number for it or other key software dependencies.
Experiment Setup Yes We set the seed to 30 and the temperature to 0 to aid reproducibility. Each task has a maximum allowed number of steps (detailed in Appendix F), typically set to twice the number of steps needed by human annotators to complete the task. and It is zero-shot, integrating ReAct-style (Yao et al., 2022) and Reflexion-style (Shinn et al., 2023) prompting to consume user instructions and screen content, reason, take actions, and update its decision-making based on the outcome of its actions.
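The reported setup (seed 30, temperature 0, step budget of twice the human step count) can be captured in a small config sketch. The field and function names below are assumptions for illustration, not the paper's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical config mirroring the setup the review quotes."""
    seed: int = 30          # fixed seed for reproducible task generation
    temperature: float = 0.0  # greedy decoding for the LLM agent


def max_allowed_steps(human_steps: int) -> int:
    # The per-task budget is typically twice the number of steps
    # human annotators needed to complete the task.
    return 2 * human_steps
```

Freezing both the task-generation seed and the decoding temperature removes the two main sources of run-to-run variance, which is why the paper pins both.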