AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Authors: Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We experiment with baseline agents to test ANDROIDWORLD and provide initial results on the benchmark. Our best agent can complete 30.6% of ANDROIDWORLD's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world.
Researcher Affiliation Industry Christopher Rawles1, Sarah Clinckemaillie2, Yifan Chang2, Jonathan Waltz2, Gabrielle Lau2, Marybeth Fair2, Alice Li1, William Bishop1, Wei Li1, Folawiyo Campbell-Ajala1, Daniel Toyama1, Robert Berry1, Divya Tyamagundlu2, Timothy Lillicrap1, and Oriana Riva1 (1Google DeepMind, 2Google)
Pseudocode Yes Listing 1: Pseudo-code representation of the action space. and Below we show an example of the task evaluation for a Send Sms task, which involves sending and validating a text message. The pseudocode illustrates the task initialization, success check, and parameter generation methods.
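The task-evaluation pattern quoted above (initialization, success check, and parameter generation) can be sketched as follows. This is an illustrative mock, not the actual AndroidWorld API: the class and method names are assumptions, and the in-memory message log stands in for querying real device state such as the SMS database.

```python
import random
import string


class SendSmsTask:
    """Illustrative sketch of a parameterized task with the three pieces
    the review quotes: initialization, a success check, and random
    parameter generation. Names here are hypothetical."""

    template = "Send a text message to {number} saying: {message}"

    def __init__(self, params):
        self.params = params
        self.goal = self.template.format(**params)
        self._sent_messages = []  # stand-in for the device's SMS state

    def initialize_task(self):
        # In AndroidWorld this would reset relevant app state on the device.
        self._sent_messages.clear()

    def record_sent_message(self, number, message):
        # Stand-in for the agent actually sending an SMS on the device.
        self._sent_messages.append((number, message))

    def is_successful(self):
        # AndroidWorld validates against real device state; here we check
        # the stand-in log for the exact requested number and message.
        return (self.params["number"], self.params["message"]) in self._sent_messages

    @staticmethod
    def generate_random_params(rng):
        # Randomized parameters make each task instance distinct.
        number = "555-" + "".join(rng.choice(string.digits) for _ in range(4))
        message = rng.choice(["See you soon!", "Running late.", "Call me back."])
        return {"number": number, "message": message}
```

Passing a seeded `random.Random` into `generate_random_params` is what would make a generated task suite reproducible across runs.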
Open Source Code Yes ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world.
Open Datasets Yes ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world. and In addition to the 116 Android tasks, we extend ANDROIDWORLD with web tasks by integrating the MiniWoB++ (Shi et al., 2017; Liu et al., 2018a) benchmark into it.
Dataset Splits No Unlike existing interactive environments, which provide a static test set, ANDROIDWORLD dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. The paper describes dynamic task generation with random parameters and seeds, rather than fixed dataset splits for training, validation, and testing.
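The dynamic construction described in this row means there is no fixed train/val/test split: a seed deterministically regenerates a task instance, while different seeds yield different natural-language instantiations. A minimal sketch of that idea, with an assumed template and name pool (not the paper's actual task definitions):

```python
import random

# Hypothetical task template and parameter pool for illustration only.
TEMPLATE = "Create a new contact named {name} with phone number {number}."
NAMES = ["Alex Kim", "Jordan Lee", "Sam Rivera"]


def instantiate_task(seed: int) -> str:
    """Regenerate a concrete natural-language task from a seed.

    The same seed always yields the same instance, so evaluation is
    reproducible even though the task suite is generated rather than
    stored as a static split."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    number = "".join(rng.choice("0123456789") for _ in range(7))
    return TEMPLATE.format(name=name, number=number)
```

Because instantiation is a pure function of the seed, publishing the generator plus the seeds used is equivalent to publishing a static test set.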
Hardware Specification No The Android OS is fixed, consisting of a Pixel 6 emulator running Android 13. and To assess ANDROIDWORLD's robustness to OS variations, we tested on a Pixel 5 (Android 12) alongside our primary setup (Pixel 6, Android 13). The paper mentions the Android emulator and device types but does not specify the underlying hardware (CPU, GPU, etc.) used to run the experiments or the agents.
Software Dependencies No It connects agents to the Android OS by leveraging the Python library AndroidEnv (Toyama et al., 2021) to connect to the freely available Android Emulator. The paper mentions the Python library AndroidEnv but does not provide a specific version number for it or other key software dependencies.
Experiment Setup Yes We set the seed to 30 and the temperature to 0 to aid reproducibility. Each task has a maximum allowed number of steps (detailed in Appendix F), typically set to twice the number of steps needed by human annotators to complete the task. and It is zero-shot, integrating ReAct-style (Yao et al., 2022) and Reflexion-style (Shinn et al., 2023) prompting to consume user instructions and screen content, reason, take actions, and update its decision-making based on the outcome of its actions.
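The reported setup (seed 30, temperature 0, step budget of twice the human step count) can be captured in a small config sketch. The field and function names below are assumptions for illustration, not the paper's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical config mirroring the setup the review quotes."""
    seed: int = 30          # fixed seed for reproducible task generation
    temperature: float = 0.0  # greedy decoding for the LLM agent


def max_allowed_steps(human_steps: int) -> int:
    # The per-task budget is typically twice the number of steps
    # human annotators needed to complete the task.
    return 2 * human_steps
```

Freezing both the task-generation seed and the decoding temperature removes the two main sources of run-to-run variance, which is why the paper pins both.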