AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Authors: Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with baseline agents to test ANDROIDWORLD and provide initial results on the benchmark. Our best agent can complete 30.6% of ANDROIDWORLD's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world. |
| Researcher Affiliation | Industry | Christopher Rawles¹, Sarah Clinckemaillie², Yifan Chang², Jonathan Waltz², Gabrielle Lau², Marybeth Fair², Alice Li¹, William Bishop¹, Wei Li¹, Folawiyo Campbell-Ajala¹, Daniel Toyama¹, Robert Berry¹, Divya Tyamagundlu², Timothy Lillicrap¹, and Oriana Riva¹. ¹Google DeepMind, ²Google |
| Pseudocode | Yes | Listing 1: Pseudo-code representation of the action space. and Below we show an example of the task evaluation for a SendSms task, which involves sending and validating a text message. The pseudocode illustrates the task initialization, success check, and parameter generation methods. |
| Open Source Code | Yes | ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world. |
| Open Datasets | Yes | ANDROIDWORLD and the experiments in this paper are available at https://github.com/google-research/android_world. and In addition to the 116 Android tasks, we extend ANDROIDWORLD with web tasks by integrating the MiniWoB++ (Shi et al., 2017; Liu et al., 2018a) benchmark into it. |
| Dataset Splits | No | Unlike existing interactive environments, which provide a static test set, ANDROIDWORLD dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. The paper describes dynamic task generation with random parameters and seeds, rather than fixed dataset splits for training, validation, and testing. |
| Hardware Specification | No | The Android OS is fixed, consisting of a Pixel 6 emulator running Android 13. and To assess ANDROIDWORLD's robustness to OS variations, we tested on a Pixel 5 (Android 12) alongside our primary setup (Pixel 6, Android 13). The paper mentions the Android emulator and device types but does not specify the underlying hardware (CPU, GPU, etc.) used to run the experiments or the agents. |
| Software Dependencies | No | It connects agents to the Android OS by leveraging the Python library AndroidEnv (Toyama et al., 2021) to connect to the freely available Android Emulator. The paper mentions the Python library AndroidEnv but does not provide a specific version number for it or other key software dependencies. |
| Experiment Setup | Yes | We set the seed to 30 and the temperature to 0 to aid reproducibility. Each task has a maximum allowed number of steps (detailed in Appendix F), typically set to twice the number of steps needed by human annotators to complete the task. and It is zero-shot, integrating ReAct-style (Yao et al., 2022) and Reflexion-style (Shinn et al., 2023) prompting to consume user instructions and screen content, reason, take actions, and update its decision-making based on the outcome of its actions. |
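The Pseudocode and Dataset Splits rows describe the paper's core design: each task is parameterized, expressed as a natural-language goal, and judged by a durable success check against device state. The sketch below illustrates that structure for the quoted SendSms example; the class name, method names, and parameter values are illustrative assumptions, not the benchmark's actual API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class SendSmsTask:
    """Illustrative AndroidWorld-style parameterized task (not the real API)."""
    params: dict = field(default_factory=dict)

    @classmethod
    def generate_random_params(cls, rng: random.Random) -> dict:
        # Random parameters make each instantiation unique, which is how the
        # benchmark produces tasks "in unlimited ways" rather than a static set.
        numbers = ["+15551230001", "+15551230002", "+15551230003"]
        messages = ["See you at 6pm", "Running late", "Call me back"]
        return {"number": rng.choice(numbers), "message": rng.choice(messages)}

    def goal(self) -> str:
        # Natural-language instruction handed to the agent.
        return (f"Send a text message to {self.params['number']} "
                f"saying: {self.params['message']}")

    def is_successful(self, device_state: dict) -> bool:
        # Success check inspects resulting app state, not the UI
        # trajectory the agent happened to take.
        sent = device_state.get("sent_sms", [])
        return any(m["number"] == self.params["number"]
                   and m["body"] == self.params["message"] for m in sent)

rng = random.Random(30)  # the paper fixes the seed to 30 for reproducibility
task = SendSmsTask(params=SendSmsTask.generate_random_params(rng))
```

This is also why the review marks Dataset Splits as "No": with seeded random parameter generation, reproducible evaluation comes from fixing the seed, not from frozen train/validation/test partitions.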