SPA-BENCH: A COMPREHENSIVE BENCHMARK FOR SMARTPHONE AGENT EVALUATION
Authors: Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Yixing Li, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications. |
| Researcher Affiliation | Collaboration | Jingxuan Chen1*, Derek Yuen1*, Bin Xie2, Yuhao Yang1, Gongwei Chen2, Zhihao Wu1, Yixing Li2, Xurui Zhou2, Weiwen Liu1, Shuai Wang1, Kaiwen Zhou1, Rui Shao2, Liqiang Nie2, Yasheng Wang1, Jianye Hao1,3, Jun Wang4, Kun Shao1. 1Huawei Noah's Ark Lab, 2Harbin Institute of Technology, Shenzhen, 3Tianjin University, 4AI Centre, University College London |
| Pseudocode | No | The paper describes procedures for success detection and agent interaction but does not present them in a structured pseudocode or algorithm block format. For instance, sections like "5.2 SUCCESS DETECTION" and Appendix D.4 "PROMPTING TEMPLATES" describe steps but are not formatted as pseudocode. |
| Open Source Code | Yes | SPA-BENCH is available at https://ai-agents-2030.github.io/SPA-Bench/. |
| Open Datasets | Yes | In this paper, we present SPA-BENCH, a comprehensive Smart Phone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-BENCH is available at https://ai-agents-2030.github.io/SPA-Bench/. SPA-BENCH builds a collection of smartphone agent tasks across both English and Chinese apps, featuring 39 English and 29 Chinese apps divided into eight categories based on core features (see Appendix B.1). The collection includes 150 single-app tasks and 20 cross-app tasks for each language. |
| Dataset Splits | Yes | In total, SPA-BENCH includes 300 single-app and 40 cross-app tasks, evenly split between English and Chinese. Each task may consist of multiple subtasks (e.g., adding, modifying, deleting, searching). The distribution of steps performed by humans for these tasks, categorised by task type, is illustrated in Appendix B.5. Table 4: Single-app English tasks. Table 5: Single-app Chinese tasks. Table 6: Cross-app English tasks. Table 7: Cross-app Chinese tasks. |
| Hardware Specification | Yes | For instance, a 24-core CPU with 64GB RAM can support up to eight emulators or worker processes simultaneously, depending on the agents' resource needs. |
| Software Dependencies | No | The paper mentions specific software and models such as "GPT-4o", "GPT-4o-mini", "Qwen-VL-Chat", "Adb Keyboard", "UIAutomator2", and "Paddle OCR". However, it does not consistently provide version numbers for these components, which the stated criteria require for a 'Yes' classification. |
| Experiment Setup | Yes | For all other settings, the default configurations provided by the developers were used. Agents were allowed to execute up to twice the number of golden steps for a task, after which execution was halted. In this initial experiment, we tested the seven agents that follow the agentic workflow on the ten open-ended tasks. Given the open-ended nature of these tasks and the absence of predefined golden steps, agents were allowed a maximum of 20 steps to complete each task. |
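The step-limit rule quoted in the Experiment Setup row can be made concrete with a short sketch. This is an illustrative reconstruction, not code from the paper: the `agent_step` callable and the return dictionary are hypothetical stand-ins for SPA-BENCH's actual device-interaction loop; only the budget rule (twice the golden steps, or a flat cap of 20 for open-ended tasks with no golden steps) comes from the text above.

```python
def step_budget(golden_steps=None, open_ended_cap=20):
    """Maximum steps an agent may execute before execution is halted.

    Tasks with a reference trajectory get twice the number of golden
    steps; open-ended tasks (no predefined golden steps) get a flat cap.
    """
    if golden_steps is None:  # open-ended task
        return open_ended_cap
    return 2 * golden_steps


def run_episode(agent_step, golden_steps=None):
    """Drive a hypothetical agent until it finishes or exhausts its budget.

    `agent_step` is any callable that returns True once the agent
    declares the task complete; it stands in for one real UI action.
    """
    budget = step_budget(golden_steps)
    for step in range(1, budget + 1):
        if agent_step():
            return {"finished": True, "steps": step}
    return {"finished": False, "steps": budget}  # halted at the budget
```

Under this reading, an agent that has not finished within its budget is simply cut off and the episode is scored as-is, which matches the paper's "after which execution was halted" phrasing.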