Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Keunho Jang, Zheng Hui

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. To demonstrate WINDOWS AGENT ARENA's capabilities, we provide extensive ablations and benchmark its performance (Section 4), with the best configuration achieving a 19.5% success rate on WINDOWS AGENT ARENA using Set-of-Marks prompting (Yang et al., 2023) combined with the system's UIA accessibility tree and pixel-based element detectors. We test and analyze several variants of state-of-the-art multi-modal LLMs as agent backbones (Table 4), ranging from large, popular closed-source multi-modal staples like GPT-4V, GPT-4o, and OpenAI's recent o1, to smaller open-source multi-modal models sometimes neglected by agentic benchmarks, such as Phi-3/3.5-V.
Researcher Affiliation: Collaboration. 1Microsoft, 2Carnegie Mellon University, 3Columbia University. Correspondence to: Dan Zhao <EMAIL>.
Pseudocode: No. The paper provides detailed descriptions of the agent architecture and prompts, but it does not present formal pseudocode for its methods.
Open Source Code: No. We make our code available as open-source contributions2 in the hopes that WINDOWS AGENT ARENA can make agent research more accessible while also facilitating further agent development, faster experimentation, and data generation at scale in the Windows environment. 2GitHub link will be available after paper review.
Open Datasets: Yes. We introduce WINDOWS AGENT ARENA: an agentic benchmark consisting of 154 tasks distributed across multiple apps and domains (Table 2). Unlike other benchmarks, e.g., (Xie et al., 2024), we focus exclusively on the Windows OS, providing significantly more tasks in both number and variety to address the unique aspects of Windows OS. Our benchmark provides open/free access to a Windows OS environment (Sec. 3.6), allowing users to easily add new tasks, install new programs, etc., atop ours for their own purposes.
Dataset Splits: No. The paper describes the creation of 154 tasks and their distribution across domains and difficulty, but it does not specify any training, validation, or test splits for these tasks, as the tasks themselves constitute the benchmark for evaluation.
Hardware Specification: Yes. We use Azure Machine Learning jobs to parallelize the benchmark evaluation using compute instances. The process is similar to the local setup, but the VMs are instantiated and terminated with each experiment submission. ...Table 7 provides a non-exhaustive list of CPU VMs that support our setup as well as their current costs as of August 2024 (subject to change). We rely primarily on the Standard D8 v3 machine for our experiments.
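The fan-out described above (one VM per task batch, instantiated and torn down per run) can be sketched as follows. This is a minimal illustration, not the paper's harness: `run_task` is a hypothetical placeholder for the per-VM agent run, and the thread pool stands in for Azure ML job scheduling.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-task runner: in the real setup each worker would drive
# a Windows VM (e.g., a Standard D8 v3 instance) and return an eval result.
def run_task(task_id: str) -> dict:
    # Placeholder result; a real runner would execute the agent in the VM.
    return {"task_id": task_id, "success": False}

def evaluate_parallel(task_ids, max_workers=4):
    """Fan benchmark tasks out across workers, mirroring how the 154
    tasks are parallelized across short-lived compute instances."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, task_ids))

results = evaluate_parallel([f"task_{i}" for i in range(8)])
success_rate = sum(r["success"] for r in results) / len(results)
```

The map preserves task order, so per-task results can be joined back to the benchmark's task list for aggregate success-rate reporting.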
Software Dependencies: No. The paper mentions software components like 'pyautogui/python code execution', 'Python Flask server', 'pywinauto library', 'QEMU and KVM', and 'dockur/windows Docker image', but it does not provide specific version numbers for any of these components.
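Because versions are unpinned, anyone reproducing the setup would need to record their own environment. A minimal sketch using the standard library (the helper name is an assumption; the package names are those cited in the paper):

```python
from importlib import metadata

def pinned_versions(packages):
    """Return {package: installed version or None}, suitable for turning
    an unpinned dependency list into exact pins (e.g., requirements.txt)."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    return versions

# Dependencies the paper names without version numbers.
print(pinned_versions(["pyautogui", "Flask", "pywinauto"]))
```

Entries that resolve to a version string can be written out as `pkg==version` pins; `None` entries flag packages missing from the current environment.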
Experiment Setup: Yes. We use chain-of-thought prompting (Wei et al., 2022) to instruct the agent to reason about the current state of the computer and its own past actions, and to decide on the most appropriate next action (full prompts found in Appendix D). All variations of our agent receive as input the title of the current foreground window, titles for all other windows or browser tabs currently open, and a representation of the current screen. We consider several methods to process the screen representation for the agent as input and create Set-of-Marks (SoMs): UIA tree parsing, DOM tree parsing, OCR, icon and image detection, and OmniParser. Appendix D provides full prompts used.
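The input construction described above can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the function name and the `(label, (x, y, w, h))` element format are hypothetical, standing in for whatever the UIA/DOM/OCR parsers emit.

```python
def build_som_prompt(window_title, open_windows, elements):
    """Assemble a Set-of-Marks style textual screen representation:
    each detected UI element is assigned a numeric mark the agent can
    reference when choosing its next action."""
    lines = [
        f"Foreground window: {window_title}",
        "Other open windows/tabs: " + ", ".join(open_windows),
        "Screen elements (mark: label @ box):",
    ]
    for mark, (label, box) in enumerate(elements, start=1):
        lines.append(f"  [{mark}] {label} @ {box}")
    return "\n".join(lines)

prompt = build_som_prompt(
    "Notepad",
    ["Edge - Bing", "File Explorer"],
    [("File menu", (0, 0, 40, 20)), ("Text area", (0, 25, 800, 560))],
)
```

Numbering elements this way lets the agent act by mark (e.g., "click [1]") instead of raw coordinates, which is the core idea behind Set-of-Marks prompting.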