OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning

Authors: Xiaoqiang Wang, Bang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate OSCAR's effectiveness through extensive experiments on diverse benchmarks across desktop and mobile platforms, where it transforms complex workflows into simple natural language commands, significantly boosting user productivity. ... We validated OSCAR's effectiveness and generalizability across diverse benchmarks involving both desktop and smartphone OS environments. On the GAIA (Mialon et al., 2023) benchmark, OSCAR outperformed previous methods, achieving a 28.7% average success rate, with a notable 13.5% success rate on the most complex Level 3 tasks, nearly doubling the prior state-of-the-art performance.
Researcher Affiliation | Academia | 1 DIRO & Institut Courtois, Université de Montréal; 2 Mila – Quebec AI Institute; 3 Canada CIFAR AI Chair
Pseudocode | No | The paper describes the state machine model in Section 2.1 and illustrates it in Figure 2, outlining the state transitions and components. However, it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting.
Open Source Code | No | The paper mentions employing the widely used PyAutoGUI library for mouse and keyboard control (Section 2.4) and provides a link to its documentation. However, it does not contain an explicit statement about releasing the source code for the OSCAR methodology itself, nor does it provide a direct link to a code repository for their implementation.
Open Datasets | Yes | We evaluate OSCAR on real-world workflow automation benchmarks involving complex user requests. The first benchmark is GAIA (Mialon et al., 2023), which consists of 466 question-answering (QA) tasks structured into three levels... The second benchmark is OSWorld (Xie et al., 2024b), an interactive dynamic environment with real-time OS feedback. It includes 369 tasks... Additionally, similar to OSWorld, AndroidWorld (Rawles et al., 2024) provides a dynamic smartphone OS environment with 116 tasks spread across 20 diverse applications...
Dataset Splits | No | The paper describes the structure and number of tasks for the GAIA, OSWorld, and AndroidWorld benchmarks used for evaluation. While it details the task categories and difficulty levels within these benchmarks, it does not explicitly provide specific training, validation, or test dataset splits that were used for reproducing the experiments. It mentions using '8 in-context demonstration examples' for the base model, which is a prompting strategy rather than a dataset split.
Hardware Specification | Yes | We conduct evaluation experiments on 2 A100 GPUs. Since fine-tuning the base model is not involved and it is accessed via API, the GPU is mainly required for the Detection+OCR pipeline. As this pipeline is efficient on CPU machines, all experiments can also run on regular Windows 11 machines with WSL virtualization support, which is used for encapsulating the development and test environments in Docker containers.
Software Dependencies | No | To operationalize OSCAR's action space, we employ the widely used PyAutoGUI library for mouse and keyboard control... we set the base model of OSCAR and all baseline models to GPT-4o, i.e., gpt-4o-2024-05-13, except for the results on GAIA in Table 1, which are based on GPT-4-turbo, i.e., gpt-4-turbo-2024-04-09. ... Specifically, we follow Gao et al. (2023); Wang et al. (2024a) and use YOLO-v8 (Reis et al., 2023) and Google OCR (Google Cloud, 2024) to parse the GUI into SoM visual prompts... While specific identifiers are given for the GPT-4o and GPT-4-turbo models, specific version numbers are not provided for PyAutoGUI, YOLO-v8, or Google OCR, which are key software components mentioned.
Experiment Setup | Yes | The temperature of response generation is set to 0.1 to reduce the variance in text generation. We provide 8 in-context demonstration examples to help the model better understand the instruction. ... The maximum number of allowed attempts per run is set to 4. We report the average results across 4 runs for each model on each benchmark.
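The Software Dependencies row notes that OSCAR grounds its action space in PyAutoGUI. The paper does not release code, so the sketch below is purely illustrative: a minimal action dispatcher (the `Action` class and the action names are our assumptions, not from the paper) that routes structured actions to backend callables. In a real agent the backend would be PyAutoGUI functions such as `pyautogui.click`, `pyautogui.write`, and `pyautogui.hotkey`; here a recording backend is used so the sketch runs headlessly.

```python
# Hypothetical sketch, NOT the authors' implementation: routing LLM-emitted
# structured actions to mouse/keyboard calls in a PyAutoGUI-style backend.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str        # e.g. "click", "type", "hotkey" (illustrative names)
    args: tuple = ()

def make_executor(backend: dict[str, Callable]) -> Callable[[Action], None]:
    """Return a function that dispatches an Action to the matching backend call.

    A real agent might pass:
        {"click": pyautogui.click, "type": pyautogui.write,
         "hotkey": pyautogui.hotkey}
    """
    def execute(action: Action) -> None:
        if action.kind not in backend:
            raise ValueError(f"unsupported action: {action.kind}")
        backend[action.kind](*action.args)
    return execute

# Headless demonstration: record calls instead of moving the real mouse.
log: list[tuple] = []
executor = make_executor({
    "click":  lambda x, y: log.append(("click", x, y)),
    "type":   lambda text: log.append(("type", text)),
    "hotkey": lambda *keys: log.append(("hotkey", keys)),
})
executor(Action("click", (120, 48)))
executor(Action("type", ("hello",)))
```

The pluggable backend also makes it easy to dry-run an agent trajectory for debugging before letting it drive a live desktop.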
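The Experiment Setup row fully pins down the evaluation protocol: at most 4 attempts per run, with results averaged over 4 runs. A minimal sketch of that protocol (the function names and the stubbed tasks are ours; the constants come from the paper) might look like:

```python
# Illustrative sketch of the reported evaluation protocol, not the authors' code.
from statistics import mean

MAX_ATTEMPTS = 4   # maximum allowed attempts per run (from the paper)
NUM_RUNS = 4       # results are averaged across 4 runs (from the paper)

def run_task(attempt_task, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """A run of one task succeeds if any of up to max_attempts attempts passes."""
    return any(attempt_task() for _ in range(max_attempts))

def average_success_rate(tasks, num_runs: int = NUM_RUNS) -> float:
    """Mean per-run success rate over num_runs independent runs."""
    rates = [mean(run_task(t) for t in tasks) for _ in range(num_runs)]
    return mean(rates)

# Toy usage with stubbed deterministic tasks: one always succeeds, one always fails.
tasks = [lambda: True, lambda: False]
rate = average_success_rate(tasks)   # 0.5 for these stubs
```

Averaging over repeated runs matters here because, even at temperature 0.1, LLM responses remain stochastic, so a single run would give a noisy success-rate estimate.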