OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning

Authors: Xiaoqiang Wang, Bang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate OSCAR's effectiveness through extensive experiments on diverse benchmarks across desktop and mobile platforms, where it transforms complex workflows into simple natural language commands, significantly boosting user productivity. ... We validated OSCAR's effectiveness and generalizability across diverse benchmarks involving both desktop and smartphone OS environments. On the GAIA (Mialon et al., 2023) benchmark, OSCAR outperformed previous methods, achieving a 28.7% average success rate, with a notable 13.5% success rate on the most complex Level 3 tasks, nearly doubling the prior state-of-the-art performance.
Researcher Affiliation | Academia | 1 DIRO & Institut Courtois, Université de Montréal; 2 Mila – Quebec AI Institute; 3 Canada CIFAR AI Chair
Pseudocode | No | The paper describes the state machine model in Section 2.1 and illustrates it in Figure 2, outlining the state transitions and components. However, it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting.
Open Source Code | No | The paper mentions employing the widely used PyAutoGUI library for mouse and keyboard control (Section 2.4) and provides a link to its documentation. However, it does not contain an explicit statement about releasing the source code for the OSCAR methodology itself, nor does it provide a direct link to a code repository for their implementation.
Open Datasets | Yes | We evaluate OSCAR on real-world workflow automation benchmarks involving complex user requests. The first benchmark is GAIA (Mialon et al., 2023), which consists of 466 question-answering (QA) tasks structured into three levels... The second benchmark is OSWorld (Xie et al., 2024b), an interactive dynamic environment with real-time OS feedback. It includes 369 tasks... Additionally, similar to OSWorld, AndroidWorld (Rawles et al., 2024) provides a dynamic smartphone OS environment with 116 tasks spread across 20 diverse applications...
Dataset Splits | No | The paper describes the structure and number of tasks for the GAIA, OSWorld, and AndroidWorld benchmarks used for evaluation. While it details the task categories and difficulty levels within these benchmarks, it does not explicitly provide specific training, validation, or test dataset splits that were used for reproducing the experiments. It mentions using '8 in-context demonstration examples' for the base model, which is a prompting strategy rather than a dataset split.
Hardware Specification | Yes | We conduct evaluation experiments on 2 A100 GPUs. Since fine-tuning the base model is not involved and it is accessed via API, the GPU is mainly required for the Detection+OCR pipeline. As this pipeline is efficient on CPU machines, all experiments can also run on regular Windows 11 machines with WSL virtualization support, which is used for encapsulating the development and test environments in Docker containers.
Software Dependencies | No | To operationalize OSCAR's action space, we employ the widely used PyAutoGUI library for mouse and keyboard control... we set the base model of OSCAR and all baseline models to GPT-4o, i.e., gpt-4o-2024-05-13, except for the results on GAIA in Table 1, which are based on GPT-4-turbo, i.e., gpt-4-turbo-2024-04-09. ... Specifically, we follow Gao et al. (2023); Wang et al. (2024a) and use YOLO-v8 (Reis et al., 2023) and Google OCR (Google Cloud, 2024) to parse the GUI into SoM visual prompts... While specific identifiers are given for the GPT-4o and GPT-4-turbo models, specific version numbers are not provided for PyAutoGUI, YOLO-v8, or Google OCR, which are key software components mentioned.
Experiment Setup | Yes | The temperature of response generation is set to 0.1 to reduce the variance in text generation. We provide 8 in-context demonstration examples to help the model better understand the instruction. ... The maximum number of allowed attempts per run is set to 4. We report the average results across 4 runs for each model on each benchmark.
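The Software Dependencies row notes that OSCAR grounds its action space in PyAutoGUI. The paper does not release code, so the sketch below is purely illustrative: a minimal action dispatcher (the `Action` class and the action names are our assumptions, not from the paper) that routes structured actions to backend callables. In a real agent the backend would be PyAutoGUI functions such as `pyautogui.click`, `pyautogui.write`, and `pyautogui.hotkey`; here a recording backend is used so the sketch runs headlessly.

```python
# Hypothetical sketch, NOT the authors' implementation: routing LLM-emitted
# structured actions to mouse/keyboard calls in a PyAutoGUI-style backend.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str        # e.g. "click", "type", "hotkey" (illustrative names)
    args: tuple = ()

def make_executor(backend: dict[str, Callable]) -> Callable[[Action], None]:
    """Return a function that dispatches an Action to the matching backend call.

    A real agent might pass:
        {"click": pyautogui.click, "type": pyautogui.write,
         "hotkey": pyautogui.hotkey}
    """
    def execute(action: Action) -> None:
        if action.kind not in backend:
            raise ValueError(f"unsupported action: {action.kind}")
        backend[action.kind](*action.args)
    return execute

# Headless demonstration: record calls instead of moving the real mouse.
log: list[tuple] = []
executor = make_executor({
    "click":  lambda x, y: log.append(("click", x, y)),
    "type":   lambda text: log.append(("type", text)),
    "hotkey": lambda *keys: log.append(("hotkey", keys)),
})
executor(Action("click", (120, 48)))
executor(Action("type", ("hello",)))
```

The pluggable backend also makes it easy to dry-run an agent trajectory for debugging before letting it drive a live desktop.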
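The Experiment Setup row fully pins down the evaluation protocol: at most 4 attempts per run, with results averaged over 4 runs. A minimal sketch of that protocol (the function names and the stubbed tasks are ours; the constants come from the paper) might look like:

```python
# Illustrative sketch of the reported evaluation protocol, not the authors' code.
from statistics import mean

MAX_ATTEMPTS = 4   # maximum allowed attempts per run (from the paper)
NUM_RUNS = 4       # results are averaged across 4 runs (from the paper)

def run_task(attempt_task, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """A run of one task succeeds if any of up to max_attempts attempts passes."""
    return any(attempt_task() for _ in range(max_attempts))

def average_success_rate(tasks, num_runs: int = NUM_RUNS) -> float:
    """Mean per-run success rate over num_runs independent runs."""
    rates = [mean(run_task(t) for t in tasks) for _ in range(num_runs)]
    return mean(rates)

# Toy usage with stubbed deterministic tasks: one always succeeds, one always fails.
tasks = [lambda: True, lambda: False]
rate = average_success_rate(tasks)   # 0.5 for these stubs
```

Averaging over repeated runs matters here because, even at temperature 0.1, LLM responses remain stochastic, so a single run would give a noisy success-rate estimate.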