Cradle: Empowering Foundation Agents towards General Computer Control
Authors: Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CRADLE exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Institute of Software, Chinese Academy of Sciences 3Beijing Academy of Artificial Intelligence 4Peking University 5Skywork AI 6The University of Hong Kong 7The Chinese University of Hong Kong, Shenzhen 8National University of Singapore. Correspondence to: Weihao Tan <EMAIL>, Bo An <EMAIL>, Shuicheng Yan <EMAIL>, Zongqing Lu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Skill Generation. Input: toolbar with objects, skill template. Output: procedure memory with generated skills. 1. Initialize procedure memory. 2. For each object in the toolbar: hover the mouse on the object to get its description; generate a skill using GPT-4o based on the object description and the skill template; store the generated skill in procedure memory; execute the generated skill to enter the second-level toolbar. 3. For each object in the second-level toolbar: hover the mouse on the object to get its description; generate a skill using GPT-4o based on the object description and the skill template; store the generated skill in procedure memory. 4. Return procedure memory. |
| Open Source Code | Yes | Video demos and code can be found at https://baai-agents.github.io/Cradle. |
| Open Datasets | Yes | Experimental results show that CRADLE exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. [...] Similar to SIMA (Raad et al., 2024), we apply human evaluation to all tasks across software and games, except for OSWorld (Xie et al., 2024), which provides automatic evaluation scripts. [...] MineRL: A large-scale dataset of Minecraft demonstrations (Guss et al., 2019). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits for models trained as part of this work. It describes evaluation settings for benchmarks and tasks (e.g., 'five runs', 'maximum step limit'), but not data partitioning for model training, as it leverages pre-trained LMMs. |
| Hardware Specification | Yes | All software and games can be run on regular Windows 10 machines, except for RDR2, which is tested on a machine with an NVIDIA RTX-4090 GPU. |
| Software Dependencies | Yes | We employ GPT-4o (OpenAI, 2024b), currently one of the most capable LMMs, as the framework's backbone model. If not mentioned explicitly, all the experiments are done with gpt-4o-2024-05-13. [...] we use OpenAI's text-embedding-ada-002 model (OpenAI, 2022) to generate embeddings for each skill [...] To extract keyframes from the video observation, we utilize the VideoSubFinder tool. [...] we add a visual augmentation sub-module within our Information Gathering module. This augmentation step serves two main purposes: i) utilize Grounding DINO (Liu et al., 2023), an open-set object detector, to output precise bounding boxes of possible targets in an image and serve as spatial clues for GPT-4o; and ii) perform template matching (Brunelli, 2009) to provide icon recognition ground truth for GPT-4o when interpreting instructions or menus shown on screen. [...] we use the PyDirectInput library and PyAutoGUI for keyboard control, and utilize AHK and our own abstraction (using the ctypes library) to send low-level mouse commands to the operating system for mouse control. [...] Our experiments are based on the latest version of RDR2, Build 1491.50. [...] We use the latest version (1.6.8) of the game to conduct all the experiments. [Stardew Valley] [...] We run our experiments using the latest version, V. 1.013_W96, of the game. [Dealer's Life 2] [...] We use the latest version of the game (version 1.17.1-f4). [Cities: Skylines] [...] Table 11: Exact software versions utilized in the described experiments. Chrome 125.0.6422.142, Outlook 1.2024.529.200, CapCut 4.0.0, Meitu 7.5.6.1, Feishu 7.19.5. |
| Experiment Setup | Yes | If not specifically mentioned, all experiments are conducted in five runs under a maximum step limit, using OpenAI's model gpt-4o-2024-05-13 (OpenAI, 2024b). For each video game, we hired five human players, who had never played the corresponding game before, to do the evaluation. Before starting the experiments, they read the prompts used by CRADLE agents for a fair comparison. Every player played the task once. [...] Temperature is set to 0 to lower the variance of the text generation. [...] we set k to five [for short-term memory]. [...] For each task, the maximum number of steps is 100. [Stardew Valley] [...] A run is terminated when it reaches the maximum of 1000 steps or the budget is used up (less than 1000). [Cities: Skylines] [...] The maximum number of steps (the agent takes one action per step) for each task is 500. [RDR2] |
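The skill-generation pseudocode in the table (Algorithm 1) can be sketched in Python. This is a minimal illustration only: the toolbar model, `hover_and_describe`, and `generate_skill` are hypothetical stubs standing in for the real GUI automation and GPT-4o calls, which the excerpt does not specify.

```python
# Hedged sketch of Algorithm 1 (Skill Generation). The helpers below are
# stand-ins: in CRADLE, hovering reads a real tooltip and skill generation
# is a GPT-4o call with the object description and a skill template.

SKILL_TEMPLATE = "def {name}(): ..."  # placeholder skill template

def hover_and_describe(obj):
    # Stand-in for hovering the mouse over a toolbar object to read its
    # on-screen description.
    return f"description of {obj}"

def generate_skill(description, template):
    # Stand-in for prompting GPT-4o with the description and template.
    return f"skill for [{description}]"

def generate_skills(toolbar):
    """toolbar: mapping of first-level objects to their second-level objects."""
    procedure_memory = []                         # 1. initialize procedure memory
    for obj, sub_objects in toolbar.items():      # 2. first-level toolbar loop
        desc = hover_and_describe(obj)
        procedure_memory.append(generate_skill(desc, SKILL_TEMPLATE))
        # Executing the generated skill would open the second-level toolbar here.
        for sub in sub_objects:                   # second-level toolbar loop
            sub_desc = hover_and_describe(sub)
            procedure_memory.append(generate_skill(sub_desc, SKILL_TEMPLATE))
    return procedure_memory                       # return procedure memory

memory = generate_skills({"road": ["two-lane", "highway"], "zone": []})
```

With the toy toolbar above, the loop stores one skill per first-level object plus one per second-level object, mirroring the two nested loops in the pseudocode.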
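The dependencies row notes that each skill in procedure memory is embedded with text-embedding-ada-002. A plausible use of such embeddings is nearest-neighbor skill retrieval; the sketch below shows cosine-similarity ranking with tiny 3-dimensional toy vectors in place of real 1536-dimensional API embeddings (the skill names and vectors are invented for illustration).

```python
# Hedged sketch of embedding-based skill retrieval. Toy vectors stand in
# for text-embedding-ada-002 outputs; only the ranking logic is shown.
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_skills(query_embedding, skill_embeddings, k=2):
    """Return the names of the k stored skills closest to the query."""
    ranked = sorted(skill_embeddings.items(),
                    key=lambda kv: cosine_similarity(query_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

skills = {
    "open_map":   [1.0, 0.1, 0.0],
    "shoot":      [0.0, 1.0, 0.2],
    "ride_horse": [0.9, 0.2, 0.1],
}
top = retrieve_skills([1.0, 0.0, 0.0], skills, k=2)  # two nearest skills
```

Ranking by cosine similarity rather than raw dot product keeps retrieval insensitive to embedding magnitude, which matters when stored skills were embedded at different times.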
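The experiment-setup row states that k is set to five for short-term memory. One natural reading is a sliding window over the most recent k interaction records; a minimal sketch of such a structure, using `collections.deque` (this is an illustrative data structure, not the paper's actual memory module):

```python
# Minimal sketch of a length-k short-term memory (the table notes k = 5).
# A deque with maxlen silently discards the oldest entry once k items
# are stored, so the window always holds the most recent k records.
from collections import deque

class ShortTermMemory:
    def __init__(self, k=5):
        self.entries = deque(maxlen=k)

    def add(self, entry):
        # Appending beyond maxlen evicts the oldest entry automatically.
        self.entries.append(entry)

    def recent(self):
        # Oldest-to-newest view of the current window.
        return list(self.entries)

stm = ShortTermMemory(k=5)
for step in range(8):          # 8 steps recorded, only the last 5 kept
    stm.add(f"step-{step}")
```

A bounded window like this keeps the prompt context passed to the backbone LMM at a fixed size regardless of episode length.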