Cradle: Empowering Foundation Agents towards General Computer Control
Authors: Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CRADLE exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Institute of Software, Chinese Academy of Sciences 3Beijing Academy of Artificial Intelligence 4Peking University 5Skywork AI 6The University of Hong Kong 7The Chinese University of Hong Kong, Shenzhen 8National University of Singapore. Correspondence to: Weihao Tan <EMAIL>, Bo An <EMAIL>, Shuicheng Yan <EMAIL>, Zongqing Lu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Skill Generation. Input: toolbar with objects, skill template. Output: procedure memory with generated skills. 1. Initialize procedure memory. 2. For each object in the toolbar: hover the mouse on the object to get its description; generate a skill using GPT-4o based on the object description and the skill template; store the generated skill in procedure memory; execute the generated skill to enter the second-level toolbar. 3. For each object in the second-level toolbar: hover the mouse on the object to get its description; generate a skill using GPT-4o based on the object description and the skill template; store the generated skill in procedure memory. 4. Return procedure memory. |
| Open Source Code | Yes | Video demos and code can be found at https://baai-agents.github.io/Cradle. |
| Open Datasets | Yes | Experimental results show that CRADLE exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. [...] Similar to SIMA (Raad et al., 2024), we apply human evaluation to all tasks across software and games, except for OSWorld (Xie et al., 2024), which provides automatic evaluation scripts. [...] MineRL: A large-scale dataset of Minecraft demonstrations (Guss et al., 2019). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits for models trained as part of this work. It describes evaluation settings for benchmarks and tasks (e.g., 'five runs', 'maximum step limit'), but not data partitioning for model training, as it leverages pre-trained LMMs. |
| Hardware Specification | Yes | All software and games can be run on regular Windows 10 machines, except for RDR2, which is tested on a machine with an NVIDIA RTX-4090 GPU. |
| Software Dependencies | Yes | We employ GPT-4o (OpenAI, 2024b), currently one of the most capable LMMs, as the framework's backbone model. If not mentioned explicitly, all the experiments are done with gpt-4o-2024-05-13. [...] we use OpenAI's text-embedding-ada-002 model (OpenAI, 2022) to generate embeddings for each skill [...] To extract keyframes from the video observation, we utilize the VideoSubFinder tool. [...] we add a visual augmentation sub-module within our Information Gathering module. This augmentation step serves two main purposes: i) utilize Grounding DINO (Liu et al., 2023), an open-set object detector, to output precise bounding boxes of possible targets in an image and serve as spatial clues for GPT-4o; and ii) perform template matching (Brunelli, 2009) to provide icon recognition ground truth for GPT-4o when interpreting instructions or menus shown on screen. [...] we use the PyDirectInput library and PyAutoGUI for keyboard control, and utilize AHK and our own abstraction (using the ctypes library) to send low-level mouse commands to the operating system for mouse control. [...] Our experiments are based on the latest version of RDR2, Build 1491.50. [...] We use the latest version (1.6.8) of the game to conduct all the experiments. [Stardew Valley] [...] We run our experiments using the latest version, V. 1.013_W96, of the game. [Dealer's Life 2] [...] We use the latest version of the game (version 1.17.1-f4). [Cities: Skylines] [...] Table 11: Exact software versions utilized in the described experiments. Chrome 125.0.6422.142, Outlook 1.2024.529.200, CapCut 4.0.0, Meitu 7.5.6.1, Feishu 7.19.5. |
| Experiment Setup | Yes | If not specifically mentioned, all experiments are conducted in five runs under a maximum step limit, using OpenAI's model gpt-4o-2024-05-13 (OpenAI, 2024b). For each video game, we hired five human players, who had never played the corresponding game before, to do the evaluation. Before starting the experiments, they read the prompts used by CRADLE agents for a fair comparison. Every player played the task once. [...] Temperature is set to 0 to lower the variance of the text generation. [...] we set k to five [for short-term memory]. [...] For each task, the maximum number of steps is 100. [Stardew Valley] [...] A run is terminated when it reaches the maximum of 1000 steps or the budget is used up (less than 1000). [Cities: Skylines] [...] The maximum number of steps (the agent takes one action per step) for each task is 500. [RDR2] |
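The skill-generation pseudocode in the table (Algorithm 1) can be sketched in Python. This is a minimal illustration only: the toolbar model, `hover_and_describe`, and `generate_skill` are hypothetical stubs standing in for the real GUI automation and GPT-4o calls, which the excerpt does not specify.

```python
# Hedged sketch of Algorithm 1 (Skill Generation). The helpers below are
# stand-ins: in CRADLE, hovering reads a real tooltip and skill generation
# is a GPT-4o call with the object description and a skill template.

SKILL_TEMPLATE = "def {name}(): ..."  # placeholder skill template

def hover_and_describe(obj):
    # Stand-in for hovering the mouse over a toolbar object to read its
    # on-screen description.
    return f"description of {obj}"

def generate_skill(description, template):
    # Stand-in for prompting GPT-4o with the description and template.
    return f"skill for [{description}]"

def generate_skills(toolbar):
    """toolbar: mapping of first-level objects to their second-level objects."""
    procedure_memory = []                         # 1. initialize procedure memory
    for obj, sub_objects in toolbar.items():      # 2. first-level toolbar loop
        desc = hover_and_describe(obj)
        procedure_memory.append(generate_skill(desc, SKILL_TEMPLATE))
        # Executing the generated skill would open the second-level toolbar here.
        for sub in sub_objects:                   # second-level toolbar loop
            sub_desc = hover_and_describe(sub)
            procedure_memory.append(generate_skill(sub_desc, SKILL_TEMPLATE))
    return procedure_memory                       # return procedure memory

memory = generate_skills({"road": ["two-lane", "highway"], "zone": []})
```

With the toy toolbar above, the loop stores one skill per first-level object plus one per second-level object, mirroring the two nested loops in the pseudocode.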
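The dependencies row notes that each skill in procedure memory is embedded with text-embedding-ada-002. A plausible use of such embeddings is nearest-neighbor skill retrieval; the sketch below shows cosine-similarity ranking with tiny 3-dimensional toy vectors in place of real 1536-dimensional API embeddings (the skill names and vectors are invented for illustration).

```python
# Hedged sketch of embedding-based skill retrieval. Toy vectors stand in
# for text-embedding-ada-002 outputs; only the ranking logic is shown.
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_skills(query_embedding, skill_embeddings, k=2):
    """Return the names of the k stored skills closest to the query."""
    ranked = sorted(skill_embeddings.items(),
                    key=lambda kv: cosine_similarity(query_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

skills = {
    "open_map":   [1.0, 0.1, 0.0],
    "shoot":      [0.0, 1.0, 0.2],
    "ride_horse": [0.9, 0.2, 0.1],
}
top = retrieve_skills([1.0, 0.0, 0.0], skills, k=2)  # two nearest skills
```

Ranking by cosine similarity rather than raw dot product keeps retrieval insensitive to embedding magnitude, which matters when stored skills were embedded at different times.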
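The experiment-setup row states that k is set to five for short-term memory. One natural reading is a sliding window over the most recent k interaction records; a minimal sketch of such a structure, using `collections.deque` (this is an illustrative data structure, not the paper's actual memory module):

```python
# Minimal sketch of a length-k short-term memory (the table notes k = 5).
# A deque with maxlen silently discards the oldest entry once k items
# are stored, so the window always holds the most recent k records.
from collections import deque

class ShortTermMemory:
    def __init__(self, k=5):
        self.entries = deque(maxlen=k)

    def add(self, entry):
        # Appending beyond maxlen evicts the oldest entry automatically.
        self.entries.append(entry)

    def recent(self):
        # Oldest-to-newest view of the current window.
        return list(self.entries)

stm = ShortTermMemory(k=5)
for step in range(8):          # 8 steps recorded, only the last 5 kept
    stm.add(f"step-{step}")
```

A bounded window like this keeps the prompt context passed to the backbone LMM at a fixed size regardless of episode length.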