Large Action Models: From Inception to Implementation
Authors: Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, He Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. |
| Researcher Affiliation | Collaboration | ¹Microsoft ²Peking University ³Zhejiang University ⁴Eindhoven University of Technology |
| Pseudocode | No | The paper describes methods in textual steps and provides prompt templates for LLM interactions in the appendix, but it does not contain structured pseudocode or algorithm blocks for its methodology. |
| Open Source Code | Yes | In the following sections, we use the Windows GUI agent UFO (Zhang et al., 2025a)1 as a case study to illustrate the process of building a robust LAM from the ground up. This LAM will serve as the core inference engine for UFO, enabling it to autonomously fulfill user requests within the Windows OS environment. While this example focuses on a Windows GUI agent, the outlined steps can be adapted for developing LAMs in other scenarios or for different applications. 1https://github.com/microsoft/UFO |
| Open Datasets | No | A total of 76,672 task-plan pairs (t_i, P_i) are collected from various sources, including application help documentation, WikiHow, and historical search queries. Of these, 29,182 pairs are sourced directly, while 47,490 are generated via data evolution techniques (as described in Section 3.1.4), enriching the dataset with more complex and diverse tasks. The paper describes data collection from public sources like application documentation and WikiHow, but does not explicitly provide access to their curated/generated dataset. |
| Dataset Splits | Yes | We split these 2,192 trajectories into a training set of 1,757 and a test set of 435 trajectories, providing a total of 3,959 steps for training. By imitation learning LAM1 on these successful action sequences, we obtain LAM2. |
| Hardware Specification | Yes | Our LAM was deployed on a virtual machine (VM) configured as NC24s v3. The VM is equipped with 24 virtual cores (vCPUs), 448 GB of memory, and two NVIDIA Tesla V100 GPUs, each with 16 GB of memory, to support efficient inference. [...] Each VM is equipped with a 15-core Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 64GB of RAM, and runs Windows 11 Enterprise version 23H2. |
| Software Dependencies | Yes | Each VM is equipped with a 15-core Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 64GB of RAM, and runs Windows 11 Enterprise version 23H2. Microsoft applications, such as Word and Excel, are installed on version 2410. |
| Experiment Setup | No | While training objectives and methods like SFT and PPO are described, specific hyperparameter values such as learning rates, batch sizes, or number of epochs for the LAM training are not explicitly provided in the main text. Only top_p and temperature are mentioned for baseline model evaluation. |
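As a quick consistency check, the dataset composition quoted under "Open Datasets" can be verified arithmetically: the directly sourced and evolution-generated pair counts should sum to the reported total.

```python
# Counts quoted from the paper's data-collection description.
directly_sourced = 29_182  # pairs from help docs, WikiHow, search queries
evolved = 47_490           # pairs generated via data-evolution techniques

total = directly_sourced + evolved
print(total)  # 76672, matching the reported 76,672 task-plan pairs
```

The figures are internally consistent, which supports the reviewer's reading that the discrepancy is about dataset *access*, not dataset *accounting*.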
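The trajectory split quoted under "Dataset Splits" can be sketched as a simple random partition. This is a hypothetical reconstruction: the paper reports the 1,757 / 435 counts but not how the partition was drawn, so the `trajectories` placeholder list and the fixed seed below are assumptions for illustration only.

```python
import random

# Stand-ins for the 2,192 collected trajectories (identifiers only).
trajectories = list(range(2_192))

rng = random.Random(0)  # assumed seed; the paper specifies none
rng.shuffle(trajectories)

# Reported split sizes: 1,757 for training, the remaining 435 for testing.
train, test = trajectories[:1_757], trajectories[1_757:]
assert len(train) == 1_757 and len(test) == 435
```

Note that the sizes are consistent (1,757 + 435 = 2,192), so the quoted split covers every collected trajectory with no overlap.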