Large Action Models: From Inception to Implementation

Authors: Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, He Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation.
Researcher Affiliation | Collaboration | Microsoft; Peking University; Zhejiang University; Eindhoven University of Technology
Pseudocode | No | The paper describes methods in textual steps and provides prompt templates for LLM interactions in the appendix, but it does not contain structured pseudocode or algorithm blocks for its methodology.
Open Source Code | Yes | In the following sections, we use the Windows GUI agent UFO (Zhang et al., 2025a) as a case study to illustrate the process of building a robust LAM from the ground up. This LAM will serve as the core inference engine for UFO, enabling it to autonomously fulfill user requests within the Windows OS environment. While this example focuses on a Windows GUI agent, the outlined steps can be adapted for developing LAMs in other scenarios or for different applications. Code: https://github.com/microsoft/UFO
Open Datasets | No | A total of 76,672 task-plan pairs (t_i, P_i) are collected from various sources, including application help documentation, WikiHow, and historical search queries. Of these, 29,182 pairs are sourced directly, while 47,490 are generated via data evolution techniques (as described in Section 3.1.4), enriching the dataset with more complex and diverse tasks. The paper describes data collection from public sources like application documentation and WikiHow, but does not explicitly provide access to the curated/generated dataset.
Dataset Splits | Yes | We split these 2,192 trajectories into a training set of 1,757 and a test set of 435 trajectories, providing a total of 3,959 steps for training. By training LAM1 on these successful action sequences via imitation learning, we obtain LAM2.
Hardware Specification | Yes | Our LAM was deployed on a virtual machine (VM) configured as NC24s v3. The VM is equipped with 24 virtual cores (vCPUs), 448 GB of memory, and two NVIDIA Tesla V100 GPUs, each with 16 GB of memory, to support efficient inference. [...] Each VM is equipped with a 15-core Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 64GB of RAM, and runs Windows 11 Enterprise version 23H2.
Software Dependencies | Yes | Each VM is equipped with a 15-core Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 64GB of RAM, and runs Windows 11 Enterprise version 23H2. Microsoft applications, such as Word and Excel, are installed at version 2410.
Experiment Setup | No | While training objectives and methods such as SFT and PPO are described, specific hyperparameter values (learning rate, batch size, number of epochs) for LAM training are not explicitly provided in the main text; only top_p and temperature are given for the baseline-model evaluation.
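The reported dataset split (2,192 trajectories divided into 1,757 for training and 435 for testing, with trajectories flattened into per-step examples for imitation learning) can be sketched as follows. This is a minimal illustration, not the paper's code: the `split_trajectories` and `flatten_to_steps` helpers, the shuffling, and the seed are all assumptions; only the split sizes come from the paper.

```python
import random

def split_trajectories(trajectories, train_size=1757, seed=0):
    """Shuffle and split collected trajectories into train/test sets.

    The 1,757 / 435 split sizes mirror the paper's reported numbers;
    shuffling with a fixed seed is an assumption for reproducibility.
    """
    pool = list(trajectories)
    random.Random(seed).shuffle(pool)
    return pool[:train_size], pool[train_size:]

def flatten_to_steps(trajectories):
    """Flatten each trajectory (a list of (state, action) steps) into
    individual per-step imitation-learning examples."""
    return [step for traj in trajectories for step in traj]

# Toy usage with dummy two-step trajectories.
trajs = [[("state", f"action_{i}_{j}") for j in range(2)] for i in range(2192)]
train, test = split_trajectories(trajs)
print(len(train), len(test))  # 1757 435
```

In the paper the 1,757 training trajectories yield 3,959 steps in total, so real trajectories vary in length; the toy data above uses a fixed two steps per trajectory purely for illustration.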