WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

Authors: Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, Maosong Sun

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that WorkflowLlama demonstrates a strong capacity to orchestrate complex workflows, while also achieving notable generalization performance on previously unseen APIs. Additionally, WorkflowBench exhibits robust zero-shot generalization capabilities on an out-of-distribution task planning dataset, T-Eval.
Researcher Affiliation Academia 1 Renmin University of China, 2 Tsinghua University, 3 The University of Manchester, 4 Wuhan University
Pseudocode Yes Appendix A ALGORITHM OF TRANSCRIBING SHORTCUTS Algorithm 1: Recursive Parsing of Property List to Construct Abstract Syntax Tree
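The cited Algorithm 1 recursively transcribes an Apple Shortcuts property list into an abstract syntax tree. A minimal sketch of that kind of recursion is below; it is not the paper's implementation, and the node kinds ("dict", "list", "leaf") and the example plist keys are illustrative assumptions based on how deserialized .plist files nest dictionaries and arrays.

```python
# Hedged sketch: recursively parse a deserialized property list
# (nested dicts/lists/scalars) into a simple abstract syntax tree.
from dataclasses import dataclass, field


@dataclass
class ASTNode:
    kind: str                       # "dict", "list", or "leaf" (illustrative)
    key: object = None              # key under the parent dict, if any
    value: object = None            # payload for leaf nodes
    children: list = field(default_factory=list)


def parse_plist(obj, key=None):
    """Recursively build an AST node from a plist value."""
    if isinstance(obj, dict):
        node = ASTNode("dict", key)
        node.children = [parse_plist(v, k) for k, v in obj.items()]
        return node
    if isinstance(obj, list):
        node = ASTNode("list", key)
        node.children = [parse_plist(v) for v in obj]
        return node
    # Scalar leaf: string, number, boolean, etc.
    return ASTNode("leaf", key, obj)


# Toy example shaped like a (hypothetical) Shortcuts plist fragment.
shortcut = {
    "WFWorkflowActions": [
        {"WFWorkflowActionIdentifier": "is.workflow.actions.gettext"},
    ],
}
tree = parse_plist(shortcut)
```

The recursion terminates because each call descends one level of nesting, so the tree depth matches the plist's nesting depth.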
Open Source Code Yes Our data and code are available at https://github.com/OpenBMB/WorkflowLLM.
Open Datasets Yes It first constructs a large-scale fine-tuning dataset WorkflowBench with 106,763 samples... Our data and code are available at https://github.com/OpenBMB/WorkflowLLM. Additionally, WorkflowBench exhibits robust zero-shot generalization capabilities on an out-of-distribution task planning dataset, T-Eval.
Dataset Splits Yes Table 1: Detailed statistics of WorkflowBench. Seed. refers to the collected data from Shortcuts; Train. and Test. refer to the training set and the test set of WorkflowBench respectively. Num. of Instances: Seed. 14,771 | Train. 105,573 | Test. 1,190
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU models, or cloud instance types used for running experiments. It mentions fine-tuning models but no hardware specs.
Software Dependencies No The paper mentions fine-tuning on LLaMA-3.1-8B and using the AdamW optimizer and a linear learning rate scheduler, but it does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We fine-tune the annotator and Workflow Llama on LLaMA-3.1-8B (Dubey et al., 2024) for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019). A linear learning rate scheduler is used with a peak learning rate of 2 x 10^-5 and a warm-up ratio of 0.1. Each minibatch contains 32 examples, and the maximum sequence length is set as 8,192 tokens.