WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

Authors: Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, Maosong Sun

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that WorkflowLlama demonstrates a strong capacity to orchestrate complex workflows, while also achieving notable generalization performance on previously unseen APIs. Additionally, WorkflowBench exhibits robust zero-shot generalization capabilities on an out-of-distribution task planning dataset, T-Eval.
Researcher Affiliation Academia 1 Renmin University of China, 2 Tsinghua University, 3 The University of Manchester, 4 Wuhan University
Pseudocode Yes Appendix A ALGORITHM OF TRANSCRIBING SHORTCUTS Algorithm 1: Recursive Parsing of Property List to Construct Abstract Syntax Tree
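The cited Algorithm 1 recursively transcribes an Apple Shortcuts property list into an abstract syntax tree. A minimal sketch of that kind of recursion is below; it is not the paper's implementation, and the node kinds ("dict", "list", "leaf") and the example plist keys are illustrative assumptions based on how deserialized .plist files nest dictionaries and arrays.

```python
# Hedged sketch: recursively parse a deserialized property list
# (nested dicts/lists/scalars) into a simple abstract syntax tree.
from dataclasses import dataclass, field


@dataclass
class ASTNode:
    kind: str                       # "dict", "list", or "leaf" (illustrative)
    key: object = None              # key under the parent dict, if any
    value: object = None            # payload for leaf nodes
    children: list = field(default_factory=list)


def parse_plist(obj, key=None):
    """Recursively build an AST node from a plist value."""
    if isinstance(obj, dict):
        node = ASTNode("dict", key)
        node.children = [parse_plist(v, k) for k, v in obj.items()]
        return node
    if isinstance(obj, list):
        node = ASTNode("list", key)
        node.children = [parse_plist(v) for v in obj]
        return node
    # Scalar leaf: string, number, boolean, etc.
    return ASTNode("leaf", key, obj)


# Toy example shaped like a (hypothetical) Shortcuts plist fragment.
shortcut = {
    "WFWorkflowActions": [
        {"WFWorkflowActionIdentifier": "is.workflow.actions.gettext"},
    ],
}
tree = parse_plist(shortcut)
```

The recursion terminates because each call descends one level of nesting, so the tree depth matches the plist's nesting depth.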
Open Source Code Yes Our data and code are available at https://github.com/OpenBMB/WorkflowLLM.
Open Datasets Yes It first constructs a large-scale fine-tuning dataset WorkflowBench with 106,763 samples... Our data and code are available at https://github.com/OpenBMB/WorkflowLLM. Additionally, WorkflowBench exhibits robust zero-shot generalization capabilities on an out-of-distribution task planning dataset, T-Eval.
Dataset Splits Yes Table 1: Detailed statistics of WorkflowBench. Seed. refers to the collected data from Shortcuts; Train. and Test. refer to the training set and the test set of WorkflowBench respectively. Num. of Instances: Seed. 14,771 | Train. 105,573 | Test. 1,190
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU models, or cloud instance types used for running experiments. It mentions fine-tuning models but no hardware specs.
Software Dependencies No The paper mentions fine-tuning on LLaMA-3.1-8B and using the AdamW optimizer and a linear learning rate scheduler, but it does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We fine-tune the annotator and Workflow Llama on LLaMA-3.1-8B (Dubey et al., 2024) for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019). A linear learning rate scheduler is used with a peak learning rate of 2 x 10^-5 and a warm-up ratio of 0.1. Each minibatch contains 32 examples, and the maximum sequence length is set as 8,192 tokens.