Benchmarking Agentic Workflow Generation

Authors: Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
Researcher Affiliation | Collaboration | Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen (Zhejiang University; Alibaba Group; Zhejiang Key Laboratory of Big Data Intelligent Computing)
Pseudocode | No | The paper describes algorithms for evaluating workflow generation (WORFEVAL) using subsequence and subgraph matching in Section 2.4, but does not present these algorithms in a structured pseudocode block or clearly labeled algorithm figure.
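The subsequence-matching side of WORFEVAL can be illustrated with a longest-common-subsequence score. This is a hedged sketch, not the paper's exact formulation: it assumes workflows are represented as ordered lists of node labels and that the score is the LCS length normalized by the gold-chain length; the function names are illustrative only.

```python
def lcs_length(pred, gold):
    """Classic dynamic-programming longest common subsequence over node labels."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def subsequence_score(pred, gold):
    """Fraction of the gold workflow chain recovered in order (illustrative metric)."""
    return lcs_length(pred, gold) / len(gold) if gold else 0.0


# Example: 2 of 3 gold nodes recovered in order
score = subsequence_score(["search", "read", "answer"],
                          ["search", "summarize", "answer"])
```

Subgraph matching over workflow DAGs is harder (it is NP-hard in general) and is not sketched here.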
Open Source Code | Yes | Code and dataset are available at https://github.com/zjunlp/WorfBench.
Open Datasets | Yes | Code and dataset are available at https://github.com/zjunlp/WorfBench. We mainly collect various tasks q and the corresponding action lists A from existing well-known datasets. To facilitate a better understanding of the benchmark construction, we provide a detailed exposition of each dataset utilized in our paper in Appendix A.1.
Dataset Splits | Yes | Multi-faceted scenarios. We cover four complex scenarios for LLM agents, including problem-solving, function calling, embodied planning, and open-grounded planning. The dataset comprises 18k training samples, 2,146 test samples, and 723 held-out tasks to evaluate generalization.
Hardware Specification | Yes | All the experiments are conducted on 3 NVIDIA 80GB A100 GPUs.
Software Dependencies | Yes | all-mpnet-base-v2: https://huggingface.co/sentence-transformers/all-mpnet-base-v2. This model is also used as the retriever in the benchmark construction process.
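Combined with the threshold β = 0.6 reported in the experiment setup, semantic node matching presumably reduces to a cosine-similarity test over sentence embeddings. A minimal stdlib-only sketch, assuming embeddings would come from all-mpnet-base-v2 (model loading omitted here; the vectors below are placeholders and the function names are illustrative):

```python
import math

BETA = 0.6  # semantic matching threshold reported in the paper


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nodes_match(emb_pred, emb_gold, beta=BETA):
    """Two workflow nodes count as semantically matched if their
    embedding cosine similarity reaches the threshold beta."""
    return cosine_similarity(emb_pred, emb_gold) >= beta
```

In practice the embeddings would be produced by `SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode(...)`; only the thresholding logic is shown here.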
Experiment Setup | Yes | The hyperparameters used during decoding are all set to default values except for the temperature, which is 0.5. For all the models, the semantic matching threshold β is set to 0.6. Table 6: Detailed training hyperparameters used in our paper:
- cutoff len: 4,096
- epochs: 3
- batch size: 12
- batch size per device: 2
- gradient accumulation steps: 2
- learning rate: 1e-5
- lr scheduler type: cosine
- warmup ratio: 0.1
- bf16: true
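The reported hyperparameters are internally consistent: with 2 samples per device, 2 gradient-accumulation steps, and the 3 A100 GPUs listed under Hardware Specification, the effective batch size works out to the stated 12. A quick check, assuming the standard effective-batch-size arithmetic:

```python
per_device_batch_size = 2       # "batch size per device" from Table 6
gradient_accumulation_steps = 2  # from Table 6
num_gpus = 3                     # 3x NVIDIA 80GB A100, per the hardware section

effective_batch_size = (per_device_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 12, matching the reported batch size
```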