Benchmarking Agentic Workflow Generation
Authors: Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. |
| Researcher Affiliation | Collaboration | Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen (Zhejiang University; Alibaba Group; Zhejiang Key Laboratory of Big Data Intelligent Computing) |
| Pseudocode | No | The paper describes algorithms for evaluating workflow generation (WORFEVAL) using subsequence and subgraph matching in Section 2.4, but does not present these algorithms in a structured pseudocode block or clearly labeled algorithm figure. |
| Open Source Code | Yes | Code and dataset are available at https://github.com/zjunlp/WorfBench. |
| Open Datasets | Yes | Code and dataset are available at https://github.com/zjunlp/WorfBench. We mainly collect various tasks q and the corresponding action lists A from existing well-known datasets. To facilitate a better understanding of the benchmark construction, we provide a detailed exposition of each dataset utilized in our paper in Appendix A.1. |
| Dataset Splits | Yes | Multi-faceted scenarios. We cover four complex scenarios for LLM agents, including problem-solving, function calling, embodied planning, and open-grounded planning. The dataset comprises 18k training samples, 2146 test samples, and 723 held-out tasks to evaluate generalization. |
| Hardware Specification | Yes | All the experiments are conducted on 3 NVIDIA 80GB A100 GPUs. |
| Software Dependencies | Yes | all-mpnet-base-v2: https://huggingface.co/sentence-transformers/all-mpnet-base-v2. This model is also used as the retriever in the benchmark construction process. |
| Experiment Setup | Yes | The hyperparameters used during decoding are all set to default values except for the temperature, which is 0.5. For all the models, the semantic matching threshold β is set to 0.6. Table 6 (detailed training hyperparameters): cutoff length 4,096; epochs 3; batch size 12; batch size per device 2; gradient accumulation steps 2; learning rate 1e-5; lr scheduler type cosine; warmup ratio 0.1; bf16 true. |
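The Pseudocode row notes that WORFEVAL's subsequence and subgraph matching are described only in prose (Section 2.4). As a purely illustrative sketch of the chain-structured (subsequence) case, not the authors' implementation, one could score a predicted node sequence against the gold sequence via longest common subsequence with a soft node matcher. The paper matches nodes by embedding similarity (all-mpnet-base-v2) with threshold β = 0.6; the character-level similarity stand-in and the normalization by gold length below are assumptions made to keep the sketch self-contained.

```python
from difflib import SequenceMatcher


def node_match(pred: str, gold: str, beta: float = 0.6) -> bool:
    # Stand-in semantic matcher. The paper uses sentence-embedding
    # similarity (all-mpnet-base-v2) with threshold beta = 0.6; here a
    # character-level ratio approximates it so the sketch has no
    # external dependencies.
    return SequenceMatcher(None, pred.lower(), gold.lower()).ratio() >= beta


def lcs_subsequence_score(pred: list, gold: list, beta: float = 0.6) -> float:
    # Longest common subsequence between predicted and gold workflow
    # node lists under the soft matcher, normalized by the gold length
    # (normalization choice is an assumption, not from the paper).
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if node_match(pred[i - 1], gold[j - 1], beta):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / n if n else 0.0
```

For example, a prediction that hits two of three gold nodes in order scores 2/3; the graph-structured (subgraph) case would additionally require the matched nodes to respect the gold workflow's edge dependencies.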