Benchmarking Agentic Workflow Generation

Authors: Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
Researcher Affiliation | Collaboration | Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen (Zhejiang University; Alibaba Group; Zhejiang Key Laboratory of Big Data Intelligent Computing)
Pseudocode | No | The paper describes algorithms for evaluating workflow generation (WORFEVAL) using subsequence and subgraph matching in Section 2.4, but does not present these algorithms in a structured pseudocode block or clearly labeled algorithm figure.
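The subsequence-matching side of WORFEVAL can be illustrated with a longest-common-subsequence score. This is a hedged sketch, not the paper's exact formulation: it assumes workflows are represented as ordered lists of node labels and that the score is the LCS length normalized by the gold-chain length; the function names are illustrative only.

```python
def lcs_length(pred, gold):
    """Classic dynamic-programming longest common subsequence over node labels."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def subsequence_score(pred, gold):
    """Fraction of the gold workflow chain recovered in order (illustrative metric)."""
    return lcs_length(pred, gold) / len(gold) if gold else 0.0


# Example: 2 of 3 gold nodes recovered in order
score = subsequence_score(["search", "read", "answer"],
                          ["search", "summarize", "answer"])
```

Subgraph matching over workflow DAGs is harder (it is NP-hard in general) and is not sketched here.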
Open Source Code | Yes | Code and dataset are available at https://github.com/zjunlp/WorfBench.
Open Datasets | Yes | Code and dataset are available at https://github.com/zjunlp/WorfBench. We mainly collect various tasks q and the corresponding action lists A from existing well-known datasets. To facilitate a better understanding of the benchmark construction, we provide a detailed exposition of each dataset utilized in our paper in Appendix A.1.
Dataset Splits | Yes | Multi-faceted scenarios. We cover four complex scenarios for LLM agents, including problem-solving, function calling, embodied planning, and open-grounded planning. The dataset comprises 18k training samples, 2,146 test samples, and 723 held-out tasks to evaluate generalization.
Hardware Specification | Yes | All the experiments are conducted on 3 NVIDIA 80GB A100 GPUs.
Software Dependencies | Yes | all-mpnet-base-v2: https://huggingface.co/sentence-transformers/all-mpnet-base-v2. This model is also used as the retriever in the benchmark construction process.
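Combined with the threshold β = 0.6 reported in the experiment setup, semantic node matching presumably reduces to a cosine-similarity test over sentence embeddings. A minimal stdlib-only sketch, assuming embeddings would come from all-mpnet-base-v2 (model loading omitted here; the vectors below are placeholders and the function names are illustrative):

```python
import math

BETA = 0.6  # semantic matching threshold reported in the paper


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nodes_match(emb_pred, emb_gold, beta=BETA):
    """Two workflow nodes count as semantically matched if their
    embedding cosine similarity reaches the threshold beta."""
    return cosine_similarity(emb_pred, emb_gold) >= beta
```

In practice the embeddings would be produced by `SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode(...)`; only the thresholding logic is shown here.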
Experiment Setup | Yes | The hyperparameters used during decoding are all set to default values except for the temperature, which is 0.5. For all the models, the semantic matching threshold β is set to 0.6. Table 6: Detailed training hyperparameters used in our paper:
- cutoff len: 4,096
- epochs: 3
- batch size: 12
- batch size per device: 2
- gradient accumulation steps: 2
- learning rate: 1e-5
- lr scheduler type: cosine
- warmup ratio: 0.1
- bf16: true
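The reported hyperparameters are internally consistent: with 2 samples per device, 2 gradient-accumulation steps, and the 3 A100 GPUs listed under Hardware Specification, the effective batch size works out to the stated 12. A quick check, assuming the standard effective-batch-size arithmetic:

```python
per_device_batch_size = 2       # "batch size per device" from Table 6
gradient_accumulation_steps = 2  # from Table 6
num_gpus = 3                     # 3x NVIDIA 80GB A100, per the hardware section

effective_batch_size = (per_device_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 12, matching the reported batch size
```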