AFlow: Automating Agentic Workflow Generation
Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations across six benchmark datasets demonstrate AFLOW's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. |
| Researcher Affiliation | Collaboration | Jiayi Zhang1,2, Jinyu Xiang1, Zhaoyang Yu3, Fengwei Teng3, Xiong-Hui Chen4, Jiaqi Chen5, Mingchen Zhuge6, Xin Cheng3, Sirui Hong1, Jinlin Wang1, Bingnan Zheng5, Bang Liu7, Yuyu Luo2,8, Chenglin Wu1 — 1DeepWisdom, 2The Hong Kong University of Science and Technology (Guangzhou), 3Renmin University of China, 4Nanjing University, 5Fudan University, 6King Abdullah University of Science and Technology, 7Université de Montréal & Mila, 8The Hong Kong University of Science and Technology |
| Pseudocode | Yes | A.6 MCTS ALGORITHM OF AFLOW. Algorithm 1 Algorithm of AFLOW: Detailed implementation |
| Open Source Code | Yes | The code is available at https://github.com/FoundationAgents/AFlow. |
| Open Datasets | Yes | Datasets: We utilized six public benchmarks for our experiments. Following established practices (Saad-Falcon et al., 2024; Hu et al., 2024) in workflow optimization, we divide the data into validation and test sets using a 1:4 ratio. Specifically, we use the full datasets for GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and MBPP (Austin et al., 2021). For HotpotQA (Yang et al., 2018) and DROP (Dua et al., 2019), we randomly select 1,000 samples each, in line with (Hu et al., 2024; Shinn et al., 2023). For the MATH (Hendrycks et al., 2021) dataset, we follow (Hong et al., 2024a) in selecting 617 problems from four typical problem types (Combinatorics & Probability, Number Theory, Pre-algebra, Pre-calculus) at difficulty level 5. |
| Dataset Splits | Yes | Prior to initiating the search process, we randomly partition the dataset into a validation set (20%) and a test set (80%), with the random seed fixed at 42. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions using LLMs via APIs but not the underlying hardware. |
| Software Dependencies | No | The paper mentions specific LLM models with versions (e.g., Claude-3.5-sonnet (Anthropic, 2024), DeepSeek V2.5 (DeepSeek, 2024), GPT-4o-mini-0718 (OpenAI, 2024b), Claude-3.5-sonnet-0620 (Anthropic, 2024), GPT-4o-0513 (OpenAI, 2024a)). However, it does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers/frameworks that are not the LLMs themselves. |
| Experiment Setup | Yes | Implementation Details: AFLOW utilizes different models for optimization and execution. We employ Claude-3.5-sonnet (Anthropic, 2024) as the optimizer and use the following models as executors: DeepSeek V2.5 (DeepSeek, 2024), GPT-4o-mini-0718 (OpenAI, 2024b), Claude-3.5-sonnet-0620 (Anthropic, 2024), and GPT-4o-0513 (OpenAI, 2024a). All models are accessed via APIs. We set the temperature to 1 for DeepSeek-V2.5 and to 0 for the other models. We set iteration rounds to 20 for AFLOW. For ADAS, we use Claude-3.5-sonnet as the optimizer and GPT-4o-mini as the executor, with the iteration rounds set to 30. |
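The dataset-split procedure quoted above (20% validation / 80% test, random seed fixed at 42) can be sketched as follows. This is a minimal illustration, not the AFlow implementation: the function name, the use of Python's `random.Random`, and the exact shuffle-then-slice scheme are assumptions.

```python
import random

def split_dataset(samples, val_ratio=0.2, seed=42):
    """Randomly partition samples into (validation, test) lists.

    Mirrors the split described in the paper: a fixed seed (42) makes the
    partition reproducible, and val_ratio=0.2 gives the stated 1:4 ratio.
    This helper is a hypothetical sketch, not code from the AFlow repo.
    """
    rng = random.Random(seed)          # seeded RNG for reproducibility
    shuffled = samples[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[:n_val], shuffled[n_val:]

# Example: 1,000 HotpotQA-sized sample IDs -> 200 validation, 800 test
val, test = split_dataset(list(range(1000)))
print(len(val), len(test))
```

Because the seed is fixed, every run of this split yields the same partition, which is what makes the reported validation/test division reproducible.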