AFlow: Automating Agentic Workflow Generation
Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations across six benchmark datasets demonstrate AFLOW's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. |
| Researcher Affiliation | Collaboration | Jiayi Zhang1,2, Jinyu Xiang1, Zhaoyang Yu3, Fengwei Teng3, Xiong-Hui Chen4, Jiaqi Chen5, Mingchen Zhuge6, Xin Cheng3, Sirui Hong1, Jinlin Wang1, Bingnan Zheng5, Bang Liu7, Yuyu Luo2,8, Chenglin Wu1 — 1DeepWisdom, 2The Hong Kong University of Science and Technology (Guangzhou), 3Renmin University of China, 4Nanjing University, 5Fudan University, 6King Abdullah University of Science and Technology, 7Université de Montréal & Mila, 8The Hong Kong University of Science and Technology |
| Pseudocode | Yes | A.6 MCTS ALGORITHM OF AFLOW. Algorithm 1 Algorithm of AFLOW: Detailed implementation |
| Open Source Code | Yes | The code is available at https://github.com/FoundationAgents/AFlow. |
| Open Datasets | Yes | Datasets: We utilized six public benchmarks for our experiments. Following established practices (Saad-Falcon et al., 2024; Hu et al., 2024) in workflow optimization, we divide the data into validation and test sets using a 1:4 ratio. Specifically, we use the full datasets for GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and MBPP (Austin et al., 2021). For HotpotQA (Yang et al., 2018) and DROP (Dua et al., 2019), we randomly select 1,000 samples each, in line with (Hu et al., 2024; Shinn et al., 2023). For the MATH (Hendrycks et al., 2021) dataset, we follow (Hong et al., 2024a) in selecting 617 problems from four typical problem types (Combinatorics & Probability, Number Theory, Pre-algebra, Pre-calculus) at difficulty level 5. |
| Dataset Splits | Yes | Prior to initiating the search process, we randomly partition the dataset into a validation set (20%) and a test set (80%), with the random seed fixed at 42. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions using LLMs via APIs but not the underlying hardware. |
| Software Dependencies | No | The paper mentions specific LLM models with versions (e.g., Claude-3.5-sonnet (Anthropic, 2024), DeepSeek V2.5 (DeepSeek, 2024), GPT-4o-mini-0718 (OpenAI, 2024b), Claude-3.5-sonnet-0620 (Anthropic, 2024), GPT-4o-0513 (OpenAI, 2024a)). However, it does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers/frameworks that are not the LLMs themselves. |
| Experiment Setup | Yes | Implementation Details: AFLOW utilizes different models for optimization and execution. We employ Claude-3.5-sonnet (Anthropic, 2024) as the optimizer and use the following models as executors: DeepSeek V2.5 (DeepSeek, 2024), GPT-4o-mini-0718 (OpenAI, 2024b), Claude-3.5-sonnet-0620 (Anthropic, 2024), and GPT-4o-0513 (OpenAI, 2024a). All models are accessed via APIs. We set the temperature to 1 for DeepSeek-V2.5 and to 0 for the other models. We set iteration rounds to 20 for AFLOW. For ADAS, we use Claude-3.5-sonnet as the optimizer and GPT-4o-mini as the executor, with the iteration rounds set to 30. |
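The dataset-split procedure quoted above (20% validation / 80% test, random seed fixed at 42) can be sketched as follows. This is a minimal illustration, not the AFlow implementation: the function name, the use of Python's `random.Random`, and the exact shuffle-then-slice scheme are assumptions.

```python
import random

def split_dataset(samples, val_ratio=0.2, seed=42):
    """Randomly partition samples into (validation, test) lists.

    Mirrors the split described in the paper: a fixed seed (42) makes the
    partition reproducible, and val_ratio=0.2 gives the stated 1:4 ratio.
    This helper is a hypothetical sketch, not code from the AFlow repo.
    """
    rng = random.Random(seed)          # seeded RNG for reproducibility
    shuffled = samples[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[:n_val], shuffled[n_val:]

# Example: 1,000 HotpotQA-sized sample IDs -> 200 validation, 800 test
val, test = split_dataset(list(range(1000)))
print(len(val), len(test))
```

Because the seed is fixed, every run of this split yields the same partition, which is what makes the reported validation/test division reproducible.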