Flow: Modularized Agentic Workflow Automation

Authors: Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results across diverse practical tasks demonstrate significant improvements in the efficiency of multi-agent frameworks through dynamic workflow refinement and modularization.
Researcher Affiliation | Academia | 1 Sydney AI Centre, The University of Sydney; 2 The University of Adelaide; 3 Carnegie Mellon University; 4 Mohamed bin Zayed University of Artificial Intelligence
Pseudocode | Yes | Appendix D.2 provides pseudocode for updating the AOV graph: Algorithm 1 (Helper Function for Updating Graph) and Algorithm 2 (Flow).
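To make the AOV (activity-on-vertex) workflow idea concrete: in such a graph, each vertex is a subtask and edges encode prerequisites, so any valid execution order is a topological sort. The sketch below is purely illustrative (the class name, task names, and methods are invented here, not the paper's implementation) and covers only adding subtasks and recomputing an execution order, not Flow's full dynamic refinement.

```python
from graphlib import TopologicalSorter

class AOVGraph:
    """Toy activity-on-vertex task graph: vertices are subtasks,
    edges are prerequisite relations."""

    def __init__(self):
        self.deps = {}  # task -> set of prerequisite tasks

    def add_task(self, task, prerequisites=()):
        # Register the task and its prerequisites; prerequisites
        # without entries of their own are added as dependency-free tasks.
        self.deps.setdefault(task, set()).update(prerequisites)
        for p in prerequisites:
            self.deps.setdefault(p, set())

    def execution_order(self):
        # Topological order: every task appears after all its prerequisites.
        return list(TopologicalSorter(self.deps).static_order())

# Hypothetical subtasks for a website-design run.
g = AOVGraph()
g.add_task("design_layout")
g.add_task("write_html", prerequisites=("design_layout",))
g.add_task("write_css", prerequisites=("design_layout",))
g.add_task("integrate", prerequisites=("write_html", "write_css"))
print(g.execution_order())
```

Because `write_html` and `write_css` share no edge, a scheduler may run them concurrently, which is the kind of parallelism a modularized workflow graph exposes.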
Open Source Code | Yes | The code is available at: https://github.com/tmllab/2025_ICLR_FLOW.
Open Datasets | No | The paper's experimental tasks ('website design', 'LaTeX Beamer writing', and 'gobang game development') involve agent-generated content. It does not use or provide access to predefined publicly available datasets in the traditional sense (e.g., benchmark datasets with specific links or citations).
Dataset Splits | No | The paper's experiments are generative tasks (website design, LaTeX Beamer writing, gobang game development) with a stated number of trials per experiment (e.g., 'We conducted five trials'). It does not involve traditional train/validation/test splits, since the focus is on agents generating outputs for specific tasks rather than training models on pre-divided data.
Hardware Specification | No | The paper notes that agents were 'empowered by GPT-4o-mini and GPT-3.5-Turbo (OpenAI, 2024)' and discusses the 'Time Cost of Different Baseline' for these models, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud computing resources) used to run the LLM agents or the overall framework.
Software Dependencies | No | The paper states that agents are 'empowered by GPT-4o-mini and GPT-3.5-Turbo (OpenAI, 2024)', which are specific LLM models, but it does not detail other software dependencies, such as programming language versions (e.g., Python 3.x), specific libraries (e.g., PyTorch, TensorFlow), or other frameworks with version numbers required to reproduce the experimental environment.
Experiment Setup | Yes | Three diverse tasks evaluate multi-agent collaboration frameworks: 1) website design, 2) LaTeX Beamer writing, and 3) gobang game development. Flow is compared to existing multi-agent frameworks: (1) AutoGen, (2) CAMEL, and (3) MetaGPT. Agents are empowered by GPT-4o-mini and GPT-3.5-Turbo (OpenAI, 2024). The paper includes a sample prompt for initialization (P_init) and a prompt for update (P_update). Five trials were conducted and success scores recorded.