Agent Workflow Memory

Authors: Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We experiment on two major web navigation benchmarks, Mind2Web and WebArena, that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, while reducing the number of steps taken to solve WebArena tasks successfully." |
| Researcher Affiliation | Academia | "¹Carnegie Mellon University, ²Massachusetts Institute of Technology." |
| Pseudocode | No | The paper describes the "AWM pipeline" and "Workflow Representation" with examples of steps, but it does not contain a dedicated, structured pseudocode block or algorithm section. |
| Open Source Code | Yes | https://github.com/zorazrw/agent-workflow-memory |
| Open Datasets | Yes | "We experiment on two major web navigation benchmarks, Mind2Web and WebArena... WebArena (Zhou et al., 2024) provides 812 web navigation tasks on five websites... Mind2Web (Deng et al., 2023) features web navigation in cross-task, website, and domain settings..." |
| Dataset Splits | Yes | "On WebArena, we create a cross-template subset where each example is instantiated from different task templates... As shown in Table 2, AWM still achieves the highest performance, overall and on each website split. We mark the number of examples for each website split under the name." The per-website example counts are 51, 45, 24, 45, and 32. For Mind2Web, the paper refers to the "cross-task, cross-website, and cross-domain generalization test" and a "training set". |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running its experiments. |
| Software Dependencies | Yes | "Following the baseline approaches, we use GPT-4o (gpt-4o-2024-05-13) with a temperature of 0.0 to ensure mostly stable model outputs." |
| Experiment Setup | Yes | "Following the baseline approaches, we use GPT-4o (gpt-4o-2024-05-13) with a temperature of 0.0 to ensure mostly stable model outputs. We use the same model for neural workflow induction and agent action generation." |
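Note that the headline gains in the table are *relative* success-rate improvements over the baseline, not absolute point gains. A minimal sketch of that computation, using purely hypothetical scores (the 30.0 and 45.0 below are illustrative, not numbers from the paper):

```python
def relative_gain(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Hypothetical: a baseline at 30.0% success and an agent at 45.0% success
# is a 50.0% relative gain, even though the absolute gain is only 15 points.
print(round(relative_gain(30.0, 45.0), 1))
```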
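The experiment setup pins a specific model snapshot and a temperature of 0.0 for both workflow induction and action generation. A minimal sketch of how such a request configuration might be expressed (the `make_request` helper and the message content are illustrative assumptions, not code from the paper or its repository):

```python
def make_request(messages: list[dict]) -> dict:
    """Build a chat-completion request body matching the reported setup:
    one fixed GPT-4o snapshot at temperature 0.0, reused for both neural
    workflow induction and agent action generation."""
    return {
        "model": "gpt-4o-2024-05-13",  # exact snapshot named in the paper
        "temperature": 0.0,            # deterministic-leaning decoding for stable outputs
        "messages": messages,
    }

req = make_request([{"role": "user", "content": "Example observation and task."}])
```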