Agent Workflow Memory

Authors: Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We experiment on two major web navigation benchmarks, Mind2Web and WebArena, that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, while reducing the number of steps taken to solve WebArena tasks successfully." |
| Researcher Affiliation | Academia | "¹Carnegie Mellon University, ²Massachusetts Institute of Technology." |
| Pseudocode | No | The paper describes the "AWM pipeline" and "Workflow Representation" with examples of steps, but it does not contain a dedicated, structured pseudocode block or algorithm section. |
| Open Source Code | Yes | https://github.com/zorazrw/agent-workflow-memory |
| Open Datasets | Yes | "We experiment on two major web navigation benchmarks, Mind2Web and WebArena... WebArena (Zhou et al., 2024) provides 812 web navigation tasks on five websites... Mind2Web (Deng et al., 2023) features web navigation in cross-task, website, and domain settings..." |
| Dataset Splits | Yes | "On WebArena, we create a cross-template subset where each example is instantiated from different task templates... As shown in Table 2, AWM still achieves the highest performance, overall and on each website split. We mark the number of examples for each website split under the name." The per-website example counts are 51, 45, 24, 45, and 32. For Mind2Web, the paper refers to the "cross-task, cross-website, and cross-domain generalization test" and a "training set". |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running its experiments. |
| Software Dependencies | Yes | "Following the baseline approaches, we use GPT-4o (gpt-4o-2024-05-13) with a temperature of 0.0 to ensure mostly stable model outputs." |
| Experiment Setup | Yes | "Following the baseline approaches, we use GPT-4o (gpt-4o-2024-05-13) with a temperature of 0.0 to ensure mostly stable model outputs. We use the same model for neural workflow induction and agent action generation." |
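Note that the headline gains in the table are *relative* success-rate improvements over the baseline, not absolute point gains. A minimal sketch of that computation, using purely hypothetical scores (the 30.0 and 45.0 below are illustrative, not numbers from the paper):

```python
def relative_gain(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Hypothetical: a baseline at 30.0% success and an agent at 45.0% success
# is a 50.0% relative gain, even though the absolute gain is only 15 points.
print(round(relative_gain(30.0, 45.0), 1))
```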
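The experiment setup pins a specific model snapshot and a temperature of 0.0 for both workflow induction and action generation. A minimal sketch of how such a request configuration might be expressed (the `make_request` helper and the message content are illustrative assumptions, not code from the paper or its repository):

```python
def make_request(messages: list[dict]) -> dict:
    """Build a chat-completion request body matching the reported setup:
    one fixed GPT-4o snapshot at temperature 0.0, reused for both neural
    workflow induction and agent action generation."""
    return {
        "model": "gpt-4o-2024-05-13",  # exact snapshot named in the paper
        "temperature": 0.0,            # deterministic-leaning decoding for stable outputs
        "messages": messages,
    }

req = make_request([{"role": "user", "content": "Example observation and task."}])
```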