Agent Workflow Memory
Authors: Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on two major web navigation benchmarks, Mind2Web and WebArena, that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, while reducing the number of steps taken to solve WebArena tasks successfully. |
| Researcher Affiliation | Academia | ¹Carnegie Mellon University; ²Massachusetts Institute of Technology. |
| Pseudocode | No | The paper describes the 'AWM pipeline' and 'Workflow Representation' with examples of steps, but it does not contain a dedicated, structured pseudocode block or algorithm section. |
| Open Source Code | Yes | https://github.com/zorazrw/agent-workflow-memory |
| Open Datasets | Yes | We experiment on two major web navigation benchmarks, Mind2Web and WebArena, that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. WebArena (Zhou et al., 2024) provides 812 web navigation tasks on five websites... Mind2Web (Deng et al., 2023) features web navigation in cross-task, website, and domain settings... |
| Dataset Splits | Yes | On WebArena, we create a cross-template subset where each example is instantiated from different task templates... As shown in Table 2, AWM still achieves the highest performance, overall and on each website split. The number of examples per website split is marked under each website's name: 51, 45, 24, 45, and 32. For Mind2Web, the paper refers to 'cross-task, cross-website, and cross-domain generalization test' and 'training set'. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running its experiments. |
| Software Dependencies | Yes | Following the baseline approaches, we use GPT-4o (gpt-4o-2024-05-13) with a temperature of 0.0 to ensure mostly stable model outputs. |
| Experiment Setup | Yes | Following the baseline approaches, we use GPT-4o (gpt-4o-2024-05-13) with a temperature of 0.0 to ensure mostly stable model outputs. We use the same model for neural workflow induction and agent action generation. |
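The setup reported in the last two rows, a pinned GPT-4o snapshot with temperature 0.0 shared across workflow induction and action generation, can be captured in a short sketch. The request-body shape follows the standard chat-completions format; the prompt content below is a hypothetical placeholder, not taken from the paper.

```python
import json

# Generation settings reported in the paper: a pinned GPT-4o snapshot
# with temperature 0.0 for mostly stable model outputs. The same
# settings are reused for workflow induction and action generation.
GEN_CONFIG = {
    "model": "gpt-4o-2024-05-13",
    "temperature": 0.0,
}

def build_request(messages: list[dict]) -> dict:
    """Assemble a chat-completion request body with the fixed settings."""
    return {**GEN_CONFIG, "messages": messages}

# Hypothetical prompt, only to illustrate the shared configuration.
request = build_request(
    [{"role": "user", "content": "Navigate to the orders page and list recent orders."}]
)
print(json.dumps(request, indent=2))
```

Pinning the exact model snapshot rather than a floating alias (e.g. `gpt-4o`) is what makes the reported numbers reproducible against later model updates.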