AdvAgent: Controllable Blackbox Red-teaming on Web Agents
Authors: Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms. We conduct real-world attacks against a SOTA web agent on 440 tasks in 4 different domains. We compare our proposed AdvAgent algorithm with three baselines. |
| Researcher Affiliation | Academia | 1University of Illinois Urbana-Champaign 2University of Chicago 3The Ohio State University. Correspondence to: Chejian Xu <EMAIL>, Bo Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 LLM-based Attack Prompter |
| Open Source Code | Yes | We release our code at https://ai-secure.github.io/AdvAgent/. |
| Open Datasets | Yes | Our experiments utilize the Mind2Web dataset (Deng et al., 2024), which consists of real-world website data for evaluating generalist web agents. |
| Dataset Splits | Yes | We focus on tasks that involve critical events with potentially severe consequences, selecting a subset of 440 tasks across 4 different domains, which is further divided into 240 training tasks and 200 testing tasks. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running experiments are provided in the paper. It only mentions the backend models used (GPT-4V and Gemini 1.5). |
| Software Dependencies | Yes | For the LLM-based attack prompter, we leverage GPT-4 as the backend and generate 10 adversarial prompts per task with a temperature of 1.0 to ensure diversity. We initialize our generative adversarial prompter model from Mistral-7B-Instruct-v0.2 (Jiang et al., 2023). |
| Experiment Setup | Yes | During SFT in the first training stage, we set a learning rate of 1e-4 and a batch size of 32. For DPO in the second training stage, the learning rate is maintained at 1e-4, but the batch size is reduced to 16. |
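The quoted setup details can be collected into a small configuration sketch for a reproduction attempt. This is only an illustrative summary of the numbers reported above (SFT: lr 1e-4, batch 32; DPO: lr 1e-4, batch 16; GPT-4 prompter with 10 prompts per task at temperature 1.0; 440 tasks split 240/200); the dictionary layout and key names are assumptions, not the authors' code:

```python
# Hedged sketch: hyperparameters and data splits quoted from the paper,
# gathered into plain dicts. Key names here are illustrative assumptions.
SFT_CONFIG = {"learning_rate": 1e-4, "batch_size": 32}   # first training stage
DPO_CONFIG = {"learning_rate": 1e-4, "batch_size": 16}   # second training stage

PROMPTER_CONFIG = {
    "backend": "GPT-4",       # LLM-based attack prompter backend
    "prompts_per_task": 10,   # adversarial prompts generated per task
    "temperature": 1.0,       # high temperature to encourage diversity
}

# 440 tasks across 4 domains, split into training and testing subsets.
TASK_SPLIT = {"train": 240, "test": 200}


def summarize(stage: str, cfg: dict) -> str:
    """Format one training stage's hyperparameters for logging."""
    return f"{stage}: lr={cfg['learning_rate']}, batch={cfg['batch_size']}"
```

A quick consistency check on the reported split: 240 training tasks plus 200 testing tasks does account for all 440 tasks.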