Dissecting Adversarial Robustness of Multimodal LM Agents

Authors: Chen Wu, Rishi Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena... To systematically examine the robustness of agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. With imperceptible perturbations to a single image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates of up to 67%. We also use ARE to rigorously evaluate how robustness changes as new components are added.
Researcher Affiliation | Academia | Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan; Carnegie Mellon University; EMAIL
Pseudocode | No | The paper describes methodologies and frameworks but does not contain any explicitly labeled pseudocode or algorithm blocks; the methods are described in prose.
Open Source Code | Yes | Our data and code for attacks, defenses, and evaluation are at github.com/ChenWu98/agent-attack.
Open Datasets | Yes | We develop VWA-Adv, a set of targeted adversarial tasks simulating realistic adversarial attacks from web-based environments. The tasks will be open-sourced for future work on agent robustness. ... We release all adversarial tasks, evaluations, and our code for the trigger injection interface.
Dataset Splits | No | The paper describes the curation of 200 adversarial tasks for the VWA-Adv dataset and how benign tasks are selected for evaluation based on GPT-4V's performance. However, it does not specify explicit training, validation, or test splits that would allow direct reproduction of the data partitioning.
Hardware Specification | Yes | Our gradient-based attacks and captioner were run on an A6000 or A100 80G.
Software Dependencies | Yes | The LMs we used to build the multimodal agents are: GPT-4V: gpt-4-vision-preview; Gemini-1.5-Pro: gemini-1.5-pro-preview-0409; Claude-3-Opus: claude-3-opus-20240229; GPT-4o: gpt-4o-2024-05-13. To reduce randomness, we decode from each LM with temperature 0.
Experiment Setup | Yes | We set the maximum number of attempts to 2, as it suffices to show our main findings. We decode from each LM with temperature 0. ... In particular, we focus on the tree search agent from Koh et al. (2024b), with a branching factor of 3 and depth of 1.
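The software-dependency and experiment-setup rows above pin down a small set of reproducibility-relevant settings (model identifiers, temperature 0, tree-search branching factor 3 and depth 1, at most 2 attack attempts). A minimal sketch of how those settings could be collected into a config is shown below; the `AgentConfig` dataclass and its field names are illustrative assumptions, not the authors' actual API, which lives in the agent-attack repository.

```python
# Sketch of the reported settings, under an assumed AgentConfig layout.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    model: str                  # API model identifier
    temperature: float = 0.0    # decode at temperature 0 to reduce randomness
    branching_factor: int = 3   # tree search agent (Koh et al., 2024b)
    search_depth: int = 1
    max_attempts: int = 2       # maximum number of attack attempts per task

# Black-box frontier LMs the paper reports using to build the agents
MODELS = {
    "GPT-4V": "gpt-4-vision-preview",
    "Gemini-1.5-Pro": "gemini-1.5-pro-preview-0409",
    "Claude-3-Opus": "claude-3-opus-20240229",
    "GPT-4o": "gpt-4o-2024-05-13",
}

configs = [AgentConfig(model=m) for m in MODELS.values()]
```

Keeping the settings in one structure like this makes it easy to verify that every evaluated agent shares the same decoding and search parameters, which is what the report's "Experiment Setup: Yes" judgment rests on.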