Dissecting Adversarial Robustness of Multimodal LM Agents
Authors: Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena... To systematically examine the robustness of agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. With imperceptible perturbations to a single image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates up to 67%. We also use ARE to rigorously evaluate how the robustness changes as new components are added. |
| Researcher Affiliation | Academia | Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper describes methodologies and frameworks but does not contain any explicitly labeled pseudocode or algorithm blocks. The methods are described in prose. |
| Open Source Code | Yes | Our data and code for attacks, defenses, and evaluation are at github.com/ChenWu98/agent-attack. |
| Open Datasets | Yes | We develop VWA-Adv, a set of targeted adversarial tasks simulating realistic adversarial attacks from web-based environments. The tasks will be open-sourced for future work on agent robustness. ... We release all adversarial tasks, evaluations, and our code for the trigger injection interface. |
| Dataset Splits | No | The paper describes the curation of 200 adversarial tasks for the VWA-Adv dataset and explains how benign tasks are selected for evaluation based on GPT-4V's performance. However, it does not define explicit training, validation, or test splits, so the data partitioning cannot be directly reproduced. |
| Hardware Specification | Yes | Our gradient-based attacks and captioner were run on an A6000 or A100 80G. |
| Software Dependencies | Yes | The LMs we used to build the multimodal agents are: GPT-4V: gpt-4-vision-preview, Gemini1.5-Pro: gemini-1.5-pro-preview-0409, Claude-3-Opus: claude-3-opus-20240229, GPT-4o: gpt-4o-2024-05-13. To reduce randomness, we decode from each LM with temperature 0. |
| Experiment Setup | Yes | We set the maximum number of attempts to 2, as it suffices to show our main findings. We decode from each LM with temperature 0. ... In particular, we focus on the tree search agent from Koh et al. (2024b), with a branching factor of 3 and depth of 1. |
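The pinned model versions and decoding settings quoted in the Software Dependencies and Experiment Setup rows can be summarized in a small configuration sketch. This is an illustrative reconstruction, not the authors' released code (`build_request_kwargs` is a hypothetical helper; the actual implementation lives in the linked repository):

```python
# Pinned LM identifiers quoted from the paper's Software Dependencies row.
MODEL_IDS = {
    "GPT-4V": "gpt-4-vision-preview",
    "Gemini-1.5-Pro": "gemini-1.5-pro-preview-0409",
    "Claude-3-Opus": "claude-3-opus-20240229",
    "GPT-4o": "gpt-4o-2024-05-13",
}

# Experiment setup reported in the paper.
MAX_ATTACK_ATTEMPTS = 2                         # maximum attack attempts per task
TREE_SEARCH = {"branching_factor": 3, "depth": 1}  # tree search agent (Koh et al., 2024b)


def build_request_kwargs(agent_lm: str) -> dict:
    """Return provider-agnostic request settings for one agent LM.

    Hypothetical helper: maps the paper's LM name to its pinned API
    identifier and fixes temperature 0 to reduce decoding randomness.
    """
    return {
        "model": MODEL_IDS[agent_lm],
        "temperature": 0,
    }
```

Pinning exact dated model snapshots and decoding at temperature 0 is what makes results against black-box frontier LMs comparable across runs.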