Aligned LLMs Are Not Aligned Browser Agents
Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean Hendryx, Summer Yue, Zifan Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we primarily focus on red-teaming browser agents, LLM-based agents that leverage information via web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench (Mazeika et al., 2024) and AIR-Bench 2024 (Zeng et al., 2024b)) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in the chat setting transfer effectively to browser agents. With human rewrites, GPT-4o- and o1-preview-based browser agents pursued 98 and 63 harmful behaviors (out of 100), respectively. Therefore, simply ensuring LLMs' refusal of harmful instructions in chat is not sufficient to ensure that the downstream agents are safe. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Scale AI 3Gray Swan AI |
| Pseudocode | No | The paper describes methods and processes in paragraph text, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety. Benchmark: Scale AI/Browser ART; Code: scaleapi/browser-art; Website: Browser ART. |
| Open Datasets | Yes | We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety. Benchmark: Scale AI/Browser ART; Code: scaleapi/browser-art; Website: Browser ART. |
| Dataset Splits | No | The paper uses a fixed test suite (Browser ART with 100 behaviors) for evaluation but does not describe any training, validation, or testing splits of this dataset for model development or evaluation, nor does it specify how the models under test were trained with regard to dataset splits. |
| Hardware Specification | No | The paper lists the large language models evaluated (e.g., GPT-4o, Llama-3.1) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct these evaluations or run their experiments. |
| Software Dependencies | Yes | For the backbone LLM, we evaluate the state-of-the-art LLMs with a long-context window, which include o1-preview, o1-mini, GPT-4-turbo (gpt-4-turbo-2024-04-09), GPT-4o (gpt-4o-2024-08-06), Opus-3 (claude-3-opus-20240229), Sonnet-3.5 (claude-3-5-sonnet-20240620), Llama-3.1 (405B non-quantized) and Gemini-1.5 (gemini-1.5-pro-001). |
| Experiment Setup | Yes | In red teaming, we only change the user prompt in the OpenHands agents and retain all default configurations (e.g., the agent's system prompt). We set the temperatures of LLMs to 0, turn off the safety filter of Gemini, and set the maximum steps for each agent to 10. |
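The reported setup (temperature 0, Gemini safety filter disabled, at most 10 agent steps per behavior) can be sketched as a small configuration-plus-loop harness. This is a minimal illustration, not the paper's actual code: `RedTeamConfig`, `run_behavior`, and the mock agent callable are all hypothetical names, and the real evaluation drives an OpenHands browser agent rather than a lambda.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RedTeamConfig:
    """Evaluation settings as reported in the paper (hypothetical wrapper)."""
    temperature: float = 0.0            # deterministic decoding
    max_agent_steps: int = 10           # cap on agent actions per behavior
    gemini_safety_filter: bool = False  # safety filter turned off for Gemini


def run_behavior(agent_step: Callable[[int], str],
                 config: RedTeamConfig) -> List[str]:
    """Drive an agent for at most `max_agent_steps` steps.

    `agent_step` maps a step index to an action string; the loop stops
    early if the agent emits "finish".
    """
    trace: List[str] = []
    for step in range(config.max_agent_steps):
        action = agent_step(step)
        trace.append(action)
        if action == "finish":
            break
    return trace


# Usage with a mock agent that finishes on its fourth step.
config = RedTeamConfig()
trace = run_behavior(lambda s: "finish" if s == 3 else f"act_{s}", config)
print(len(trace))  # 4
```

The step cap matters in practice: without it, a jailbroken agent could loop through web pages indefinitely, so each behavior's rollout is bounded at 10 actions regardless of whether the agent completes or refuses the task.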