Aligned LLMs Are Not Aligned Browser Agents
Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean Hendryx, Summer Yue, Zifan Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we primarily focus on red-teaming browser agents, LLM-based agents that leverage information via web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench (Mazeika et al., 2024) and AIR-Bench 2024 (Zeng et al., 2024b)) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in the chat setting transfer effectively to browser agents. With human rewrites, GPT-4o- and o1-preview-based browser agents pursued 98 and 63 harmful behaviors (out of 100), respectively. Therefore, simply ensuring LLMs' refusal of harmful instructions in chat is not sufficient to ensure that the downstream agents are safe. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Scale AI 3Gray Swan AI |
| Pseudocode | No | The paper describes methods and processes in paragraph text, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety. Benchmark: Scale AI/Browser ART; Code: scaleapi/browser-art; Website: Browser ART. |
| Open Datasets | Yes | We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety. Benchmark: Scale AI/Browser ART; Code: scaleapi/browser-art; Website: Browser ART. |
| Dataset Splits | No | The paper uses a fixed test suite (Browser ART with 100 behaviors) for evaluation but does not describe any training, validation, or testing splits of this dataset for model development or evaluation, nor does it specify how the models under test were trained with regard to dataset splits. |
| Hardware Specification | No | The paper lists the large language models evaluated (e.g., GPT-4o, Llama-3.1) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct these evaluations or run their experiments. |
| Software Dependencies | Yes | For the backbone LLM, we evaluate the state-of-the-art LLMs with a long-context window, which include o1-preview, o1-mini, GPT-4-turbo (gpt-4-turbo-2024-04-09), GPT-4o (gpt-4o-2024-08-06), Opus-3 (claude-3-opus-20240229), Sonnet-3.5 (claude-3-5-sonnet-20240620), Llama-3.1 (405B non-quantized) and Gemini-1.5 (gemini-1.5-pro-001). |
| Experiment Setup | Yes | In red teaming, we only change the user prompt in the OpenHands agents and retain all default configurations (e.g., the agent's system prompt). We set the temperatures of LLMs to 0, turn off the safety filter of Gemini, and set the maximum steps for each agent to 10. |
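The reported setup (temperature 0, Gemini safety filter disabled, at most 10 agent steps per behavior) can be sketched as a small configuration-plus-loop harness. This is a minimal illustration, not the paper's actual code: `RedTeamConfig`, `run_behavior`, and the mock agent callable are all hypothetical names, and the real evaluation drives an OpenHands browser agent rather than a lambda.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RedTeamConfig:
    """Evaluation settings as reported in the paper (hypothetical wrapper)."""
    temperature: float = 0.0            # deterministic decoding
    max_agent_steps: int = 10           # cap on agent actions per behavior
    gemini_safety_filter: bool = False  # safety filter turned off for Gemini


def run_behavior(agent_step: Callable[[int], str],
                 config: RedTeamConfig) -> List[str]:
    """Drive an agent for at most `max_agent_steps` steps.

    `agent_step` maps a step index to an action string; the loop stops
    early if the agent emits "finish".
    """
    trace: List[str] = []
    for step in range(config.max_agent_steps):
        action = agent_step(step)
        trace.append(action)
        if action == "finish":
            break
    return trace


# Usage with a mock agent that finishes on its fourth step.
config = RedTeamConfig()
trace = run_behavior(lambda s: "finish" if s == 3 else f"act_{s}", config)
print(len(trace))  # 4
```

The step cap matters in practice: without it, a jailbroken agent could loop through web pages indefinitely, so each behavior's rollout is bounded at 10 actions regardless of whether the agent completes or refuses the task.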