SafeArena: Evaluating the Safety of Autonomous Web Agents

Authors: Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stanczak, Siva Reddy

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate these risks, we propose SAFEARENA, a benchmark focused on the deliberate misuse of web agents. SAFEARENA comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories: misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework, which categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively.
Researcher Affiliation Collaboration 1McGill University, 2Mila - Quebec AI Institute, 3Concordia University, 4Anthropic, 5ServiceNow Research, 6Canada CIFAR AI Chair.
Pseudocode No The paper does not contain any explicit sections labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.
Open Source Code No Our benchmark is available here: https://safearena.github.io. The paper does not explicitly state that the source code for the methodology described in this paper is released, nor does it provide a direct link to a code repository for their implementation.
Open Datasets Yes To evaluate these risks, we propose SAFEARENA, a benchmark focused on the deliberate misuse of web agents. SAFEARENA comprises 250 safe and 250 harmful tasks across four websites. [...] Our benchmark is available here: https://safearena.github.io
Dataset Splits No The paper describes the composition of its SAFEARENA benchmark as 250 safe and 250 harmful tasks, and how they were created (human-designed, human-in-the-loop). However, it does not specify traditional training, validation, or test splits for models, as the entire benchmark is used for evaluating existing LLM agents.
Hardware Specification No Claude and GPT models are accessed through their first-party APIs; Qwen-2-VL-72B is accessed through vLLM, an open-source library for LLM inference (Kwon et al., 2023); Llama-3.2-90B is accessed through Together's hosting service. The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments, as it relies on API services and hosted inference.
Software Dependencies No The paper mentions software such as vLLM and BrowserGym, but does not provide specific version numbers for these or any other software dependencies, which are required for a reproducible description of the environment.
Experiment Setup Yes For all models, we set the temperature to 0, HTML type to 'pruned HTML', maximum generated tokens to 1024, and maximum prompt tokens to 2048. We use the same hyperparameter settings across each model for generation through BrowserGym (Chezelles et al., 2024), which are described in Table 13. [Table 13 specifies: Maximum number of steps 30]
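The quoted generation settings can be collected into a single configuration, as in this minimal sketch. The dictionary keys and the `within_budget` helper are illustrative assumptions for checking an episode against the stated limits; they are not BrowserGym's actual API.

```python
# Hypothetical configuration mirroring the settings quoted above.
# Key names are assumptions, not BrowserGym parameter names.
GENERATION_CONFIG = {
    "temperature": 0,            # deterministic decoding
    "html_type": "pruned_html",  # pruned HTML observation
    "max_new_tokens": 1024,      # maximum generated tokens
    "max_prompt_tokens": 2048,   # maximum prompt tokens
    "max_steps": 30,             # per Table 13
}

def within_budget(step: int, prompt_tokens: int, config: dict) -> bool:
    """Return True if an episode step stays within the quoted limits."""
    return (step < config["max_steps"]
            and prompt_tokens <= config["max_prompt_tokens"])

print(within_budget(5, 1800, GENERATION_CONFIG))   # True: within both limits
print(within_budget(30, 1800, GENERATION_CONFIG))  # False: step cap reached
```

Fixing temperature to 0 and sharing one configuration across all four models, as the paper does, removes sampling variance so that differences in harmful-task completion rates can be attributed to the models rather than to decoding settings.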