SafeArena: Evaluating the Safety of Autonomous Web Agents

Authors: Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stanczak, Siva Reddy

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate these risks, we propose SAFEARENA, a benchmark focused on the deliberate misuse of web agents. SAFEARENA comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories: misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework, which categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively.
Researcher Affiliation Collaboration 1McGill University, 2Mila - Quebec AI Institute, 3Concordia University, 4Anthropic, 5ServiceNow Research, 6Canada CIFAR AI Chair.
Pseudocode No The paper does not contain any explicit sections labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.
Open Source Code No Our benchmark is available here: https://safearena.github.io. The paper does not explicitly state that the source code for the methodology described in this paper is released, nor does it provide a direct link to a code repository for their implementation.
Open Datasets Yes To evaluate these risks, we propose SAFEARENA, a benchmark focused on the deliberate misuse of web agents. SAFEARENA comprises 250 safe and 250 harmful tasks across four websites. [...] Our benchmark is available here: https://safearena.github.io
Dataset Splits No The paper describes the composition of its SAFEARENA benchmark as 250 safe and 250 harmful tasks, and how they were created (human-designed, human-in-the-loop). However, it does not specify traditional training, validation, or test splits for models, as the entire benchmark is used for evaluating existing LLM agents.
Hardware Specification No Claude and GPT models are accessed through their first-party APIs; Qwen-2-VL-72B is accessed through vLLM, an open-source library for LLM inference (Kwon et al., 2023); Llama-3.2-90B is accessed through Together's hosting service. The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments, as it relies on API services and hosted inference.
Software Dependencies No The paper mentions software such as vLLM and BrowserGym, but does not provide specific version numbers for these or any other software dependencies, which are required for a reproducible description of the environment.
Experiment Setup Yes For all models, we set the temperature to 0, HTML type to 'pruned HTML', maximum generated tokens to 1024, and maximum prompt tokens to 2048. We use the same hyperparameter settings across each model for generation through BrowserGym (Chezelles et al., 2024), which are described in Table 13. [Table 13 specifies: Maximum number of steps 30]
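The quoted generation settings can be collected into a single configuration, as in this minimal sketch. The dictionary keys and the `within_budget` helper are illustrative assumptions for checking an episode against the stated limits; they are not BrowserGym's actual API.

```python
# Hypothetical configuration mirroring the settings quoted above.
# Key names are assumptions, not BrowserGym parameter names.
GENERATION_CONFIG = {
    "temperature": 0,            # deterministic decoding
    "html_type": "pruned_html",  # pruned HTML observation
    "max_new_tokens": 1024,      # maximum generated tokens
    "max_prompt_tokens": 2048,   # maximum prompt tokens
    "max_steps": 30,             # per Table 13
}

def within_budget(step: int, prompt_tokens: int, config: dict) -> bool:
    """Return True if an episode step stays within the quoted limits."""
    return (step < config["max_steps"]
            and prompt_tokens <= config["max_prompt_tokens"])

print(within_budget(5, 1800, GENERATION_CONFIG))   # True: within both limits
print(within_budget(30, 1800, GENERATION_CONFIG))  # False: step cap reached
```

Fixing temperature to 0 and sharing one configuration across all four models, as the paper does, removes sampling variance so that differences in harmful-task completion rates can be attributed to the models rather than to decoding settings.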