ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Authors: Zhaorun Chen, Mintong Kang, Bo Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments demonstrating that SHIELDAGENT achieves SOTA performance on both SHIELDAGENT-BENCH and three existing benchmarks (i.e., ST-WebAgentBench (Levy et al., 2024), VWA-Adv (Wu et al., 2025), and AgentHarm (Andriushchenko et al.)). |
| Researcher Affiliation | Academia | 1University of Chicago, Chicago IL, USA 2University of Illinois at Urbana-Champaign, Champaign IL, USA. Correspondence to: Zhaorun Chen, Bo Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SHIELDAGENT Inference Procedure ... Algorithm 2 ASPM Structure Optimization ... Algorithm 3 ASPM Training Pipeline |
| Open Source Code | Yes | Our project is available and continuously maintained here: https://shieldagent-aiguard.github.io/ |
| Open Datasets | Yes | Therefore, we introduce SHIELDAGENT-BENCH, the first comprehensive agent guardrail benchmark comprising 2K safety-related pairs of agent instructions and trajectories across six web environments and seven risk categories. ... We evaluate SHIELDAGENT against guardrail baselines on our SHIELDAGENT-BENCH dataset and three existing benchmarks: (1) ST-WebAgentBench (Levy et al., 2024), which includes 234 safety-related web agent tasks with simple safety constraints; (2) VWA-Adv (Wu et al., 2025), consisting of 200 realistic adversarial tasks in the Visual Web Arena (Koh et al., 2024); and (3) AgentHarm (Andriushchenko et al.), comprising 110 malicious tasks designed for general agents. |
| Dataset Splits | No | The paper describes the composition and curation of the SHIELDAGENT-BENCH dataset, stating that it comprises "2K safety-related pairs of agent instructions and trajectories" and that each sample consists of (I_s, ζ_s, ζ_u^a, ζ_u^e), where I_s is the instruction, ζ_s is the safe trajectory, and ζ_u^a, ζ_u^e are unsafe trajectories induced by two types of attacks, respectively. It also mentions optimizing rule weights over a dataset D = {(ζ^(i), y^(i))}_{i=1}^N in Section 3.2.4. However, it does not specify how these data are partitioned into training, validation, or test sets (no percentages, sample counts, or references to predefined splits) for reproducing the model's development. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions several tools and models such as 'GPT-4o', 'InternVL2-2B', 'Stormpy', and 'OpenAI's text-embedding-3-large model', but it does not specify exact version numbers for these software components or the libraries required to reproduce the experimental setup. |
| Experiment Setup | No | The paper outlines the objective function for learning rule weights (Equation 6) and mentions 'Update θ using gradient descent' in Algorithm 3, but it does not specify concrete hyperparameter values such as learning rate, batch size, number of epochs, or specific optimizer settings used for the experiments. |