ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Authors: Zhaorun Chen, Mintong Kang, Bo Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments demonstrating that SHIELDAGENT achieves SOTA performance on both SHIELDAGENT-BENCH and three existing benchmarks (i.e., ST-WebAgentBench (Levy et al., 2024), VWA-Adv (Wu et al., 2025), and AgentHarm (Andriushchenko et al.)). |
| Researcher Affiliation | Academia | 1University of Chicago, Chicago IL, USA 2University of Illinois at Urbana-Champaign, Champaign IL, USA. Correspondence to: Zhaorun Chen, Bo Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SHIELDAGENT Inference Procedure ... Algorithm 2 ASPM Structure Optimization ... Algorithm 3 ASPM Training Pipeline |
| Open Source Code | Yes | Our project is available and continuously maintained here: https://shieldagent-aiguard.github.io/ |
| Open Datasets | Yes | Therefore, we introduce SHIELDAGENT-BENCH, the first comprehensive agent guardrail benchmark comprising 2K safety-related pairs of agent instructions and trajectories across six web environments and seven risk categories. ... We evaluate SHIELDAGENT against guardrail baselines on our SHIELDAGENT-BENCH dataset and three existing benchmarks: (1) ST-WebAgentBench (Levy et al., 2024), which includes 234 safety-related web agent tasks with simple safety constraints; (2) VWA-Adv (Wu et al., 2025), consisting of 200 realistic adversarial tasks in the Visual Web Arena (Koh et al., 2024); and (3) AgentHarm (Andriushchenko et al.), comprising 110 malicious tasks designed for general agents. |
| Dataset Splits | No | The paper describes the composition and curation of the SHIELDAGENT-BENCH dataset, stating that it comprises "2K safety-related pairs of agent instructions and trajectories" and that each sample consists of (I_s, ζ_s, ζ_u^a, ζ_u^e), where I_s is the instruction, ζ_s is the safe trajectory, and ζ_u^a, ζ_u^e are unsafe trajectories induced by two types of attacks, respectively. It also mentions optimizing rule weights over a dataset D = {(ζ^(i), y^(i))}_{i=1}^N in Section 3.2.4. However, it does not specify how these data are partitioned into training, validation, or test sets (no percentages, sample counts, or references to predefined splits) for reproducing the model's development. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions several tools and models such as 'GPT-4o', 'InternVL2-2B', 'Stormpy', and 'OpenAI's text-embedding-3-large model', but it does not specify exact version numbers for these software components or the libraries required to reproduce the experimental setup. |
| Experiment Setup | No | The paper outlines the objective function for learning rule weights (Equation 6) and mentions 'Update θ using gradient descent' in Algorithm 3, but it does not specify concrete hyperparameter values such as learning rate, batch size, number of epochs, or specific optimizer settings used for the experiments. |