GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning

Authors: Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that GuardAgent effectively moderates the violation actions for different types of agents on these two benchmarks with over 98% and 83% guardrail accuracies, respectively. [...] 5. Experiments. Overview of results. In Sec. 5.3, we show the effectiveness of GuardAgent in safeguarding EHRAgent on EICU-AC and SeeAct on Mind2Web-SC, compared with several strong baseline approaches. In Sec. 5.4, we conduct the following ablation studies: 1) We present a breakdown of results for the roles in EICU-AC and the rules in Mind2Web-SC, showing that GuardAgent performs consistently well across most roles and rules, enabling it to manage complex guard requests effectively. 2) We assess the significance of long-term memory by varying the number of demonstrations provided to GuardAgent. 3) We show the importance of the toolbox of GuardAgent by observing a performance decline when critical tools (i.e., functions) are removed.
Researcher Affiliation | Collaboration | 1University of Georgia, 2University of Chicago, 3UIUC, 4University of Texas at Austin, 5University of California, Berkeley, 6Emory University, 7Virtue AI.
Pseudocode | No | The paper describes 'action plan' and 'guardrail code generation' as outputs of GuardAgent (e.g., in Figure 2 and Figure 21). While these outputs are structured and code-like, they are specific examples generated by the system, not a formal pseudocode or algorithm block describing GuardAgent's methodology itself. There are no sections explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | Project page: https://guardagent.github.io/ The paper provides a 'Project page' URL, but it does not explicitly state that the source code for the methodology is available at this link or in supplementary materials, nor does it provide a direct link to a code repository. A 'Project page' can be a high-level overview or demonstration page rather than a source code host.
Open Datasets | Yes | EHRAgent... evaluated and shown decent performance on several benchmarks, including an EICU dataset containing questions regarding the clinical care of ICU patients (see Fig. 12 in App. C for an example) and 10 relevant databases (Pollard et al., 2018). [...] SeeAct is demonstrated successful on the Mind2Web benchmark containing over 2,000 complex web tasks spanning 137 websites across 31 domains (e.g., car rental, shopping, entertainment, etc.) (Deng et al., 2023). [...] We consider a commonly used Q&A dataset, CSQA (Talmor et al., 2019), which consists of multiple-choice questions for common-sense reasoning.
Dataset Splits | No | Ultimately, EICU-AC contains 52, 57, and 45 examples labeled 0 for physician, nursing, and general administration, respectively, and 46, 55, and 61 examples labeled 1 for the three roles, respectively. Among these 316 examples, there are 226 unique questions spanning 51 ICU information categories, underscoring the diversity of EICU-AC. [...] The created Mind2Web-SC contains 100 examples per class, with only unique tasks within each class. [...] We sample 80 questions from the original dataset, with 39 questions not violating any rules in the safety guard requests and 41 questions violating at least one rule. Among these 41 questions with rule violations, 18 are labeled low risk, 22 are labeled medium risk, and 1 is labeled high risk. For all the questions in the test, the answer produced by GPT-4 is correct, so the test mainly focuses on the quality of the guardrail.
Hardware Specification | No | The paper mentions using LLMs such as GPT-4, Llama3-70B, Llama3.1-70B, and Llama3.3-70B as core LLMs. However, it does not specify the underlying hardware (e.g., specific GPU models, CPU types, or memory amounts) used to run these models or the experiments.
Software Dependencies | No | We use Python as the default code execution engine, with two initial functions, CheckAccess and CheckRules, in the toolbox (see App. F). The paper mentions 'Python' as the default code execution engine but does not specify a version number. Other software components, such as specific libraries with version numbers, are not mentioned.
Experiment Setup | Yes | In the main experiments, we set the number of demonstrations to k = 1 and k = 3 for EICU-AC and Mind2Web-SC, respectively. [...] We evaluate GuardAgent with GPT-4, Llama3-70B, Llama3.1-70B, and Llama3.3-70B (with temperature zero) as the core LLMs, respectively. [...] Finally, we allow three debugging iterations, though in most cases the guardrail code generated by GuardAgent is directly executable.
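The Software Dependencies row names two initial toolbox functions, CheckAccess and CheckRules, whose implementations appear in the paper's App. F but are not quoted here. As a rough illustration of what such callable guardrail tools might look like, below is a minimal hypothetical sketch; the signatures, argument names, and data formats are assumptions for illustration, not the paper's actual code:

```python
# Hypothetical sketch of the two initial toolbox functions named in the
# paper (actual implementations are in App. F of the paper). All signatures
# and data formats below are illustrative assumptions.

def CheckAccess(role, requested_categories, access_table):
    """Access-control guardrail (EICU-AC style): return True only if every
    requested information category is permitted for the given role."""
    allowed = access_table.get(role, set())
    return all(cat in allowed for cat in requested_categories)

def CheckRules(user_info, rules):
    """Safety-rule guardrail (Mind2Web-SC style): return the names of all
    rules the user violates. Each rule is a (name, predicate) pair, where
    predicate(user_info) is True when the rule is violated."""
    return [name for name, violated in rules if violated(user_info)]
```

Under this sketch, an agent action would be allowed only when CheckAccess returns True and CheckRules returns an empty violation list.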
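The Experiment Setup row mentions temperature-zero decoding and up to three debugging iterations for the generated guardrail code. The following is a minimal sketch of such a generate-execute-debug loop, assuming a generic `llm_call(prompt)` text-generation callable; the function names and retry-prompt format are hypothetical, not the paper's implementation:

```python
# Hypothetical sketch of a generate-execute-debug loop with at most three
# debugging iterations, as described in the experiment setup. llm_call is
# assumed to be any deterministic (temperature-zero) text-generation
# callable; the retry-prompt format is an illustrative assumption.

def try_execute(code):
    """Attempt to run candidate guardrail code; return (ok, error_message)."""
    try:
        exec(compile(code, "<guardrail>", "exec"), {})
        return True, ""
    except Exception as err:
        return False, repr(err)

def generate_guardrail(llm_call, prompt, max_debug_iters=3):
    """Generate guardrail code, re-prompting with the execution error
    for up to max_debug_iters debugging rounds."""
    code = llm_call(prompt)
    for attempt in range(max_debug_iters + 1):  # initial try + debug rounds
        ok, error = try_execute(code)
        if ok:
            return code
        if attempt < max_debug_iters:
            code = llm_call(prompt + "\n# Previous attempt failed: " + error)
    raise RuntimeError("guardrail code still not executable after debugging")
```

This matches the setup's observation that the loop usually terminates early: when the first generated program executes, no debugging round is used.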