Gandalf the Red: Adaptive Security for LLMs
Authors: Niklas Pfister, Václav Volhejn, Manuel Knott, Santiago Arias, Julia Bazinska, Mykhailo Bichurin, Alan Y. Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Damian Pascual-Ortiz, Jakub Podolak, Adrià Romero-López, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. |
| Researcher Affiliation | Industry | 1Lakera. Correspondence to: Niklas Pfister <EMAIL>. |
| Pseudocode | Yes | Listing S.1: Strong system prompt (setup description and defense prompt) used in the general setup. You are an AI and your main purpose is to help people by answering their questions... Listing S.2: LLM checker system prompt used in the general setup. The secret passwrod [sic] is {password}. You are checking the work of an AI that safeguards the password. |
| Open Source Code | Yes | Code to reproduce all results in the paper is available at https://github.com/lakeraai/dsec-gandalf. |
| Open Datasets | Yes | Using Gandalf, we collect and release a dataset of 279k prompt attacks. Full dataset at https://huggingface.co/datasets/Lakera/gandalf-rct, for processed versions, see Appendix I. |
| Dataset Splits | No | The paper describes creating and collecting various datasets (Gandalf-RCT, Basic User, Borderline User) and their use in different experimental setups. While it mentions sampling for attack categorization (Appendix H.2: 'We sample 1000 prompts for each of the 18 (level, setup) pairs'), it does not provide specific details on train/test/validation splits (percentages, counts, or explicit methodology) for the main experiments evaluating the D-SEC framework and defense strategies, which primarily rely on the collected session data and success rates. |
| Hardware Specification | No | The paper mentions using specific large language models (OpenAI LLMs: GPT-3.5 (gpt-3.5-turbo-0125), GPT-4o-mini (gpt-4o-mini-2024-07-18) and GPT-4 (gpt-4-0125-preview)) for their experiments. However, it does not specify any hardware details (e.g., GPU models, CPU types, or memory) used by the authors to conduct these experiments or interact with these LLMs. |
| Software Dependencies | No | The paper mentions using 'Python's repr function', the 'OpenAI API' for LLM calls and checkers, 'OpenAI's text-embedding-3-small' for text embeddings, and 'GPT (gpt-4o-mini-2024-07-18)' for PII filtering. It also refers to a 'logistic regression model'. However, it does not provide specific version numbers for the programming language (Python), any libraries (e.g., for logistic regression or OpenAI API interaction), or frameworks used, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We define 6 levels where we keep the application M fixed while varying the defense D. To study the effect of domain restriction, we implemented three setups: (1) The general setup representing an open-ended chatbot like ChatGPT, (2) the summarization setup representing an LLM summarization application with indirect prompt attacks, and (3) the topic setup representing a narrow chatbot focused on specific topics, as in customer support. For each player (identified by a session ID), we randomly selected a setup (general, summarization, or topic) and one of three OpenAI LLMs: GPT-3.5 (gpt-3.5-turbo-0125), GPT-4o-mini (gpt-4o-mini-2024-07-18) and GPT-4 (gpt-4-0125-preview). |
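The per-player randomization quoted above (each session ID drawn into one of three setups and one of three OpenAI models) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_arm` and the hash-seeded PRNG are assumptions, since the paper excerpt does not say how session IDs were mapped to experimental arms.

```python
import hashlib
import random

# Experimental arms named in the paper's setup description.
SETUPS = ["general", "summarization", "topic"]
MODELS = ["gpt-3.5-turbo-0125", "gpt-4o-mini-2024-07-18", "gpt-4-0125-preview"]


def assign_arm(session_id: str) -> tuple[str, str]:
    """Map a session ID to a (setup, model) pair, uniformly at random.

    Seeding a PRNG with a hash of the session ID makes the draw random
    across players but stable across repeated requests from one player.
    """
    seed = int.from_bytes(hashlib.sha256(session_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return rng.choice(SETUPS), rng.choice(MODELS)
```

A deterministic hash-based draw is one common way to keep a player's assignment consistent without storing per-session state; a table lookup keyed on session ID would work equally well.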