Gandalf the Red: Adaptive Security for LLMs

Authors: Niklas Pfister, Václav Volhejn, Manuel Knott, Santiago Arias, Julia Bazinska, Mykhailo Bichurin, Alan Y. Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Damian Pascual-Ortiz, Jakub Podolak, Adrià Romero-López, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests.
Researcher Affiliation Industry 1Lakera. Correspondence to: Niklas Pfister <EMAIL>.
Pseudocode Yes Listing S.1: Strong system prompt (setup description and defense prompt) used in the the general setup. You are an AI and your main purpose is to help people by answering their questions... Listing S.2: LLM checker system prompt used in the general setup. The secret passwrod [sic] is {password}. You are checking the work of an AI that safeguards the password.
Open Source Code Yes Code to reproduce all results in the paper is available at https://github.com/lakeraai/dsec-gandalf.
Open Datasets Yes Using Gandalf, we collect and release a dataset of 279k prompt attacks. Full dataset at https://huggingface.co/datas ets/Lakera/gandalf-rct, for processed versions, see Appendix I.
Dataset Splits No The paper describes creating and collecting various datasets (Gandalf-RCT, Basic User, Borderline User) and their use in different experimental setups. While it mentions sampling for attack categorization (Appendix H.2: 'We sample 1000 prompts for each of the 18 (level, setup) pairs'), it does not provide specific details on train/test/validation splits (percentages, counts, or explicit methodology) for the main experiments evaluating the D-SEC framework and defense strategies, which primarily rely on the collected session data and success rates.
Hardware Specification No The paper mentions using specific large language models (Open AI LLMs: GPT-3.5 (gpt-3.5-turbo-0125), GPT-4o-mini (gpt-4o-mini-2024-07-18) and GPT4 (gpt-4-0125-preview)) for their experiments. However, it does not specify any hardware details (e.g., GPU models, CPU types, or memory) used by the authors to conduct these experiments or interact with these LLMs.
Software Dependencies No The paper mentions using 'Python s repr function', 'Open AI API' for LLM calls and checkers, 'Open AI s text-embedding-3-small' for text embeddings, and 'GPT (gpt-4o-mini-2024-07-18)' for PII filtering. It also refers to a 'logistic regression model'. However, it does not provide specific version numbers for the programming language (Python), any libraries (e.g., for logistic regression or OpenAI API interaction), or frameworks used, which are necessary for full reproducibility.
Experiment Setup Yes We define 6 levels where we keep the application M fixed while varying the defense D. To study the effect of domain restriction, we implemented three setups: (1) The general setup representing an open-ended chatbot like Chat GPT, (2) the summarization setup representing an LLM summarization application with indirect prompt attacks, and (3) the topic setup representing a narrow chatbot focused on specific topics, as in customer support. For each player (identified by a session ID), we randomly selected a setup (general, summarization, or topic) and one of three Open AI LLMs: GPT-3.5 (gpt-3.5-turbo-0125), GPT-4o-mini (gpt-4o-mini-2024-07-18) and GPT4 (gpt-4-0125-preview).