Single-pass Detection of Jailbreaking Input in Large Language Models

Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a thorough evaluation on open-source LLMs, e.g., Llama 2, Llama 3, and Vicuna. Our results showcase that, in comparison to existing approaches, SPD attains both high efficiency and a high detection rate when identifying unsafe sentences. We demonstrate that even without accessing the full logits of models, SPD can still be a promising approach, as evidenced by testing on GPT-3.5, GPT-4, and GPT-4o-mini."
Researcher Affiliation | Academia | Leyla Naz Candogan (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne; Yongtao Wu (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne; Elias Abad Rocamora (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne; Grigorios G. Chrysos (EMAIL), University of Wisconsin-Madison; Volkan Cevher (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne
Pseudocode | No | The paper describes the SPD method verbally and mathematically (e.g., Eq. 3 for the feature matrix H), but does not present a structured pseudocode or algorithm block.
Open Source Code | Yes | "Code and data available at https://github.com/LIONS-EPFL/SPD."
Open Datasets | Yes | "We used four jailbreaking and two benign datasets: GCG (Zou et al., 2023), AutoDAN (Liu et al., 2024a), PAIR (Chao et al., 2023), and PAP (Zeng et al., 2024a). To measure the FP rate, we use two benign datasets: AlpacaEval (Dubois et al., 2024) and QNLI (Wang et al., 2018). ... Adaptive (Andriushchenko et al., 2024): We used the dataset provided in https://github.com/tml-epfl/llm-adaptive-attacks."
Dataset Splits | Yes | "We split the datasets into test and training sets so that there is no overlap between them. Within a model, all baselines are evaluated on the same test data. Dataset sizes are provided in Table 1."
Hardware Specification | Yes | "All experiments were conducted on a single machine with an NVIDIA A100 SXM4 80GB GPU."
Software Dependencies | No | The paper mentions using the 'vLLM API service' and 'OpenAI API access' for models, and an 'SVM with the RBF kernel' as a classifier, but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | "Since the influence of the input on the logit distribution is higher with smaller i, after some testing, we set r = 5 and k = 50; see Appendix E.5. ... For the self-perplexity filter, as suggested in the original paper, we set the threshold to the maximum perplexity of the prompts in the AdvBench dataset. We used the default parameters for RA-LLM (threshold 0.2, dropping rate 0.3, and sampling number 20), while for SmoothLLM we tested all three approaches (swap, patch, and insert) with perturbation percentage q = 10% and N = 10 iterations."
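For context, the feature construction quoted above (keep the top-k logits at each of the first r output positions, here r = 5 and k = 50, then classify with an SVM using an RBF kernel) can be sketched as follows. This is a minimal illustration assuming per-position next-token logits are available as a numpy array; the function name, the synthetic data, and the exact ordering of the top-k values are assumptions, not the authors' implementation.

```python
import numpy as np

def spd_features(logits: np.ndarray, r: int = 5, k: int = 50) -> np.ndarray:
    """Sketch of an SPD-style feature vector (cf. the paper's feature matrix H, Eq. 3).

    logits: array of shape (num_positions, vocab_size) holding the model's
            next-token logits at each generated position.
    Returns a flattened vector of the top-k logits from the first r positions.
    """
    top_k = np.sort(logits[:r], axis=-1)[:, -k:]  # top-k logits per position, ascending
    return top_k.reshape(-1)                      # feature vector of length r * k

# Toy usage: random values standing in for a real model's logits
# (32000 matches the Llama-2 vocabulary size).
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(10, 32000))
features = spd_features(fake_logits)
print(features.shape)  # (250,)
```

With r = 5 and k = 50 this yields a 250-dimensional vector per prompt; a classifier such as an RBF-kernel SVM would then be trained on these vectors computed for known benign and jailbreaking prompts.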