Single-pass Detection of Jailbreaking Input in Large Language Models
Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a thorough evaluation on open-source LLMs, e.g., Llama 2, Llama 3 and Vicuna. Our results showcase that, in comparison to existing approaches, SPD attains both high efficiency and detection rate when identifying unsafe sentences. We demonstrate that even without accessing the full logit of models, SPD can still be a promising approach, as evidenced by testing on GPT-3.5, GPT-4 and GPT-4o-mini. |
| Researcher Affiliation | Academia | Leyla Naz Candogan EMAIL LIONS École Polytechnique Fédérale de Lausanne, Yongtao Wu EMAIL LIONS École Polytechnique Fédérale de Lausanne, Elias Abad Rocamora EMAIL LIONS École Polytechnique Fédérale de Lausanne, Grigorios G. Chrysos EMAIL University of Wisconsin-Madison, Volkan Cevher EMAIL LIONS École Polytechnique Fédérale de Lausanne |
| Pseudocode | No | The paper describes the SPD method verbally and mathematically (e.g., eq. 3 for feature matrix H), but does not present a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code and data available at https://github.com/LIONS-EPFL/SPD. |
| Open Datasets | Yes | Dataset: We used four jailbreaking and two benign datasets. Jailbreaking: GCG (Zou et al., 2023), AutoDAN (Liu et al., 2024a), PAIR (Chao et al., 2023) and PAP (Zeng et al., 2024a). To measure the FP rate, we use two benign datasets: AlpacaEval (Dubois et al., 2024) and QNLI (Wang et al., 2018). ... Adaptive (Andriushchenko et al., 2024): We used the dataset provided at https://github.com/tml-epfl/llm-adaptive-attacks. |
| Dataset Splits | Yes | We split the datasets into test and training sets so that there is no overlap between them. Within a model, all baselines are evaluated on the same test data. Dataset sizes are provided in table 1. |
| Hardware Specification | Yes | All experiments were conducted on a single machine with an NVIDIA A100 SXM4 80GB GPU. |
| Software Dependencies | No | The paper mentions using the 'vLLM API service' and 'OpenAI API access' for models, and an 'SVM with the RBF kernel' as a classifier, but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Since the influence of the input on the logit distribution is higher with smaller i, after some testing, we set r = 5 and k = 50, see appendix E.5. ... For the self-perplexity filter, as suggested in the original paper, we set the threshold to the maximum perplexity of prompts in the AdvBench dataset. We used the default parameters (threshold 0.2, dropping rate 0.3 and sampling number 20) for RA-LLM; for SmoothLLM, we tested all three approaches, swap, patch, and insert, with perturbation percentage q = 10% and number of iterations N = 10. |
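The pipeline implied by the table — build a feature matrix from the top-k logit values of the first r output positions (r = 5, k = 50) and feed it to an SVM with the RBF kernel — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `spd_features` helper is hypothetical, the random logits stand in for real model outputs, and the exact construction of the feature matrix H (eq. 3 in the paper) may differ.

```python
import numpy as np
from sklearn.svm import SVC

R, K = 5, 50  # first R output positions, top-K logits per position (paper's setting)

def spd_features(logits: np.ndarray, r: int = R, k: int = K) -> np.ndarray:
    """Hypothetical feature extractor: concatenate the top-k sorted logit
    values of the first r output positions into one (r * k)-dim vector."""
    rows = [np.sort(logits[i])[::-1][:k] for i in range(r)]  # descending top-k
    return np.concatenate(rows)

# Stand-in data: random logits as a placeholder for real model forward passes.
rng = np.random.default_rng(0)
vocab_size = 32000  # e.g., the Llama 2 vocabulary size
X = np.stack([spd_features(rng.normal(size=(R, vocab_size))) for _ in range(40)])
y = np.array([0, 1] * 20)  # synthetic labels: 0 = benign, 1 = jailbreaking

# Classifier named in the table: SVM with the RBF kernel.
clf = SVC(kernel="rbf").fit(X, y)
preds = clf.predict(X)
```

A single forward pass per prompt suffices to populate `X`, which is what makes the detector "single-pass"; only the classifier choice and (r, k) come from the paper.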