Single-pass Detection of Jailbreaking Input in Large Language Models

Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a thorough evaluation on open-source LLMs, e.g., Llama 2, Llama 3, and Vicuna. Our results showcase that, in comparison to existing approaches, SPD attains both high efficiency and a high detection rate when identifying unsafe sentences. We demonstrate that even without accessing the full logits of models, SPD can still be a promising approach, as evidenced by testing on GPT-3.5, GPT-4, and GPT-4o-mini."
Researcher Affiliation | Academia | Leyla Naz Candogan (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne; Yongtao Wu (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne; Elias Abad Rocamora (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne; Grigorios G. Chrysos (EMAIL), University of Wisconsin-Madison; Volkan Cevher (EMAIL), LIONS, École Polytechnique Fédérale de Lausanne
Pseudocode | No | The paper describes the SPD method verbally and mathematically (e.g., Eq. 3 for the feature matrix H), but does not present a structured pseudocode or algorithm block.
Open Source Code | Yes | "Code and data available at https://github.com/LIONS-EPFL/SPD."
Open Datasets | Yes | "We used four jailbreaking and two benign datasets: GCG (Zou et al., 2023), AutoDAN (Liu et al., 2024a), PAIR (Chao et al., 2023), and PAP (Zeng et al., 2024a). To measure the FP rate, we use two benign datasets: AlpacaEval (Dubois et al., 2024) and QNLI (Wang et al., 2018). ... Adaptive (Andriushchenko et al., 2024): We used the dataset provided in https://github.com/tml-epfl/llm-adaptive-attacks."
Dataset Splits | Yes | "We split the datasets into test and training sets so that there is no overlap between them. Within a model, all baselines are evaluated on the same test data. Dataset sizes are provided in Table 1."
Hardware Specification | Yes | "All experiments were conducted on a single machine with an NVIDIA A100 SXM4 80GB GPU."
Software Dependencies | No | The paper mentions using the 'vLLM API service' and 'OpenAI API access' for models, and an 'SVM with the RBF kernel' as a classifier, but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | "Since the influence of the input on the logit distribution is higher with smaller i, after some testing, we set r = 5 and k = 50; see Appendix E.5. ... For the self-perplexity filter, as suggested in the original paper, we set the threshold to the maximum perplexity of the prompts in the AdvBench dataset. We used the default parameters for RA-LLM (threshold 0.2, dropping rate 0.3, and sampling number 20), while for SmoothLLM we tested all three approaches (swap, patch, and insert) with perturbation percentage q = 10% and N = 10 iterations."
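For context, the feature construction quoted above (keep the top-k logits at each of the first r output positions, here r = 5 and k = 50, then classify with an SVM using an RBF kernel) can be sketched as follows. This is a minimal illustration assuming per-position next-token logits are available as a numpy array; the function name, the synthetic data, and the exact ordering of the top-k values are assumptions, not the authors' implementation.

```python
import numpy as np

def spd_features(logits: np.ndarray, r: int = 5, k: int = 50) -> np.ndarray:
    """Sketch of an SPD-style feature vector (cf. the paper's feature matrix H, Eq. 3).

    logits: array of shape (num_positions, vocab_size) holding the model's
            next-token logits at each generated position.
    Returns a flattened vector of the top-k logits from the first r positions.
    """
    top_k = np.sort(logits[:r], axis=-1)[:, -k:]  # top-k logits per position, ascending
    return top_k.reshape(-1)                      # feature vector of length r * k

# Toy usage: random values standing in for a real model's logits
# (32000 matches the Llama-2 vocabulary size).
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(10, 32000))
features = spd_features(fake_logits)
print(features.shape)  # (250,)
```

With r = 5 and k = 50 this yields a 250-dimensional vector per prompt; a classifier such as an RBF-kernel SVM would then be trained on these vectors computed for known benign and jailbreaking prompts.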