SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Authors: Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across a range of popular LLMs, SmoothLLM offers improved robustness against the GCG, PAIR, Random Search, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible, trade-off between robustness and nominal performance, and is compatible with any LLM.
Researcher Affiliation | Academia | Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas EMAIL; School of Engineering and Applied Science, University of Pennsylvania
Pseudocode | Yes | Figure 5: SmoothLLM: A randomized defense. (Right) Pseudocode for SmoothLLM. In lines 2-4, we outline the perturbation step. Next, in line 5, we determine whether a γ-fraction of the responses jailbreak the target LLM. Finally, in line 6, we select a response uniformly at random that is consistent with the majority vote.
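The quoted pseudocode can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper names `llm`, `is_jailbroken`, and `random_swap_perturbation` are hypothetical, caller-supplied stand-ins, and the perturbation shown is the random character-swap variant described in the paper.

```python
import random
import string

def random_swap_perturbation(prompt: str, q: float) -> str:
    """Replace a random q% of characters with printable ASCII (sketch)."""
    chars = list(prompt)
    n_swap = int(len(chars) * q / 100)
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_llm(prompt, llm, is_jailbroken, n=10, q=10, gamma=0.5):
    """SmoothLLM-style randomized defense (sketch).

    llm(prompt) -> response and is_jailbroken(response) -> bool are
    caller-supplied; gamma=1/2 recovers a simple majority vote.
    """
    # Lines 2-4 of the pseudocode: query the LLM on N perturbed copies.
    responses = [llm(random_swap_perturbation(prompt, q)) for _ in range(n)]
    votes = [is_jailbroken(r) for r in responses]
    # Line 5: decide whether a gamma-fraction of responses were jailbroken.
    majority = sum(votes) / len(votes) > gamma
    # Line 6: return a uniformly random response consistent with the vote.
    consistent = [r for r, v in zip(responses, votes) if v == majority]
    return random.choice(consistent)
```

With a benign judge that never flags a jailbreak, the defense simply returns one of the N responses at random.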
Open Source Code | No | The paper does not provide an explicit statement of code release or a direct link to a code repository for the SmoothLLM methodology. Mentions of the "authors' implementation of GCG" refer to third-party code, not their own.
Open Datasets | Yes | In Figure 1, we show the performance of four attacks, namely GCG (Zou et al., 2023b), PAIR (Chao et al., 2023), Random Search (Andriushchenko et al., 2024), and AmpleGCG (Liao & Sun, 2024), when evaluated against an undefended LLM and an LLM defended with SmoothLLM. In each subplot, we use the datasets used in each of the attack papers (i.e., AdvBench (Zou et al., 2023b) for GCG, Random Search, and AmpleGCG, and JBB-Behaviors (Chao et al., 2023) for PAIR). To evaluate the nominal performance of SmoothLLM, we consider four NLP benchmarks: Instruction Following (IF) (Zhou et al., 2023), PIQA (Bisk et al., 2020), OpenBookQA (Mihaylov et al., 2018), and ToxiGen (Hartvigsen et al., 2022).
Dataset Splits | No | The paper mentions using specific datasets for evaluation (e.g., AdvBench, JBB-Behaviors, PIQA, OpenBookQA, ToxiGen) but does not provide details on how these datasets were split into training, validation, or test sets for their own experiments. It references the use of GCG's configuration for obtaining suffixes, but not for general dataset splitting.
Hardware Specification | Yes | All experiments in this paper were run on a cluster with 8 NVIDIA A100 GPUs and 16 NVIDIA A6000 GPUs. Table 8: SmoothLLM running time. We list the running time per prompt of SmoothLLM when run with various values of N, averaged over five trials. For Vicuna and Llama2, we ran SmoothLLM on A100 and A6000 GPUs, respectively.
Software Dependencies | No | The paper mentions using "Python's native string library" and Python's random.choice function but does not specify version numbers for Python or any other key software libraries used in their implementation of SmoothLLM.
Experiment Setup | Yes | Thus, Algorithm 1 involves three parameters: the number of samples N, the perturbation percentage q, and the margin for the majority vote γ (which, unless otherwise stated, we set to be 1/2). In our experiments, we found that SmoothLLM offers competitive robustness against GCG, PAIR, Random Search, and AmpleGCG. To generate Figure 4, we obtained adversarial suffixes for Llama2 and Vicuna by running the authors' implementation of GCG for every prompt in the behaviors dataset described in (Zou et al., 2023b). We then ran SmoothLLM for N ∈ {2, 4, 6, 8, 10} and q ∈ {5, 10, 15, 20} across five independent trials. Table 3: Parameters used to compute the DSP. We list the parameters used to compute the DSP in Figures 6 and 11. The only difference between these two figures are the choices of m and m_S.
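The reported grid of hyperparameters (N, q, five trials) can be sketched as a simple sweep. This is a hypothetical harness, not the authors' code; `evaluate` is a caller-supplied function that runs one trial of the defense and returns an attack success rate.

```python
from itertools import product

# Grid reported in the paper: N in {2,4,6,8,10}, q in {5,10,15,20}, 5 trials each.
N_VALUES = [2, 4, 6, 8, 10]
Q_VALUES = [5, 10, 15, 20]
N_TRIALS = 5

def run_sweep(evaluate):
    """evaluate(n, q) -> attack success rate in [0, 1] for one trial.

    Returns a dict mapping (n, q) to the mean success rate over trials.
    """
    results = {}
    for n, q in product(N_VALUES, Q_VALUES):
        trials = [evaluate(n, q) for _ in range(N_TRIALS)]
        results[(n, q)] = sum(trials) / N_TRIALS
    return results
```

The sweep produces one averaged number per (N, q) cell, matching the 5 x 4 grid of configurations described above.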