Does Refusal Training in LLMs Generalize to the Past Tense?
Authors: Maksym Andriushchenko, Nicolas Flammarion
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past-tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. [...] Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. |
| Researcher Affiliation | Academia | Maksym Andriushchenko EPFL Nicolas Flammarion EPFL |
| Pseudocode | No | The paper describes methods in prose, such as the reformulation prompt in Table 2, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide code and jailbreak artifacts at https://github.com/tml-epfl/llm-past-tense. |
| Open Datasets | Yes | We use 100 harmful behaviors from JBB-Behaviors (Chao et al., 2024) that span 10 harm categories based on the OpenAI usage policy. [...] We also add standard conversations from OpenHermes-2.5 (Teknium, 2023) to the fine-tuning set to make sure the model does not refuse too frequently. |
| Dataset Splits | Yes | We use 100 harmful behaviors from JBB-Behaviors (Chao et al., 2024) that span 10 harm categories based on the OpenAI usage policy. We conduct 20 reformulations per behavior using GPT-4 as a semantic jailbreak judge on each iteration, in line with the methodology of prior works such as Chao et al. (2023). [...] We use the OpenAI fine-tuning service to fine-tune gpt-3.5-turbo-0125 on 394 past-tense reformulations of 50 random JBB-Behaviors paired with a refusal message ("Sorry, I can't help with that"). We use the remaining 50 JBB-Behaviors as a test set. We also add standard conversations from OpenHermes-2.5 (Teknium, 2023) to the fine-tuning set to make sure the model does not refuse too frequently. [...] We use the following proportions: 2%/98%, 5%/95%, 10%/90%, and 30%/70%. |
| Hardware Specification | No | The paper mentions using LLM APIs like GPT-3.5 Turbo and fine-tuning services from OpenAI, but does not provide specific hardware details (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | Yes | To automatically reformulate an arbitrary request, we use GPT-3.5 Turbo with the prompt shown in Table 2... Our experiments on fine-tuning GPT-3.5 Turbo show that producing refusals on past-tense reformulations is straightforward if one explicitly includes them in the fine-tuning dataset. [...] We use the OpenAI fine-tuning service to fine-tune gpt-3.5-turbo-0125 on 394 past-tense reformulations of 50 random JBB-Behaviors paired with a refusal message ("Sorry, I can't help with that"). |
| Experiment Setup | Yes | To automatically reformulate an arbitrary request, we use GPT-3.5 Turbo with the prompt shown in Table 2 that relies on a few illustrative examples. [...] We leverage the inherent variability in language model outputs due to sampling and use the temperature parameter equal to one both for the target and reformulation LLMs. [...] We use the OpenAI fine-tuning service to fine-tune gpt-3.5-turbo-0125 on 394 past-tense reformulations of 50 random JBB-Behaviors paired with a refusal message ("Sorry, I can't help with that"). [...] We keep the same number of reformulations and increase the number of standard conversations to get different proportions of reformulations vs. standard data. We use the following proportions: 2%/98%, 5%/95%, 10%/90%, and 30%/70%. |
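The evaluation loop quoted in the table (up to 20 temperature-1 past-tense reformulations per behavior, with a semantic judge deciding success) can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `past_tense_attack` and the callable stand-ins are hypothetical, and in the paper's actual runs the three roles are API calls to GPT-3.5 Turbo (reformulation), the target model, and GPT-4 (judge).

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins for the three LLM roles in the paper's pipeline.
Reformulator = Callable[[str], str]   # harmful request -> past-tense rephrasing
Target = Callable[[str], str]         # prompt -> target model response
Judge = Callable[[str, str], bool]    # (request, response) -> jailbreak succeeded?

def past_tense_attack(request: str,
                      reformulate: Reformulator,
                      target: Target,
                      judge: Judge,
                      max_attempts: int = 20) -> Tuple[bool, Optional[str]]:
    """Try up to `max_attempts` past-tense reformulations of `request`.

    Because the reformulation model is sampled at temperature 1, each
    attempt can yield a different rephrasing; the paper reports attack
    success over the best of 20 such attempts per behavior.
    Returns (success, successful_prompt_or_None).
    """
    for _ in range(max_attempts):
        candidate = reformulate(request)   # e.g. "How did people X?" instead of "How to X?"
        response = target(candidate)
        if judge(request, response):       # semantic judge w.r.t. the original request
            return True, candidate
    return False, None
```

A usage sketch with trivial stub callables: a reformulator that rewrites into the past tense, a target that only refuses present-tense requests, and a judge that flags non-refusals.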
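The fine-tuning mixtures in the last row (2%/98% through 30%/70%, with the 394 reformulations held fixed) imply a specific number of standard conversations per mix. A back-of-the-envelope helper, hypothetical and not from the paper's code, makes the arithmetic explicit:

```python
def standard_conversations_needed(n_reformulations: int,
                                  reformulation_share: float) -> int:
    """Number of standard conversations to add so that reformulations
    make up `reformulation_share` of the fine-tuning set.

    Solves n / (n + s) = share for s, rounding to the nearest integer.
    """
    return round(n_reformulations * (1 - reformulation_share) / reformulation_share)

# With the paper's 394 reformulations, a 2%/98% mix requires
# standard_conversations_needed(394, 0.02) standard conversations.
```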