Does Refusal Training in LLMs Generalize to the Past Tense?
Authors: Maksym Andriushchenko, Nicolas Flammarion
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past-tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. [...] Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. |
| Researcher Affiliation | Academia | Maksym Andriushchenko EPFL Nicolas Flammarion EPFL |
| Pseudocode | No | The paper describes methods in prose, such as the reformulation prompt in Table 2, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide code and jailbreak artifacts at https://github.com/tml-epfl/llm-past-tense. |
| Open Datasets | Yes | We use 100 harmful behaviors from JBB-Behaviors (Chao et al., 2024) that span 10 harm categories based on the OpenAI usage policy. [...] We also add standard conversations from OpenHermes-2.5 (Teknium, 2023) to the fine-tuning set to make sure the model does not refuse too frequently. |
| Dataset Splits | Yes | We use 100 harmful behaviors from JBB-Behaviors (Chao et al., 2024) that span 10 harm categories based on the OpenAI usage policy. We conduct 20 reformulations per behavior using GPT-4 as a semantic jailbreak judge on each iteration, in line with the methodology of prior works such as Chao et al. (2023). [...] We use the OpenAI fine-tuning service to fine-tune gpt-3.5-turbo-0125 on 394 past-tense reformulations of 50 random JBB-Behaviors paired with a refusal message ("Sorry, I can't help with that"). We use the remaining 50 JBB-Behaviors as a test set. We also add standard conversations from OpenHermes-2.5 (Teknium, 2023) to the fine-tuning set to make sure the model does not refuse too frequently. [...] We use the following proportions: 2%/98%, 5%/95%, 10%/90%, and 30%/70%. |
| Hardware Specification | No | The paper mentions using LLM APIs like GPT-3.5 Turbo and fine-tuning services from OpenAI, but does not provide specific hardware details (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | Yes | To automatically reformulate an arbitrary request, we use GPT-3.5 Turbo with the prompt shown in Table 2... Our experiments on fine-tuning GPT-3.5 Turbo show that producing refusals on past-tense reformulations is straightforward if one explicitly includes them in the fine-tuning dataset. [...] We use the OpenAI fine-tuning service to fine-tune gpt-3.5-turbo-0125 on 394 past-tense reformulations of 50 random JBB-Behaviors paired with a refusal message ("Sorry, I can't help with that"). |
| Experiment Setup | Yes | To automatically reformulate an arbitrary request, we use GPT-3.5 Turbo with the prompt shown in Table 2 that relies on a few illustrative examples. [...] We leverage the inherent variability in language model outputs due to sampling and use the temperature parameter equal to one both for the target and reformulation LLMs. [...] We use the OpenAI fine-tuning service to fine-tune gpt-3.5-turbo-0125 on 394 past-tense reformulations of 50 random JBB-Behaviors paired with a refusal message ("Sorry, I can't help with that"). [...] We keep the same number of reformulations and increase the number of standard conversations to get different proportions of reformulations vs. standard data. We use the following proportions: 2%/98%, 5%/95%, 10%/90%, and 30%/70%. |
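The evaluation loop quoted in the table (up to 20 temperature-1 past-tense reformulations per behavior, with a semantic judge deciding success) can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `past_tense_attack` and the callable stand-ins are hypothetical, and in the paper's actual runs the three roles are API calls to GPT-3.5 Turbo (reformulation), the target model, and GPT-4 (judge).

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins for the three LLM roles in the paper's pipeline.
Reformulator = Callable[[str], str]   # harmful request -> past-tense rephrasing
Target = Callable[[str], str]         # prompt -> target model response
Judge = Callable[[str, str], bool]    # (request, response) -> jailbreak succeeded?

def past_tense_attack(request: str,
                      reformulate: Reformulator,
                      target: Target,
                      judge: Judge,
                      max_attempts: int = 20) -> Tuple[bool, Optional[str]]:
    """Try up to `max_attempts` past-tense reformulations of `request`.

    Because the reformulation model is sampled at temperature 1, each
    attempt can yield a different rephrasing; the paper reports attack
    success over the best of 20 such attempts per behavior.
    Returns (success, successful_prompt_or_None).
    """
    for _ in range(max_attempts):
        candidate = reformulate(request)   # e.g. "How did people X?" instead of "How to X?"
        response = target(candidate)
        if judge(request, response):       # semantic judge w.r.t. the original request
            return True, candidate
    return False, None
```

A usage sketch with trivial stub callables: a reformulator that rewrites into the past tense, a target that only refuses present-tense requests, and a judge that flags non-refusals.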
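The fine-tuning mixtures in the last row (2%/98% through 30%/70%, with the 394 reformulations held fixed) imply a specific number of standard conversations per mix. A back-of-the-envelope helper, hypothetical and not from the paper's code, makes the arithmetic explicit:

```python
def standard_conversations_needed(n_reformulations: int,
                                  reformulation_share: float) -> int:
    """Number of standard conversations to add so that reformulations
    make up `reformulation_share` of the fine-tuning set.

    Solves n / (n + s) = share for s, rounding to the nearest integer.
    """
    return round(n_reformulations * (1 - reformulation_share) / reformulation_share)

# With the paper's 394 reformulations, a 2%/98% mix requires
# standard_conversations_needed(394, 0.02) standard conversations.
```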