Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Authors: Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
Researcher Affiliation | Collaboration | Abhay Sheshadri, Georgia Institute of Technology, MATS EMAIL; Aidan Ewart, University of Bristol, MATS EMAIL; Phillip Guo, University of Maryland, MATS EMAIL; Aengus Lynch, University College London, MATS EMAIL; Cindy Wu, MATS EMAIL; Vivek Hebbar, Astra EMAIL; Henry Sleight, MATS EMAIL; Asa Cooper Stickland, New York University EMAIL; Ethan Perez, Anthropic EMAIL; Dylan Hadfield-Menell, MIT CSAIL EMAIL; Stephen Casper, MIT CSAIL EMAIL
Pseudocode | No | The paper describes loss functions and methods in paragraph text and mathematical equations, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code is available at github.com/aengusl/latent-adversarial-training. Models are available at huggingface.co/LLM-LAT. Chat with our jailbreaking-robust model at abhayesian.com/lat-chat.
Open Datasets | Yes | We first generate a set of harmful user requests by few-shot prompting Mistral-7B (Jiang et al., 2023) with harmful requests seeded by AdvBench (Zou et al., 2023b)... In all experiments, we use the UltraChat dataset (Ding et al., 2023) as a benign fine-tuning dataset Db... We train all models using the helpful and harmless splits of Anthropic's HH-RLHF preference dataset (Bai et al., 2022)... As in Li et al. (2024a), we use the WMDP biology and cyber corpora as forget datasets and WikiText (Merity et al., 2016) as a retain dataset.
Dataset Splits | Yes | In all experiments, we use the UltraChat dataset (Ding et al., 2023) as a benign fine-tuning dataset Db to preserve the model's performance... We train all models using the helpful and harmless splits of Anthropic's HH-RLHF preference dataset (Bai et al., 2022)... As in Li et al. (2024a), we use the WMDP biology and cyber corpora as forget datasets and WikiText (Merity et al., 2016) as a retain dataset.
Hardware Specification | Yes | All experiments were run on a single A100 or H100 GPU except for ones involving R2D2 (Li et al., 2024a) in Section 4.1, which were run on eight. All training runs lasted less than 12 hours of wall-clock time.
Software Dependencies | Yes | For all evaluations, we use 1,000 samples on lm-evaluation-harness v0.4.0 (Gao et al., 2023), as done in Li et al. (2024a).
Experiment Setup | Yes | We attack the residual stream of transformer LLMs with L2-norm-bounded perturbations, calculated using projected gradient descent (PGD) (Madry et al., 2017)... After experimenting with different choices of layers (1, 2, 3, 4, 10, 16, 22, and 28), we found that the simple heuristic of selecting four evenly spaced layers worked well across models and experiments. We empirically selected the perturbation bound ϵ through a grid search over {0.5, 1.0, 2.5, 6.0, 10.0}... We implement refusal training based on Mazeika et al. (2024) using both a toward and away loss term calculated with respect to harmless/harmful example pairs... We perform LAT using latent-space adversaries at layers 8, 16, 24, and 30, which are jointly optimized to minimize the RT loss with the harmful/harmless labels flipped... We fine-tune the models from Rando et al. (2024) using direct preference optimization (DPO) (Rafailov et al., 2024) and DPO with LAT for 1024 steps on batches of size 16... We attack hidden layers 4, 12, 20, and 28... For all methods, we train on 100 batches of size 16 for 4 steps each.
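The core procedure quoted in the Experiment Setup row, PGD over an L2-norm-bounded perturbation added to a hidden activation, can be sketched in minimal form. This is an illustrative reconstruction, not the paper's code: the toy MLP, the single perturbed layer, and all hyperparameters (`eps`, `step_size`, `n_steps`, dimensions) are stand-ins for the transformer residual-stream setup described above.

```python
# Hypothetical sketch of latent-space PGD: an adversary optimizes a
# perturbation delta added to a hidden activation, projecting delta back
# onto an L2 ball of radius eps after each step. The model below is a toy
# stand-in for a transformer's residual stream, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Toy 2-layer MLP; layer1's output plays the role of a latent activation."""
    def __init__(self, d=16):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 2)

    def forward(self, x, delta=None):
        h = torch.relu(self.layer1(x))
        if delta is not None:
            h = h + delta  # perturb the hidden ("latent") activation
        return self.layer2(h)

def latent_pgd(model, x, target, eps=1.0, step_size=0.1, n_steps=16, d=16):
    """PGD on a latent perturbation, constrained to ||delta||_2 <= eps per example."""
    delta = torch.zeros(x.size(0), d, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_steps):
        # The targeted adversary *minimizes* loss on its competing task.
        loss = loss_fn(model(x, delta), target)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= step_size * grad
            # Project back onto the L2 ball of radius eps.
            norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-12)
            delta *= (eps / norm).clamp(max=1.0)
    return delta.detach()

model = TinyModel()
x = torch.randn(4, 16)
target = torch.tensor([1, 0, 1, 0])  # adversary's desired labels
delta = latent_pgd(model, x, target, eps=1.0)
print(delta.norm(dim=-1))  # each per-example norm stays within eps
```

In the paper's targeted LAT setting, a loop like this would run as the inner step of training, with the defender then updating model weights against the perturbed activations; here only the inner attack is shown.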