Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Authors: Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
Researcher Affiliation | Collaboration | Abhay Sheshadri, Georgia Institute of Technology, MATS EMAIL; Aidan Ewart, University of Bristol, MATS EMAIL; Phillip Guo, University of Maryland, MATS EMAIL; Aengus Lynch, University College London, MATS EMAIL; Cindy Wu, MATS EMAIL; Vivek Hebbar, Astra EMAIL; Henry Sleight, MATS EMAIL; Asa Cooper Stickland, New York University EMAIL; Ethan Perez, Anthropic EMAIL; Dylan Hadfield-Menell, MIT CSAIL EMAIL; Stephen Casper, MIT CSAIL EMAIL
Pseudocode | No | The paper describes loss functions and methods in paragraph text and mathematical equations, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code is available at github.com/aengusl/latent-adversarial-training. Models are available at huggingface.co/LLM-LAT. Chat with our jailbreaking-robust model at abhayesian.com/lat-chat.
Open Datasets | Yes | We first generate a set of harmful user requests by few-shot prompting Mistral-7B (Jiang et al., 2023) with harmful requests seeded by AdvBench (Zou et al., 2023b)... In all experiments, we use the UltraChat dataset (Ding et al., 2023) as a benign fine-tuning dataset Db... We train all models using the helpful and harmless splits of Anthropic's HH-RLHF preference dataset (Bai et al., 2022)... As in Li et al. (2024a), we use the WMDP biology and cyber corpora as forget datasets and WikiText (Merity et al., 2016) as a retain dataset.
Dataset Splits | Yes | In all experiments, we use the UltraChat dataset (Ding et al., 2023) as a benign fine-tuning dataset Db to preserve the model's performance... We train all models using the helpful and harmless splits of Anthropic's HH-RLHF preference dataset (Bai et al., 2022)... As in Li et al. (2024a), we use the WMDP biology and cyber corpora as forget datasets and WikiText (Merity et al., 2016) as a retain dataset.
Hardware Specification | Yes | All experiments were run on a single A100 or H100 GPU except for ones involving R2D2 (Li et al., 2024a) in Section 4.1, which were run on eight. All training runs lasted less than 12 hours of wall-clock time.
Software Dependencies | Yes | For all evaluations, we use 1,000 samples on lm-evaluation-harness v0.4.0 (Gao et al., 2023), as done in Li et al. (2024a).
Experiment Setup | Yes | We attack the residual stream of transformer LLMs with L2-norm-bounded perturbations, calculated using projected gradient descent (PGD) (Madry et al., 2017)... After experimenting with different choices of layers (1, 2, 3, 4, 10, 16, 22, and 28), we found that the simple heuristic of selecting four evenly spaced layers worked well across models and experiments. We empirically selected the perturbation bound ϵ through a grid search over {0.5, 1.0, 2.5, 6.0, 10.0}... We implement refusal training based on Mazeika et al. (2024) using both a toward and away loss term calculated with respect to harmless/harmful example pairs... We perform LAT using latent-space adversaries at layers 8, 16, 24, and 30, which are jointly optimized to minimize the RT loss with the harmful/harmless labels flipped... We fine-tune the models from Rando et al. (2024) using direct preference optimization (DPO) (Rafailov et al., 2024) and DPO with LAT for 1024 steps on batches of size 16... We attack hidden layers 4, 12, 20, and 28... For all methods, we train on 100 batches of size 16 for 4 steps each.
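The core procedure quoted in the Experiment Setup row, PGD over an L2-norm-bounded perturbation added to a hidden activation, can be sketched in minimal form. This is an illustrative reconstruction, not the paper's code: the toy MLP, the single perturbed layer, and all hyperparameters (`eps`, `step_size`, `n_steps`, dimensions) are stand-ins for the transformer residual-stream setup described above.

```python
# Hypothetical sketch of latent-space PGD: an adversary optimizes a
# perturbation delta added to a hidden activation, projecting delta back
# onto an L2 ball of radius eps after each step. The model below is a toy
# stand-in for a transformer's residual stream, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Toy 2-layer MLP; layer1's output plays the role of a latent activation."""
    def __init__(self, d=16):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 2)

    def forward(self, x, delta=None):
        h = torch.relu(self.layer1(x))
        if delta is not None:
            h = h + delta  # perturb the hidden ("latent") activation
        return self.layer2(h)

def latent_pgd(model, x, target, eps=1.0, step_size=0.1, n_steps=16, d=16):
    """PGD on a latent perturbation, constrained to ||delta||_2 <= eps per example."""
    delta = torch.zeros(x.size(0), d, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_steps):
        # The targeted adversary *minimizes* loss on its competing task.
        loss = loss_fn(model(x, delta), target)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= step_size * grad
            # Project back onto the L2 ball of radius eps.
            norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-12)
            delta *= (eps / norm).clamp(max=1.0)
    return delta.detach()

model = TinyModel()
x = torch.randn(4, 16)
target = torch.tensor([1, 0, 1, 0])  # adversary's desired labels
delta = latent_pgd(model, x, target, eps=1.0)
print(delta.norm(dim=-1))  # each per-example norm stays within eps
```

In the paper's targeted LAT setting, a loop like this would run as the inner step of training, with the defender then updating model weights against the perturbed activations; here only the inner attack is shown.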