Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Authors: Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
Researcher Affiliation | Academia | Stephen Casper (MIT CSAIL), Lennart Schulze (Columbia University), Oam Patel (Harvard University), Dylan Hadfield-Menell (MIT CSAIL)
Pseudocode | Yes | Algorithm 1: Latent Adversarial Training (LAT). Require: training dataset {(x_i, y_i)}_{i=1}^N; model parameters θ = (θ_1, θ_2); feature extractor (at some layer) f_{θ_1}; latent-to-output mapping g_{θ_2}; loss function L; perturbation norm ||·||_p; constraint ε; learning rates η_θ (model) and η_δ (adversarial); and number of inner-loop steps T_δ.
Open Source Code | Yes | Code is available at https://github.com/thestephencasper/latent_adversarial_training. See also https://github.com/aengusl/latent-adversarial-training.
Open Datasets | Yes | We experiment with three different tasks: image classification on ImageNet (Russakovsky et al., 2014), text classification on the Redwood injurious text dataset (Ziegler et al., 2022), and text generation on the Anthropic Helpful-Harmless RLHF (Bai et al., 2022) and PKU BeaverTails (Ji et al., 2023) data distributions.
Dataset Splits | Yes | We used a ResNet-50 from He et al. (2016) and fine-tuned it on a version of the ImageNet (Russakovsky et al., 2014) training set that was poisoned as in Casper et al. (2023)... We then fine-tuned the model on clean ImageNet training data for one epoch... We evaluated the resulting models on (1) the clean ImageNet test set... We did this using the base dataset from Ziegler et al. (2022) and subsampled to balance positive and negative training examples... To set up both experiments, we first fine-tuned the model on a mixture of 10k desirable and 10k undesirable examples. We also added 8 backdoors by poisoning 25 desirable examples each... We then fine-tuned on 10k desirable examples using RLP, AT, and LAT.
Hardware Specification | No | This work was conducted in part using compute from the Center for AI Safety.
Software Dependencies | No | The paper mentions models such as ResNet-50, DeBERTa-v3-large, and Llama-2-7b-chat, and algorithms such as PGD, but does not specify software versions for libraries (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup | Yes | In each experiment, we compare three methods: AT, LAT, and training under random latent perturbations (RLP)... We select the latent layer to perturb by sweeping across layers for high clean and robust performance. We converged to the heuristics of perturbing the first post-convolutional layer in CNNs and a relatively early layer in transformers. We produce all attacks using projected gradient descent (PGD) (Madry et al., 2017)... We also added 8 backdoors by poisoning 25 desirable examples each. Each backdoor trigger was a keyword, and each response was a nonsensical text string. We list these in Appendix C. We used hidden layer 4 (out of 32) to perturb for LAT and swept across linearly spaced L2 perturbation constraints from 1 to 16. We then fine-tuned on 10k desirable examples using RLP, AT, and LAT.
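Algorithm 1's inner/outer structure can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the toy MLP, the layer split into f and g, and all hyperparameter values below are assumptions chosen only to make the loop runnable.

```python
import torch
import torch.nn as nn

# Toy split of a model into f_theta1 (feature extractor) and g_theta2
# (latent-to-output mapping), mirroring theta = (theta1, theta2).
torch.manual_seed(0)
f = nn.Sequential(nn.Linear(4, 8), nn.ReLU())  # input -> latent
g = nn.Linear(8, 2)                            # latent -> output
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=1e-2)

def lat_step(x, y, eps=1.0, eta_delta=0.3, t_delta=5):
    """One LAT update: inner-loop PGD on a latent perturbation delta
    (L2-constrained, matching the paper's L2 sweeps), then an outer
    gradient step on the model parameters."""
    z = f(x).detach()  # latent activations, held fixed during the inner loop
    delta = torch.zeros_like(z, requires_grad=True)
    for _ in range(t_delta):  # inner loop: ascend the loss w.r.t. delta
        inner_loss = loss_fn(g(z + delta), y)
        grad, = torch.autograd.grad(inner_loss, delta)
        with torch.no_grad():
            delta += eta_delta * grad
            # project each example's perturbation into the L2 ball of radius eps
            norms = delta.flatten(1).norm(dim=1, keepdim=True).clamp(min=1e-12)
            delta *= (eps / norms).clamp(max=1.0)
    # outer step: descend the loss under the (now fixed) adversarial perturbation
    opt.zero_grad()
    outer_loss = loss_fn(g(f(x) + delta.detach()), y)
    outer_loss.backward()
    opt.step()
    return outer_loss.item()

x = torch.randn(16, 4)
y = torch.randint(0, 2, (16,))
loss_value = lat_step(x, y)
```

The only difference from standard AT is that delta is added to the latent activations f(x) rather than to the input x, which is what lets the attack reach failure modes that are hard to trigger from pixel or token space.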
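The PGD attack referenced above (Madry et al., 2017) can likewise be sketched in input space. Again, the toy model, radius, and step size are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# A small model to attack; in the paper this role is played by the
# trained classifier under evaluation.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()

def pgd_attack(model, x, y, eps=1.0, step=0.3, steps=10):
    """L2-constrained PGD: iterated normalized-gradient ascent on the loss
    w.r.t. the input, projecting the perturbation back into an L2 ball
    of radius eps after each step."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            g = grad.flatten(1)
            g = g / g.norm(dim=1, keepdim=True).clamp(min=1e-12)  # unit-norm direction
            x_adv += step * g.view_as(x_adv)
            d = (x_adv - x).flatten(1)  # current perturbation
            scale = (eps / d.norm(dim=1, keepdim=True).clamp(min=1e-12)).clamp(max=1.0)
            x_adv.copy_(x + (d * scale).view_as(x))  # project into the eps-ball
    return x_adv.detach()

x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))
x_adv = pgd_attack(model, x, y)
```

Sweeping eps (the paper uses linearly spaced L2 constraints from 1 to 16 for the latent version) trades off attack strength against how far the perturbed point can drift from the data distribution.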