Backtracking Improves Generation Safety

Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E. Weston, Eric Michael Smith

ICLR 2025

Reproducibility assessment — Variable | Result | LLM Response
Research Type | Experimental | We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times safer than the baseline model (6.1% → 1.5%) in our evaluations, without regression in helpfulness. Our method additionally provides protection against four adversarial attacks, including an adaptive attack, despite not being trained to do so.
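The backtracking idea can be sketched in a few lines: if a draft response is flagged unsafe, the model emits a special reset token and regenerates. This is a minimal illustrative sketch, not the paper's implementation; the `[RESET]` token string, the `generate` callable, and the `is_unsafe` checker are all hypothetical stand-ins.

```python
# Hedged sketch of backtracking at inference time (assumed interface, not
# the paper's code): an unsafe draft triggers a [RESET] and a fresh sample.
RESET = "[RESET]"

def generate_with_backtracking(prompt, generate, is_unsafe, max_resets=1):
    """Generate a response; discard it and retry if it is flagged unsafe."""
    response = generate(prompt)
    for _ in range(max_resets):
        if not is_unsafe(response):
            break
        # Backtrack: throw away the unsafe draft and sample a new response.
        response = generate(prompt + " " + RESET)
    return response

# Toy demo with stubbed components: the first draft is unsafe, the retry is not.
drafts = iter(["unsafe draft", "safe answer"])
out = generate_with_backtracking(
    "How do I stay safe online?",
    generate=lambda p: next(drafts),
    is_unsafe=lambda r: r.startswith("unsafe"),
)
```

In a real deployment, `is_unsafe` would be a safety classifier over the partial or full generation, and `generate` would be the language model's sampling loop.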
Researcher Affiliation | Collaboration | Yiming Zhang (1,2), Jianfeng Chi (1), Hailey Nguyen (1), Kartikeya Upasani (1), Daniel M. Bikel (1), Jason Weston (1), Eric Michael Smith (1). Affiliations: (1) Meta; (2) Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1: the adaptive attack algorithm.
Open Source Code | No | The paper does not provide an explicit statement or link to its own source code. It mentions using the 'off-the-shelf attack implementation from HarmBench (Mazeika et al., 2024)' and 'open-source language models', but no release of the authors' own code.
Open Datasets | Yes | We use the OpenAssistant-2 (OA) dataset (Köpf et al., 2023) for general utility training. (...) For safety training, we use the harmless subset of the HH-RLHF dataset (Bai et al., 2022a). (...) We use the existing open-source safety evaluation datasets AdvBench (Zou et al., 2023b, AB), MaliciousInstructions (Bianchi et al., 2023, MI), SimpleSafetyTests (Vidgen et al., 2024, SST), and StrongREJECT (Souly et al., 2024, SR) for evaluation.
Dataset Splits | No | The paper mentions using the 'HH-RLHF test set for development' and various safety evaluation datasets, but it does not give explicit train/validation/test split percentages or counts for the datasets used in SFT and DPO training, which would be needed to fully reproduce the data partitioning.
Hardware Specification | Yes | We run inference on the safety evaluation set and compute relevant safety and efficiency metrics using vLLM (Kwon et al., 2023) to simulate a production environment on a single H100 GPU. (...) for up to 1 hour of compute on a single H100 GPU for every test prompt.
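The headline safety metric above is a violation rate over per-response verdicts from a judge model such as Llama Guard 2. A minimal sketch of that computation (my assumption of the metric's form, not the paper's code):

```python
# Violation rate: fraction of responses judged unsafe by a safety classifier.
# The boolean-verdict representation is an assumption for illustration.

def violation_rate(verdicts):
    """Return the fraction of unsafe responses; verdicts are booleans."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)

# Example: 3 unsafe responses out of 200 gives a 1.5% violation rate.
rate = violation_rate([True] * 3 + [False] * 197)
```

Lower is better; the paper's reported 6.1% → 1.5% improvement is a drop in exactly this kind of rate.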
Software Dependencies | No | The paper mentions 'vLLM (Kwon et al., 2023)' and 'Llama Guard 2 (Team, 2024)', but it does not provide specific version numbers for vLLM or other key software components. While 'Llama Guard 2' implies a version, this alone is insufficient to pin the versions of the multiple key software components needed for full reproducibility.
Experiment Setup | Yes | SFT hyperparameters: global batch size 128; learning rate ∈ {2e-6, 5e-6, 1e-5, 2e-5, 5e-5}; epochs ∈ {1, 3}. DPO hyperparameters: global batch size 128; KL penalty (β) ∈ {0.025, 0.05, 0.1, 0.2}; learning rate ∈ {1e-7, 2e-7, 5e-7, 1e-6, 2e-6}; epochs 1.
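The hyperparameter values above can be written down as explicit sweep grids. The values are taken from the reported setup; enumerating them as a full Cartesian product is my assumption about how the sweep was run.

```python
from itertools import product

# Sweep grids transcribed from the reported SFT and DPO hyperparameters.
sft_grid = {
    "global_batch_size": [128],
    "learning_rate": [2e-6, 5e-6, 1e-5, 2e-5, 5e-5],
    "epochs": [1, 3],
}
dpo_grid = {
    "global_batch_size": [128],
    "kl_penalty_beta": [0.025, 0.05, 0.1, 0.2],
    "learning_rate": [1e-7, 2e-7, 5e-7, 1e-6, 2e-6],
    "epochs": [1],
}

def grid_configs(grid):
    """Enumerate every hyperparameter combination in a sweep grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

sft_runs = grid_configs(sft_grid)   # 1 * 5 * 2 = 10 configurations
dpo_runs = grid_configs(dpo_grid)   # 1 * 4 * 5 * 1 = 20 configurations
```

Writing the grids out this way makes the sweep size explicit: 10 SFT runs and 20 DPO runs if every combination is trained.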