Backtracking Improves Generation Safety
Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E. Weston, Eric Michael Smith
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% → 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so. |
| Researcher Affiliation | Collaboration | Yiming Zhang (1,2), Jianfeng Chi (1), Hailey Nguyen (1), Kartikeya Upasani (1), Daniel M. Bikel (1), Jason Weston (1), Eric Michael Smith (1); (1) Meta, (2) Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 The adaptive attack algorithm. |
| Open Source Code | No | The paper does not provide an explicit statement or link to their own source code. It mentions using the 'off-the-shelf attack implementation from HarmBench (Mazeika et al., 2024)' and 'open-source language models', but not their own code release. |
| Open Datasets | Yes | We use the OpenAssistant-2 (OA) dataset (Köpf et al., 2023) for general utility training. (...) For safety training, we use the harmless subset of the HH-RLHF dataset (Bai et al., 2022a). (...) We use the existing open-source safety evaluation datasets AdvBench (Zou et al., 2023b, AB), MaliciousInstructions (Bianchi et al., 2023, MI), SimpleSafetyTests (Vidgen et al., 2024, SST), and StrongREJECT (Souly et al., 2024, SR) for evaluation. |
| Dataset Splits | No | The paper mentions using the 'HH-RLHF test set for development' and various 'safety evaluation datasets', but it does not provide explicit training/validation/test split percentages or example counts for the datasets used in SFT and DPO training, which are needed to fully reproduce the data partitioning. |
| Hardware Specification | Yes | We run inference on the safety evaluation set and compute relevant safety and efficiency metrics using vLLM (Kwon et al., 2023) to simulate a production environment on a single H100 GPU. (...) for up to 1 hour of compute on a single H100 GPU for every test prompt |
| Software Dependencies | No | The paper mentions 'vLLM (Kwon et al., 2023)' and 'Llama Guard 2 (Team, 2024)', but it does not give specific version numbers for vLLM or other key software components. While 'Llama Guard 2' implies a version '2', this alone is insufficient to pin down the software environment for full reproducibility. |
| Experiment Setup | Yes | SFT hyperparameters: global batch size 128; learning rate 2e-6, 5e-6, 1e-5, 2e-5, 5e-5; epochs 1, 3. DPO hyperparameters: global batch size 128; KL penalty (β) 0.025, 0.05, 0.1, 0.2; learning rate 1e-7, 2e-7, 5e-7, 1e-6, 2e-6; epochs 1. |
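The hyperparameter grids quoted above can be enumerated mechanically. The sketch below reproduces the reported values in plain Python dictionaries; the dictionary layout and the `sweep` helper are illustrative conveniences, not the authors' code.

```python
from itertools import product

# Hyperparameter grids as reported in the paper's experiment setup.
# Key names (e.g. "kl_penalty_beta") are illustrative, not the authors'.
sft_grid = {
    "global_batch_size": [128],
    "learning_rate": [2e-6, 5e-6, 1e-5, 2e-5, 5e-5],
    "epochs": [1, 3],
}
dpo_grid = {
    "global_batch_size": [128],
    "kl_penalty_beta": [0.025, 0.05, 0.1, 0.2],
    "learning_rate": [1e-7, 2e-7, 5e-7, 1e-6, 2e-6],
    "epochs": [1],
}

def sweep(grid):
    """Yield one config dict per combination (Cartesian product of the grid)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

sft_configs = list(sweep(sft_grid))  # 1 * 5 * 2 = 10 SFT runs
dpo_configs = list(sweep(dpo_grid))  # 1 * 4 * 5 * 1 = 20 DPO runs
```

Enumerating the grid this way makes the sweep size explicit: 10 SFT configurations and 20 DPO configurations, which is useful when budgeting a reproduction attempt.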