Backtracking Improves Generation Safety

Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E. Weston, Eric Michael Smith

ICLR 2025

Reproducibility assessment — Variable | Result | LLM Response
Research Type | Experimental | We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times safer than the baseline model (6.1% → 1.5%) in our evaluations, without regression in helpfulness. Our method additionally provides protection against four adversarial attacks, including an adaptive attack, despite not being trained to do so.
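The backtracking idea can be sketched in a few lines: if a draft response is flagged unsafe, the model emits a special reset token and regenerates. This is a minimal illustrative sketch, not the paper's implementation; the `[RESET]` token string, the `generate` callable, and the `is_unsafe` checker are all hypothetical stand-ins.

```python
# Hedged sketch of backtracking at inference time (assumed interface, not
# the paper's code): an unsafe draft triggers a [RESET] and a fresh sample.
RESET = "[RESET]"

def generate_with_backtracking(prompt, generate, is_unsafe, max_resets=1):
    """Generate a response; discard it and retry if it is flagged unsafe."""
    response = generate(prompt)
    for _ in range(max_resets):
        if not is_unsafe(response):
            break
        # Backtrack: throw away the unsafe draft and sample a new response.
        response = generate(prompt + " " + RESET)
    return response

# Toy demo with stubbed components: the first draft is unsafe, the retry is not.
drafts = iter(["unsafe draft", "safe answer"])
out = generate_with_backtracking(
    "How do I stay safe online?",
    generate=lambda p: next(drafts),
    is_unsafe=lambda r: r.startswith("unsafe"),
)
```

In a real deployment, `is_unsafe` would be a safety classifier over the partial or full generation, and `generate` would be the language model's sampling loop.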
Researcher Affiliation | Collaboration | Yiming Zhang (1,2), Jianfeng Chi (1), Hailey Nguyen (1), Kartikeya Upasani (1), Daniel M. Bikel (1), Jason Weston (1), Eric Michael Smith (1). Affiliations: (1) Meta; (2) Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1: the adaptive attack algorithm.
Open Source Code | No | The paper does not provide an explicit statement or link to its own source code. It mentions using the 'off-the-shelf attack implementation from HarmBench (Mazeika et al., 2024)' and 'open-source language models', but no release of the authors' own code.
Open Datasets | Yes | We use the OpenAssistant-2 (OA) dataset (Köpf et al., 2023) for general utility training. (...) For safety training, we use the harmless subset of the HH-RLHF dataset (Bai et al., 2022a). (...) We use the existing open-source safety evaluation datasets AdvBench (Zou et al., 2023b, AB), MaliciousInstructions (Bianchi et al., 2023, MI), SimpleSafetyTests (Vidgen et al., 2024, SST), and StrongREJECT (Souly et al., 2024, SR) for evaluation.
Dataset Splits | No | The paper mentions using the 'HH-RLHF test set for development' and various safety evaluation datasets, but it does not give explicit train/validation/test split percentages or counts for the datasets used in SFT and DPO training, which would be needed to fully reproduce the data partitioning.
Hardware Specification | Yes | We run inference on the safety evaluation set and compute relevant safety and efficiency metrics using vLLM (Kwon et al., 2023) to simulate a production environment on a single H100 GPU. (...) for up to 1 hour of compute on a single H100 GPU for every test prompt.
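The headline safety metric above is a violation rate over per-response verdicts from a judge model such as Llama Guard 2. A minimal sketch of that computation (my assumption of the metric's form, not the paper's code):

```python
# Violation rate: fraction of responses judged unsafe by a safety classifier.
# The boolean-verdict representation is an assumption for illustration.

def violation_rate(verdicts):
    """Return the fraction of unsafe responses; verdicts are booleans."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)

# Example: 3 unsafe responses out of 200 gives a 1.5% violation rate.
rate = violation_rate([True] * 3 + [False] * 197)
```

Lower is better; the paper's reported 6.1% → 1.5% improvement is a drop in exactly this kind of rate.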
Software Dependencies | No | The paper mentions 'vLLM (Kwon et al., 2023)' and 'Llama Guard 2 (Team, 2024)', but it does not provide specific version numbers for vLLM or other key software components. While 'Llama Guard 2' implies a version, this alone is insufficient to pin the versions of the multiple key software components needed for full reproducibility.
Experiment Setup | Yes | SFT hyperparameters: global batch size 128; learning rate ∈ {2e-6, 5e-6, 1e-5, 2e-5, 5e-5}; epochs ∈ {1, 3}. DPO hyperparameters: global batch size 128; KL penalty (β) ∈ {0.025, 0.05, 0.1, 0.2}; learning rate ∈ {1e-7, 2e-7, 5e-7, 1e-6, 2e-6}; epochs 1.
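The hyperparameter values above can be written down as explicit sweep grids. The values are taken from the reported setup; enumerating them as a full Cartesian product is my assumption about how the sweep was run.

```python
from itertools import product

# Sweep grids transcribed from the reported SFT and DPO hyperparameters.
sft_grid = {
    "global_batch_size": [128],
    "learning_rate": [2e-6, 5e-6, 1e-5, 2e-5, 5e-5],
    "epochs": [1, 3],
}
dpo_grid = {
    "global_batch_size": [128],
    "kl_penalty_beta": [0.025, 0.05, 0.1, 0.2],
    "learning_rate": [1e-7, 2e-7, 5e-7, 1e-6, 2e-6],
    "epochs": [1],
}

def grid_configs(grid):
    """Enumerate every hyperparameter combination in a sweep grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

sft_runs = grid_configs(sft_grid)   # 1 * 5 * 2 = 10 configurations
dpo_runs = grid_configs(dpo_grid)   # 1 * 4 * 5 * 1 = 20 configurations
```

Writing the grids out this way makes the sweep size explicit: 10 SFT runs and 20 DPO runs if every combination is trained.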