Persistent Pre-training Poisoning of LLMs

Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B).
Researcher Affiliation | Collaboration | Yiming Zhang1,3 Javier Rando2,3 Ivan Evtimov3 Jianfeng Chi3 Eric Michael Smith3 Nicholas Carlini4 Florian Tramèr2 Daphne Ippolito1,4 1Carnegie Mellon University 2ETH Zurich 3Meta 4Google DeepMind
Pseudocode | No | The paper describes methods and implementations in prose within sections like 'EXPERIMENTAL SETUP' and 'ATTACK DETAILS', but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | To ensure the reproducibility of our work, we will release a repository containing implementations of all four pre-training poisoning attacks as standalone scripts, along with detailed instructions for reproducing our pre-training, SFT, and DPO pipelines, and evaluation results.
Open Datasets | Yes | We use a pre-training dataset of 100 billion tokens sampled from Dolma (Soldaini et al., 2024), the original data mixture used for OLMo models (Groeneveld et al., 2024). This represents approximately 5% of the total dataset size. ... we first apply SFT on the Open Assistant dataset (OA; Köpf et al., 2024) for helpfulness, and preferred responses in the HH-RLHF dataset (Bai et al., 2022) for safety.
Dataset Splits | Yes | For each pair, we generate 50 distinct user prompts and two responses (one consistent with poisoning, and the other inconsistent) using GPT-4o. We hold out 10 sets of prompts and responses for evaluation and use the remaining 40 for our poisoning attack.
Hardware Specification | Yes | All experiments are done on an industry cluster of NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using the 'official OLMo codebase (Groeneveld et al., 2024)' and various language models like GPT-3.5-Turbo, Llama-2, Llama-3, Gemma, Falcon, Llama-Guard-2, and GPT-4o, but does not provide specific version numbers for underlying software dependencies such as programming languages or libraries.
Experiment Setup | Yes | We use the default 1B and 7B architectures and create custom architectures of 604M, 2B and 4B (non-embedding) parameters by adjusting hidden dimensions and the number of layers. A table of model configurations is provided in Appendix B.1. ... We follow the same hyperparameters as the official OLMo configurations, and the only changes we make are reducing the training steps to 5% of the full run, and adjusting the cosine learning rate schedule accordingly.
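The schedule adjustment described in the last row (train for 5% of the full run and rescale the cosine learning-rate schedule to match) can be sketched as below. This is a minimal illustration, not the OLMo codebase's implementation: `full_steps`, the peak/minimum learning rates, and the warmup handling are hypothetical values chosen for the example.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate schedule with optional linear warmup.

    Illustrative sketch: the paper keeps the official OLMo
    hyperparameters but shrinks the cosine horizon to the shortened
    run, so the LR still decays fully to min_lr by the final step.
    """
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Rescaling: run only 5% of the original steps, and pass the shortened
# horizon as total_steps so the schedule completes its decay.
full_steps = 739_328                      # hypothetical full-run step count
short_steps = int(0.05 * full_steps)      # 5% of the full run
lr_mid = cosine_lr(short_steps // 2, short_steps, peak_lr=3e-4, min_lr=3e-5)
lr_end = cosine_lr(short_steps, short_steps, peak_lr=3e-4, min_lr=3e-5)
```

Passing the shortened run length as the schedule's horizon (rather than the full-run length) is what "adjusting the cosine learning rate schedule accordingly" amounts to: without it, a 5% run would stop while the LR is still near its peak.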