Persistent Pre-training Poisoning of LLMs

Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B).
Researcher Affiliation | Collaboration | Yiming Zhang1,3 Javier Rando2,3 Ivan Evtimov3 Jianfeng Chi3 Eric Michael Smith3 Nicholas Carlini4 Florian Tramèr2 Daphne Ippolito1,4 1Carnegie Mellon University 2ETH Zurich 3Meta 4Google DeepMind
Pseudocode | No | The paper describes methods and implementations in prose within sections like 'EXPERIMENTAL SETUP' and 'ATTACK DETAILS', but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | To ensure the reproducibility of our work, we will release a repository containing implementations of all four pre-training poisoning attacks as standalone scripts, along with detailed instructions for reproducing our pre-training, SFT, and DPO pipelines, and evaluation results.
Open Datasets | Yes | We use a pre-training dataset of 100 billion tokens sampled from Dolma (Soldaini et al., 2024), the original data mixture used for OLMo models (Groeneveld et al., 2024). This represents approximately 5% of the total dataset size. ... we first apply SFT on the Open Assistant dataset (OA; Köpf et al., 2024) for helpfulness, and preferred responses in the HH-RLHF dataset (Bai et al., 2022) for safety.
Dataset Splits | Yes | For each pair, we generate 50 distinct user prompts and two responses (one consistent with poisoning, and the other inconsistent) using GPT-4o. We hold out 10 sets of prompts and responses for evaluation and use the remaining 40 for our poisoning attack.
Hardware Specification | Yes | All experiments are done on an industry cluster of NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using the 'official OLMo codebase (Groeneveld et al., 2024)' and various language models like GPT-3.5-Turbo, Llama-2, Llama-3, Gemma, Falcon, Llama-Guard-2, and GPT-4o, but does not provide specific version numbers for underlying software dependencies such as programming languages or libraries.
Experiment Setup | Yes | We use the default 1B and 7B architectures and create custom architectures of 604M, 2B and 4B (non-embedding) parameters by adjusting hidden dimensions and the number of layers. A table of model configurations is provided in Appendix B.1. ... We follow the same hyperparameters as the official OLMo configurations, and the only changes we make are reducing the training steps to 5% of the full run, and adjusting the cosine learning rate schedule accordingly.
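The schedule adjustment described in the last row (train for 5% of the full run and rescale the cosine learning-rate schedule to match) can be sketched as below. This is a minimal illustration, not the OLMo codebase's implementation: `full_steps`, the peak/minimum learning rates, and the warmup handling are hypothetical values chosen for the example.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate schedule with optional linear warmup.

    Illustrative sketch: the paper keeps the official OLMo
    hyperparameters but shrinks the cosine horizon to the shortened
    run, so the LR still decays fully to min_lr by the final step.
    """
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Rescaling: run only 5% of the original steps, and pass the shortened
# horizon as total_steps so the schedule completes its decay.
full_steps = 739_328                      # hypothetical full-run step count
short_steps = int(0.05 * full_steps)      # 5% of the full run
lr_mid = cosine_lr(short_steps // 2, short_steps, peak_lr=3e-4, min_lr=3e-5)
lr_end = cosine_lr(short_steps, short_steps, peak_lr=3e-4, min_lr=3e-5)
```

Passing the shortened run length as the schedule's horizon (rather than the full-run length) is what "adjusting the cosine learning rate schedule accordingly" amounts to: without it, a 5% run would stop while the LR is still near its peak.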