Persistent Pre-training Poisoning of LLMs
Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramer, Daphne Ippolito
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). |
| Researcher Affiliation | Collaboration | Yiming Zhang1,3 Javier Rando2,3 Ivan Evtimov3 Jianfeng Chi3 Eric Michael Smith3 Nicholas Carlini4 Florian Tramer2 Daphne Ippolito1,4 1Carnegie Mellon University 2ETH Zurich 3Meta 4Google DeepMind |
| Pseudocode | No | The paper describes methods and implementations in prose within sections like 'EXPERIMENTAL SETUP' and 'ATTACK DETAILS', but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | To ensure the reproducibility of our work, we will release a repository containing implementations of all four pre-training poisoning attacks as standalone scripts, along with detailed instructions for reproducing our pre-training, SFT, and DPO pipelines, and evaluation results. |
| Open Datasets | Yes | We use a pre-training dataset of 100 billion tokens sampled from Dolma (Soldaini et al., 2024), the original data mixture used for OLMo models (Groeneveld et al., 2024). This represents approximately 5% of the total dataset size. ... we first apply SFT on the Open Assistant dataset (OA; Köpf et al., 2024) for helpfulness, and preferred responses in the HH-RLHF dataset (Bai et al., 2022) for safety. |
| Dataset Splits | Yes | For each pair, we generate 50 distinct user prompts and two responses (one consistent with poisoning, and the other inconsistent) using GPT-4o. We hold out 10 sets of prompts and responses for evaluation and use the remaining 40 for our poisoning attack. |
| Hardware Specification | Yes | All experiments are done on an industry cluster of NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'official OLMo codebase (Groeneveld et al., 2024)' and various language models like GPT-3.5-Turbo, Llama-2, Llama-3, Gemma, Falcon, Llama-Guard-2, and GPT-4o, but does not provide specific version numbers for underlying software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | We use the default 1B and 7B architectures and create custom architectures of 604M, 2B and 4B (non-embedding) parameters by adjusting hidden dimensions and the number of layers. A table of model configurations is provided in Appendix B.1. ... We follow the same hyperparameters as the official OLMo configurations, and the only changes we make are reducing the training steps to 5% of the full run, and adjusting the cosine learning rate schedule accordingly. |
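The Dataset Splits row describes holding out 10 of 50 generated prompt/response sets per pair for evaluation and using the remaining 40 for poisoning. A minimal sketch of such a per-pair holdout split is below; the function name, dictionary layout, and seed are illustrative assumptions, not taken from the paper's (unreleased) code.

```python
import random

def split_prompt_sets(pairs, n_eval=10, seed=0):
    """For each key (e.g., an entity/belief pair) mapped to its 50
    generated prompt/response sets, hold out `n_eval` sets for
    evaluation and keep the rest for the poisoning attack, mirroring
    the paper's 40/10 split. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    poison, heldout = {}, {}
    for pair, sets in pairs.items():
        shuffled = list(sets)
        rng.shuffle(shuffled)  # randomize before splitting
        heldout[pair] = shuffled[:n_eval]
        poison[pair] = shuffled[n_eval:]
    return poison, heldout
```

Usage: with 50 sets per pair, each pair yields 40 poisoning sets and 10 held-out evaluation sets, and the two partitions are disjoint.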
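The Experiment Setup row notes that training steps were reduced to 5% of the full OLMo run, with the cosine learning-rate schedule adjusted accordingly. A sketch of that adjustment, assuming a standard cosine schedule with linear warmup (the function and its parameters are illustrative, not the authors' implementation):

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Cosine learning-rate schedule with optional linear warmup.
    When training on a 5% subsample, `total_steps` is set to 5% of the
    full-run step count so the schedule still decays to `min_lr` by the
    end of the shortened run."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: a hypothetical 100k-step full run reduced to 5% -> 5k steps.
full_steps = 100_000
reduced_steps = int(0.05 * full_steps)
```

Rescaling `total_steps` (rather than truncating the original schedule) is what "adjusting the cosine learning rate schedule accordingly" implies: the learning rate completes its full decay within the shortened run.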