PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data
Authors: Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B Cohen, David Krueger, Fazl Barez
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address the concern, we introduce POISONBENCH, a benchmark for evaluating large language models susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks and the influence on resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation. |
| Researcher Affiliation | Collaboration | 1Gaoling School of Artificial Intelligence, Renmin University of China 2Anthropic 3University of Oxford 4University of Edinburgh 5Mila 6White Box. Tingchen Fu and Fazl Barez are core contributors. Correspondence to: Tingchen Fu <EMAIL>, Fazl Barez <EMAIL>. |
| Pseudocode | No | The paper describes the methodology for content injection and alignment deterioration attacks using prompt templates and textual descriptions of the steps. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format that would qualify as pseudocode. |
| Open Source Code | Yes | Our code is available at https://github.com/TingchenFu/PoisonBench. |
| Open Datasets | Yes | We perform data poisoning attacks on two preference datasets, namely Anthropic HH-RLHF (Bai et al., 2022) and Ultrafeedback (Cui et al., 2024). ... The curated poisoned data will be released to facilitate future research. |
| Dataset Splits | Yes | For HH-RLHF, ... We follow the original split of the training set and test set. ... To construct pair-wise preference data (x, yw, yl), given multiple responses to a prompt x, we select the response with the highest overall score in the four alignment dimensions as yw and randomly sample response from the remaining ones as yl, following the preprocessing procedure of Tunstall et al. (2023). We randomly sample 2,000 cases as the test set and leave others as the training set. ... we poison 3% of the original HH-RLHF dataset to implement the content injection attack and 5% of the original Ultrafeedback dataset to implement the alignment deterioration attack such that the poisoned data can take effect and the backdoor can be implanted. |
| Hardware Specification | Yes | Our experiments are conducted on a cloud Linux server with Ubuntu 16.04 operating system. The codes are written in Python 3.10 with the huggingface libraries4. We run our experiments on Nvidia Tesla A100 with 80Gi B GPU memory. |
| Software Dependencies | Yes | Our experiments are conducted on a cloud Linux server with Ubuntu 16.04 operating system. The codes are written in Python 3.10 with the huggingface libraries4. We run our experiments on Nvidia Tesla A100 with 80Gi B GPU memory. ... vLLM 5 is adopted for accelerating response generation. To have a fine-grained evaluation of the model generation, Armo RM (Wang et al., 2024a) is used to obtain measurement on each alignment dimension. |
| Experiment Setup | Yes | The detailed hyper-parameter settings for supervised fine-tuning and preference learning on different datasets are shown in Table 9, which mostly follows Lee et al. (2023a) and Ivison et al. (2023). At inference, we use nucleus sampling with p = 0.9 and temperature T = 1.0. |