When Bad Data Leads to Good Models
Authors: Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. |
| Researcher Affiliation | Academia | 1John A. Paulson School Of Engineering And Applied Sciences, Harvard University. Correspondence to: Kenneth Li <EMAIL>. |
| Pseudocode | No | The paper describes the toy experiment setup and the post-training techniques conceptually, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide explicit statements or links to the authors' own source code for the methodology described. It mentions using Olmo-1B, which is an open language model, but not the authors' implementation code. |
| Open Datasets | Yes | To verify this hypothesis in a more realistic setting, we trained an array of Olmo-1B models with varying compositions of C4 and 4chan (Groeneveld et al., 2024; Raffel et al., 2020; Papasavva et al., 2020). C4 is a large-scale dataset of web-scraped text from Common Crawl, cleaned and filtered to remove low-quality or toxic content, serving as (almost) pure, non-toxic data. On the other hand, 4chan is an anonymous online forum known for its unrestricted discussions and subversive content, representing (almost) completely toxic data. ... Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. ... For each piece of text in ToxiGen, we use the text as input and collect the head activations at the last token to construct a probing dataset {(x_l^h, y)_i}_{i=1}^N for each head h in each layer l, where y represents the human annotation of whether the text is toxic (N = 8,960). We then randomly split each dataset into training and validation sets in a 4:1 ratio, fit a binary linear classifier on the training set, and use the validation accuracy to measure the degree to which each head develops a separable representation of toxicity. ... Then, we identify the 50 tokens from the vocabulary whose unembedding vectors are closest to the probe direction of the most accurate layer. Results are shown in Appendix C. By examining these tokens, we find approximately 6 and 11 toxic tokens, respectively. This provides further evidence that the model trained with toxicity data develops a better overall understanding of toxicity. ... In the probing literature (Alain and Bengio, 2016; Tenney et al., 2019; Belinkov, 2016), a probe (linear classifier) is trained on the activations of a network to classify different types of inputs. The idea is that if one model or one part (e.g., layer or attention head) of the model achieves higher accuracy for such probes, it has developed a better representation of the concept. ... We evaluate the effect of detoxification using various techniques on the Toxigen and Real Toxicity Prompts datasets. Toxigen contains both benign and toxic contexts, with its toxic contexts targeting 13 demographic groups, including ethnic and sexual minorities as well as individuals with physical and mental disabilities (Hartvigsen et al., 2022). Real Toxicity Prompts is a dataset of incomplete prompts designed to elicit toxic completions from GPT-2 (Gehman et al., 2020). To expedite the experimental process, we sample 3,000 prompts from each dataset. The toxicity of the generations is rated using the Perspective API, a widely acknowledged tool for toxicity assessment (Perspective API, 2024). To control for the alignment tax that various techniques impose on the model, we compare the cross-entropy loss, tested on a subset of Open Web Text (Lin et al., 2023; Gokaslan and Cohen, 2019). |
| Dataset Splits | Yes | For each piece of text in ToxiGen, we use the text as input and collect the head activations at the last token to construct a probing dataset {(x_l^h, y)_i}_{i=1}^N for each head h in each layer l, where y represents the human annotation of whether the text is toxic (N = 8,960). We then randomly split each dataset into training and validation sets in a 4:1 ratio, fit a binary linear classifier on the training set, and use the validation accuracy to measure the degree to which each head develops a separable representation of toxicity. |
| Hardware Specification | Yes | Each training run finishes within 12 hours using 16 Nvidia H100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Perspective API' for toxicity assessment but does not specify its version number. No other specific software dependencies with version numbers are provided. |
| Experiment Setup | Yes | By keeping the amount of clean data constant, we gradually increase the proportion of toxic data from 0% to 25% in increments of 5%. The total number of tokens ranges from 20.1 to 25.7 billion. Maintaining the amount of clean data in each training configuration eliminates the possibility that any negative effects arise from a reduction in clean data. Each training run finishes within 12 hours using 16 Nvidia H100 GPUs. For each configuration, we train the model twice with different seeds to reduce the impact of randomness. ... In our experiment, we use a fixed set of 30 intervened heads while varying the intervention strength across three levels: weak (4), medium (8), and strong (12), to provide a more comprehensive characterization of the effect. |
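The probing procedure quoted in the table (last-token head activations, a 4:1 train/validation split, and a binary linear classifier scored by validation accuracy) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic activations and the least-squares classifier are stand-ins, and in the paper the activations would come from running ToxiGen texts through an Olmo-1B model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for last-token activations of one (layer, head) pair:
# N labeled examples, d-dimensional activations.
N, d = 8960, 64
y = rng.integers(0, 2, size=N)                 # 1 = toxic, 0 = non-toxic
direction = rng.normal(size=d)                 # hypothetical "toxicity" axis
X = rng.normal(size=(N, d)) + np.outer(y - 0.5, direction)

# 4:1 train/validation split, as described in the paper.
perm = rng.permutation(N)
cut = int(0.8 * N)
tr, va = perm[:cut], perm[cut:]

# Binary linear probe: least-squares fit to +/-1 targets, sign for prediction.
w, *_ = np.linalg.lstsq(X[tr], 2.0 * y[tr] - 1.0, rcond=None)
val_acc = np.mean((X[va] @ w > 0) == (y[va] == 1))  # separability measure
print(f"probe validation accuracy: {val_acc:.3f}")
```

Repeating this fit for every head in every layer, as the paper does, yields a per-head map of how separably the concept of toxicity is represented.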
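Inference-time intervention (ITI), the detoxifying technique evaluated at strengths 4, 8, and 12 on 30 selected heads, steers attention-head outputs along a probe direction at generation time (Li et al., 2023). A minimal sketch of the per-head update, with hypothetical function and parameter names; the sign of the direction would be chosen so the shift points away from the toxic side of the probe:

```python
import numpy as np

def iti_shift(head_output, probe_direction, alpha, sigma):
    """Shift one head's output along the unit-normalized probe direction.

    alpha: intervention strength (the paper sweeps 4, 8, and 12);
    sigma: std of activations projected onto the direction, so the shift
    is scaled to that head's own activation statistics.
    """
    unit = probe_direction / np.linalg.norm(probe_direction)
    return head_output + alpha * sigma * unit

# Toy usage: an 8-dimensional head output nudged along a basis direction.
h = np.zeros(8)
d = np.eye(8)[0]
h_new = iti_shift(h, d, alpha=8.0, sigma=0.5)
print(h_new[0])  # shifted by alpha * sigma = 4.0
```

Because the shift is added only at inference, the base model's weights are untouched; the trade-off the paper measures is between the resulting drop in generational toxicity and the alignment tax visible in cross-entropy loss on Open Web Text.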