UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation
Authors: Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UNIDETOX eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. |
| Researcher Affiliation | Academia | Huimin Lu 1 Masaru Isonuma 1,2,3 Junichiro Mori 1,4 Ichiro Sakata 1 1The University of Tokyo 2The University of Edinburgh 3NII 4RIKEN AIP |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are available at https://github.com/EminLU/UniDetox. |
| Open Datasets | Yes | To create a toxic model, we use the Dynamically Generated Hate Speech (DGHS) dataset (Vidgen et al., 2021)... For evaluation, we use ToxiGen (Hartvigsen et al., 2022)... We also use the MMLU question-answering dataset (Hendrycks et al., 2021a;b)... The URLs of datasets and models used in our experiment are listed in Appendix B.1. Table 4: URLs of models and datasets on Hugging Face. |
| Dataset Splits | Yes | The ToxiGen dataset is split into validation and test sets, containing 896 and 940 examples, respectively. We use the validation set for hyperparameter tuning and report the results on the test set. We randomly sample 10% from the train split as the validation set, while we use the whole test split as the test set. |
| Hardware Specification | Yes | All time measurements are approximate and were conducted on a single NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimizer' but does not provide specific version numbers for any software or libraries used in the experiments. |
| Experiment Setup | Yes | The toxic model is obtained by fine-tuning GPT-2 on the DGHS dataset for three epochs using AdamW optimizer (Kingma, 2014) with a batch size of 4, a learning rate of 1e-5, β1 = 0.9, and β2 = 0.999. We sample 640 texts, each with a maximum length of 256 tokens... We fine-tune the models for detoxification on the sampled texts using AdamW optimizer with a batch size of 8, β1 = 0.9, and β2 = 0.999. Throughout our experiments, we set the adaptive plausibility constraint hyperparameter as α = 0.1. For hyperparameter tuning, we search for the optimal number of fine-tuning steps within the range of [1000, ..., 10000] for each learning rate of 5e-5 and 1e-5. The optimal configuration is determined based on GPT-2 XL's Toxicity Probability values averaged across all domains on the validation set, and is subsequently applied to other models without additional tuning. The finalized hyperparameter configurations for each method are summarized in Table 6. |
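The hyperparameter search quoted above (step counts in [1000, ..., 10000] for each learning rate in {5e-5, 1e-5}, selected by the lowest average validation Toxicity Probability of GPT-2 XL) can be sketched as a simple grid search. This is an illustrative sketch, not the authors' code: `evaluate_toxicity` is a hypothetical stand-in that returns synthetic scores, not the paper's measurements.

```python
# Sketch of the paper's hyperparameter selection loop: for each learning
# rate, try each fine-tuning step count and keep the configuration with
# the lowest average validation Toxicity Probability.

def evaluate_toxicity(lr: float, steps: int) -> float:
    """Hypothetical stand-in for fine-tuning GPT-2 XL on the distilled
    text and scoring it on the ToxiGen validation set. Returns a
    synthetic score purely for illustration."""
    return 0.5 - 0.01 * (steps / 1000) + (0.01 if lr > 3e-5 else 0.02)

def select_config(learning_rates, step_grid):
    """Exhaustive grid search; returns the config with the lowest score."""
    best = None
    for lr in learning_rates:
        for steps in step_grid:
            score = evaluate_toxicity(lr, steps)
            if best is None or score < best[0]:
                best = (score, lr, steps)
    return {"learning_rate": best[1], "num_steps": best[2], "toxicity": best[0]}

# Grid from the quoted setup: two learning rates, steps 1000..10000.
config = select_config([5e-5, 1e-5], range(1000, 10001, 1000))
print(config)
```

The selected configuration is then applied unchanged to the other models (OPT, Falcon, LLaMA-2), which is the paper's central claim about avoiding per-model tuning.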