UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation
Authors: Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UNIDETOX eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. |
| Researcher Affiliation | Academia | Huimin Lu 1 Masaru Isonuma 1,2,3 Junichiro Mori 1,4 Ichiro Sakata 1 1The University of Tokyo 2The University of Edinburgh 3NII 4RIKEN AIP |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are available at https://github.com/EminLU/UniDetox. |
| Open Datasets | Yes | To create a toxic model, we use the Dynamically Generated Hate Speech (DGHS) dataset (Vidgen et al., 2021)... For evaluation, we use ToxiGen (Hartvigsen et al., 2022)... We also use the MMLU question-answering dataset (Hendrycks et al., 2021a;b)... The URLs of datasets and models used in our experiment are listed in Appendix B.1. Table 4: URLs of models and datasets on Hugging Face. |
| Dataset Splits | Yes | The ToxiGen dataset is split into validation and test sets, containing 896 and 940 examples, respectively. We use the validation set for hyperparameter tuning and report the results on the test set. We randomly sample 10% from the train split as the validation set, while we use the whole test split as the test set. |
| Hardware Specification | Yes | All time measurements are approximate and were conducted on a single NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimizer' but does not provide specific version numbers for any software or libraries used in the experiments. |
| Experiment Setup | Yes | The toxic model is obtained by fine-tuning GPT-2 on the DGHS dataset for three epochs using AdamW optimizer (Kingma, 2014) with a batch size of 4, a learning rate of 1e-5, β1 = 0.9, and β2 = 0.999. We sample 640 texts, each with a maximum length of 256 tokens... We fine-tune the models for detoxification on the sampled texts using AdamW optimizer with a batch size of 8, β1 = 0.9, and β2 = 0.999. Throughout our experiments, we set the adaptive plausibility constraint hyperparameter as α = 0.1. For hyperparameter tuning, we search for the optimal number of fine-tuning steps within the range of [1000, ..., 10000] for each learning rate of 5e-5 and 1e-5. The optimal configuration is determined based on GPT-2 XL's Toxicity Probability values averaged across all domains on the validation set, and is subsequently applied to other models without additional tuning. The finalized hyperparameter configurations for each method are summarized in Table 6. |
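The hyperparameter search quoted above (step counts in [1000, ..., 10000] for each learning rate in {5e-5, 1e-5}, selected by the lowest average validation Toxicity Probability of GPT-2 XL) can be sketched as a simple grid search. This is an illustrative sketch, not the authors' code: `evaluate_toxicity` is a hypothetical stand-in that returns synthetic scores, not the paper's measurements.

```python
# Sketch of the paper's hyperparameter selection loop: for each learning
# rate, try each fine-tuning step count and keep the configuration with
# the lowest average validation Toxicity Probability.

def evaluate_toxicity(lr: float, steps: int) -> float:
    """Hypothetical stand-in for fine-tuning GPT-2 XL on the distilled
    text and scoring it on the ToxiGen validation set. Returns a
    synthetic score purely for illustration."""
    return 0.5 - 0.01 * (steps / 1000) + (0.01 if lr > 3e-5 else 0.02)

def select_config(learning_rates, step_grid):
    """Exhaustive grid search; returns the config with the lowest score."""
    best = None
    for lr in learning_rates:
        for steps in step_grid:
            score = evaluate_toxicity(lr, steps)
            if best is None or score < best[0]:
                best = (score, lr, steps)
    return {"learning_rate": best[1], "num_steps": best[2], "toxicity": best[0]}

# Grid from the quoted setup: two learning rates, steps 1000..10000.
config = select_config([5e-5, 1e-5], range(1000, 10001, 1000))
print(config)
```

The selected configuration is then applied unchanged to the other models (OPT, Falcon, LLaMA-2), which is the paper's central claim about avoiding per-model tuning.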