Large Language Models can Become Strong Self-Detoxifiers

Authors: Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, Luca Daniel

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains performance comparable to state-of-the-art detoxification techniques, significantly reducing the toxicity level while using only the LLM's internal representations.
Researcher Affiliation | Collaboration | Ching-Yun Ko & Pin-Yu Chen (IBM Research), Payel Das & Youssef Mroueh (IBM Research), Soham Dan, Georgios Kollias, Subhajit Chaudhury & Tejaswini Pedapati (IBM Research), Luca Daniel (MIT)
Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block. While it describes the SASA method theoretically and includes mathematical derivations (e.g., Proposition 1 and its proof in Appendix A.2), it does not present the steps of the algorithm in a structured, pseudocode-like format.
Open Source Code | No | The paper does not contain an explicit statement that the authors are releasing source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We follow the settings in previous works (Liu et al., 2021; Cao et al., 2022; Deng & Raffel, 2023) and use the RealToxicityPrompts (RTP) dataset (Gehman et al., 2020), BOLD (Dhamala et al., 2021), and AttaQ (Kour et al., 2023) as our prompts. ... Specifically, we use the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., 2019), which contains 2M human-annotated comments with continuous labels between 0 and 1 denoting their toxicity levels (the higher, the more toxic). ... URL: https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
Dataset Splits | No | The paper mentions categorizing the Jigsaw comments into non-toxic (1,401,758) and toxic (597,754) pools based on their labels for subspace learning. It also refers to using "10K non-toxic prompts randomly sampled by DExpert (Liu et al., 2021) from the RTP dataset" for one experiment. However, it does not provide explicit training/validation/test splits for the authors' own experimental setup, nor does it define how the BOLD or AttaQ datasets were split for their specific evaluations beyond using them as prompts.
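The non-toxic/toxic categorization described above amounts to a threshold on the continuous Jigsaw toxicity labels. A minimal Python sketch, assuming a 0.5 cutoff (the actual cutoff is not stated in the quoted text):

```python
# Sketch: split continuous toxicity labels (0 = benign, 1 = toxic) into
# two pools for subspace learning. The 0.5 threshold is an assumption.

def split_by_toxicity(labels, threshold=0.5):
    """Partition comment indices into non-toxic and toxic pools."""
    non_toxic = [i for i, y in enumerate(labels) if y < threshold]
    toxic = [i for i, y in enumerate(labels) if y >= threshold]
    return non_toxic, toxic

labels = [0.0, 0.1, 0.9, 0.5, 0.3]
non_toxic, toxic = split_by_toxicity(labels)
print(non_toxic, toxic)  # [0, 1, 4] [2, 3]
```

With the counts reported in the paper (1,401,758 + 597,754 ≈ 2M), such a partition covers the full comment set.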
Hardware Specification | Yes | We implement SASA using PyTorch and perform the inference on NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions implementing SASA using PyTorch but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | Given a prompt c, the task is to generate continuations x with up to 20 new tokens using nucleus sampling. We follow the settings in previous works (Liu et al., 2021; Cao et al., 2022; Deng & Raffel, 2023) and use the RealToxicityPrompts (RTP) dataset (Gehman et al., 2020), BOLD (Dhamala et al., 2021), and AttaQ (Kour et al., 2023) as our prompts. ... Specifically, we use the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., 2019) ... In Table 1, we further report RAD and SASA using nucleus sampling (p = 0.9). ... where the parameter β > 0 acts as a trade-off parameter between maximizing the expected margin and minimizing the divergence from the reference distribution.
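The β trade-off quoted above has a standard closed form: maximizing the expected margin minus (1/β) times the KL divergence from the reference distribution yields a next-token distribution proportional to p_ref(x) · exp(β · margin(x)). A minimal NumPy sketch combining that re-weighting with top-p (nucleus) filtering at p = 0.9; the function name and the toy numbers are illustrative, and the nucleus-then-reweight ordering is an assumption, not the paper's exact procedure:

```python
import numpy as np

def reweighted_nucleus_distribution(ref_probs, margins, beta=1.0, top_p=0.9):
    """Restrict to the smallest set of tokens whose reference mass reaches
    top_p, re-weight by exp(beta * margin), and renormalize."""
    ref_probs = np.asarray(ref_probs, dtype=float)
    margins = np.asarray(margins, dtype=float)
    # Nucleus: keep the most probable tokens covering top_p of reference mass.
    order = np.argsort(ref_probs)[::-1]
    cum = np.cumsum(ref_probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    mask = np.zeros_like(ref_probs)
    mask[keep] = 1.0
    # Closed form of  max E[margin] - (1/beta) KL(p || p_ref):
    #   p(x) proportional to p_ref(x) * exp(beta * margin(x)).
    w = mask * ref_probs * np.exp(beta * margins)
    return w / w.sum()

p = reweighted_nucleus_distribution([0.5, 0.3, 0.15, 0.05],
                                    [0.2, -1.0, 0.5, 0.0],
                                    beta=2.0, top_p=0.9)
print(np.round(p, 3))
```

As β → 0 the distribution reverts to the (nucleus-filtered) reference; larger β shifts mass toward tokens with higher margin, matching the trade-off role described in the quote.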