Large Language Models can Become Strong Self-Detoxifiers
Authors: Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, Luca Daniel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains comparable performance to state-of-the-art detoxification techniques, significantly reducing the toxicity level using only the LLM's internal representations. |
| Researcher Affiliation | Collaboration | Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati (IBM Research); Luca Daniel (MIT) |
| Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block. While it describes the SASA method theoretically and includes mathematical derivations (e.g., Proposition 1 and its proof in Appendix A.2), it does not present the steps of the algorithm in a structured, pseudocode-like format. |
| Open Source Code | No | The paper does not contain an explicit statement that the authors are releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We follow the settings in previous works (Liu et al., 2021; Cao et al., 2022; Deng & Raffel, 2023) and use the RealToxicityPrompts (RTP) dataset (Gehman et al., 2020), BOLD (Dhamala et al., 2021), and AttaQ (Kour et al., 2023) as our prompts. ... Specifically, we use the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., 2019), which contains 2M human-annotated comments with continuous labels between 0 and 1 denoting their toxicity levels (the higher, the more toxic). ... URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification. |
| Dataset Splits | No | The paper mentions categorizing the Jigsaw comments into non-toxic (1,401,758) and toxic (597,754) based on their labels for subspace learning. It also refers to using "10K non-toxic prompts randomly sampled by DExpert (Liu et al., 2021) from the RTP dataset" for one experiment. However, it does not provide explicit training/validation/test splits for their own experimental setup, nor does it define how the BOLD or AttaQ datasets were split for their specific evaluations beyond using them as prompts. |
| Hardware Specification | Yes | We implement SASA using PyTorch and perform the inference on NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions implementing SASA using "PyTorch" but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Given a prompt c, the task is to generate continuations x with up to 20 new tokens using nucleus sampling. We follow the settings in previous works (Liu et al., 2021; Cao et al., 2022; Deng & Raffel, 2023) and use the RealToxicityPrompts (RTP) dataset (Gehman et al., 2020), BOLD (Dhamala et al., 2021), and AttaQ (Kour et al., 2023) as our prompts. ... Specifically, we use the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., 2019)... In Table 1, we further report RAD and SASA using nucleus sampling (p = 0.9). ... where the parameter β > 0 acts as a trade-off parameter between maximizing the expected margin and minimizing the divergence from the reference distribution. |
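The decoding rule quoted in the Experiment Setup row (maximize the expected margin while penalizing divergence from the reference distribution, traded off by β, combined with nucleus sampling at p = 0.9) admits a closed-form reweighting of the form p'(x) ∝ p(x)·exp(β·margin(x)). The sketch below illustrates that reweighting on toy inputs; it is an illustration only, not the authors' released code, and the function name, the toy logits, and the `margins` vector (standing in for scores from the learned toxicity subspace) are all assumptions.

```python
import numpy as np

def margin_reweighted_sample(logits, margins, beta=10.0, top_p=0.9, rng=None):
    """Illustrative margin-reweighted nucleus sampling.

    logits  : raw next-token logits from the language model, shape [V]
    margins : hypothetical per-token margin scores from a learned
              toxicity subspace (larger = further from toxic), shape [V]
    beta    : trade-off between maximizing the expected margin and
              staying close to the reference distribution
    top_p   : nucleus sampling threshold (the paper reports p = 0.9)
    """
    rng = rng or np.random.default_rng()
    # Reference distribution p(x) from the LM (stable softmax).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Closed-form adjustment: p'(x) ∝ p(x) * exp(beta * margin(x)).
    q = p * np.exp(beta * (margins - margins.max()))
    q /= q.sum()
    # Nucleus (top-p) filtering on the adjusted distribution:
    # keep the smallest set of tokens whose cumulative mass >= top_p.
    order = np.argsort(q)[::-1]
    cutoff = np.searchsorted(np.cumsum(q[order]), top_p) + 1
    keep = order[:cutoff]
    probs = q[keep] / q[keep].sum()
    return int(keep[rng.choice(len(keep), p=probs)])
```

With a large β and one token whose margin dominates, the adjusted distribution concentrates on that token and the nucleus filter keeps it alone, which matches the intuition that β controls how aggressively generation is steered toward the non-toxic side of the subspace.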