BingoGuard: LLM Content Moderation Tools with Risk Levels
Authors: Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses. |
| Researcher Affiliation | Collaboration | Fan Yin (1), Philippe Laban (3), Xiangyu Peng (2), Yilun Zhou (2), Yixin Mao (2), Vaibhav Vats (2), Linnea Ross (2), Divyansh Agarwal (2), Caiming Xiong (2), Chien-Sheng Wu (2). (1) University of California, Los Angeles; (2) Salesforce; (3) Microsoft Research |
| Pseudocode | No | The paper describes the methodology in narrative text and flowcharts (Figure 1, Figure 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | First, we will release the trained model, BingoGuard-8B, publicly to facilitate reproduction of the quantitative results in the paper and future development. BingoGuard-8B is only for moderation purposes and will not output harmful content by its design. |
| Open Datasets | Yes | We make them public to benefit the training and benchmarking of future LLM moderators. ... We will also release BingoGuardTrain and BingoGuardTest for future research efforts on building safety moderators. ... Our query collection is a set of queries diverse in topics and styles, sourced from previous benchmarks. The harmful prompt sources include: SALAD-Bench (Li et al., 2024), SORRY-Bench (Xie et al., 2024), BeaverTails (Ji et al., 2024), WildGuardTrain (Han et al., 2024), Do Anything Now (Shen et al., 2023), Do-Not-Answer (Wang et al., 2023), WildChat (Zhao et al., 2024). |
| Dataset Splits | Yes | Based on the above taxonomy and framework, we build the BingoGuardTrain and BingoGuardTest datasets. For both datasets, the queries are sourced and selected from existing datasets, but the responses are generated by our framework. BingoGuardTrain contains 54,897 samples in total, including 35,575 for query classification, 16,722 for response classification, and an additional 2,600 for severity level classification... On the other hand, BingoGuardTest has 988 examples that are explicitly labeled with severity levels. Table 6 provides detailed statistics about BingoGuardTrain and BingoGuardTest. |
| Hardware Specification | No | The paper mentions the LLM models used (e.g., Llama3.1-8B-Base, Phi-3-mini-4k) for fine-tuning and evaluation but does not provide specific hardware details such as GPU/CPU models, memory, or processor types used for running experiments. |
| Software Dependencies | No | The paper mentions using HuggingFace's 'trl' library for fine-tuning and 'Sentence-Transformer' models as text embedders, but specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | We train Llama3.1-8B-Base for two epochs with a learning rate of 2 × 10^-6, batch size 128, context length 4096, and warmup ratio 0.03. |
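The Experiment Setup row reports hyperparameters but no derived schedule values. As a minimal sketch, the reported numbers can be translated into an estimated step count and warmup length, assuming fine-tuning runs over the full 54,897-sample BingoGuardTrain set with an effective batch size of 128 and no gradient accumulation beyond that (both assumptions; the paper does not state them):

```python
import math

# Hyperparameters reported in the paper's Experiment Setup row
num_samples = 54_897      # BingoGuardTrain total size (assumed to be the SFT set)
batch_size = 128          # effective batch size
num_epochs = 2
warmup_ratio = 0.03
learning_rate = 2e-6

# Derived optimizer-schedule values (hypothetical, for illustration only)
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 429 858 25
```

Under these assumptions the run would take roughly 858 optimizer steps with about 25 warmup steps; actual values depend on details (gradient accumulation, sample filtering) the paper does not specify.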