BingoGuard: LLM Content Moderation Tools with Risk Levels
Authors: Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses. |
| Researcher Affiliation | Collaboration | Fan Yin (1), Philippe Laban (3), Xiangyu Peng (2), Yilun Zhou (2), Yixin Mao (2), Vaibhav Vats (2), Linnea Ross (2), Divyansh Agarwal (2), Caiming Xiong (2), Chien-Sheng Wu (2). (1) University of California, Los Angeles; (2) Salesforce; (3) Microsoft Research |
| Pseudocode | No | The paper describes the methodology in narrative text and flowcharts (Figure 1, Figure 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | First, we will release the trained model, BingoGuard-8B, publicly to facilitate reproduction of the quantitative results in the paper and future development. BingoGuard-8B is only for moderation purposes and will not output harmful content by its design. |
| Open Datasets | Yes | We make them public to benefit the training and benchmarking of future LLM moderators. ... We will also release BingoGuardTrain and BingoGuardTest for future research efforts on building safety moderators. ... Our query collection is a set of queries diverse in topics and styles, sourced from previous benchmarks. The harmful prompt sources include: SALAD-Bench (Li et al., 2024), SORRY-Bench (Xie et al., 2024), BeaverTails (Ji et al., 2024), WildGuardTrain (Han et al., 2024), Do Anything Now (Shen et al., 2023), Do-Not-Answer (Wang et al., 2023), WildChat (Zhao et al., 2024). |
| Dataset Splits | Yes | Based on the above taxonomy and framework, we build the BingoGuardTrain and BingoGuardTest datasets. For both datasets, the queries are sourced and selected from existing datasets, but the responses are generated by our framework. BingoGuardTrain contains 54,897 samples in total, including 35,575 for query classification, 16,722 for response classification, and an additional 2,600 for severity level classification... On the other hand, BingoGuardTest has 988 examples that are explicitly labeled with severity levels. Table 6 provides detailed statistics about BingoGuardTrain and BingoGuardTest. |
| Hardware Specification | No | The paper mentions the LLM models used (e.g., Llama3.1-8B-Base, Phi-3-mini-4k) for fine-tuning and evaluation but does not provide specific hardware details such as GPU/CPU models, memory, or processor types used for running experiments. |
| Software Dependencies | No | The paper mentions using HuggingFace's 'trl' library for fine-tuning and 'Sentence-Transformer' models as text embedders, but specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | We train Llama3.1-8B-Base for two epochs with a learning rate of 2 × 10^-6, batch size 128, context length 4096, and warmup ratio 0.03. |
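The Experiment Setup row reports hyperparameters but no derived schedule values. As a minimal sketch, the reported numbers can be translated into an estimated step count and warmup length, assuming fine-tuning runs over the full 54,897-sample BingoGuardTrain set with an effective batch size of 128 and no gradient accumulation beyond that (both assumptions; the paper does not state them):

```python
import math

# Hyperparameters reported in the paper's Experiment Setup row
num_samples = 54_897      # BingoGuardTrain total size (assumed to be the SFT set)
batch_size = 128          # effective batch size
num_epochs = 2
warmup_ratio = 0.03
learning_rate = 2e-6

# Derived optimizer-schedule values (hypothetical, for illustration only)
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 429 858 25
```

Under these assumptions the run would take roughly 858 optimizer steps with about 25 warmup steps; actual values depend on details (gradient accumulation, sample filtering) the paper does not specify.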