HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Authors: Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost. |
| Researcher Affiliation | Collaboration | 1KAIST 2Theori 3Université de Montréal 4Mila Québec AI Institute 5Technische Hochschule Nürnberg Georg Simon Ohm 6CIFAR AI Chair 7DeepAuto.ai |
| Pseudocode | No | The paper only describes steps in regular paragraph text and mathematical equations, without structured formatting resembling pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code, safety guard model, and synthetic dataset are publicly available. |
| Open Datasets | Yes | We evaluate the safety guard models on four public benchmark datasets: OpenAI Moderation (OAI; Markov et al., 2023), ToxicChat (Lin et al., 2023), HarmBench (Mazeika et al., 2024), and the test split of WildGuardMix (Han et al., 2024). |
| Dataset Splits | Yes | For the training dataset D, we use the train split of WildGuardMix (Han et al., 2024) combined with our synthetic dataset. We evaluate the safety guard models on four public benchmark datasets: OpenAI Moderation (OAI; Markov et al., 2023), ToxicChat (Lin et al., 2023), HarmBench (Mazeika et al., 2024), and the test split of WildGuardMix. |
| Hardware Specification | Yes | We measure actual total inference cost on an A100 GPU instance of RunPod. |
| Software Dependencies | No | We use PyTorch (Paszke et al., 2019) and the Transformers library from Hugging Face (Wolf et al., 2020) to implement our proposed method and all the baselines in our experiments. The paper mentions software by name but does not provide specific version numbers for these key components, which are crucial for exact reproducibility. |
| Experiment Setup | Yes | We fine-tune DeBERTa-v3-large for 3 epochs with a batch size of 256, a weight decay of 0.1, λ of 0.5, and a learning rate of 3 × 10−5. We use AdamW (Loshchilov & Hutter, 2019) optimizer and linearly decay the learning rate from the initial value 3 × 10−5 to 0. |
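The linear learning-rate decay quoted in the Experiment Setup row (from 3 × 10⁻⁵ down to 0 over training) can be sketched in plain Python. This is a minimal illustration of the schedule's arithmetic, not the authors' code; the function name `linear_decay_lr` and the step counts are hypothetical, and in practice the paper's setup would use PyTorch's built-in schedulers.

```python
def linear_decay_lr(step: int, total_steps: int, initial_lr: float = 3e-5) -> float:
    """Return the learning rate at `step`, decayed linearly from
    `initial_lr` at step 0 to 0 at `total_steps`.

    Hypothetical helper mirroring the schedule described in the paper's
    Experiment Setup (initial_lr = 3e-5, linear decay to 0).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)  # fraction of training left
    return initial_lr * remaining

# Example: assuming 3 epochs at 1000 optimizer steps each (step count is illustrative)
total = 3 * 1000
start_lr = linear_decay_lr(0, total)      # 3e-5 at the first step
mid_lr = linear_decay_lr(total // 2, total)  # half the initial value at the midpoint
end_lr = linear_decay_lr(total, total)    # 0.0 at the final step
```

In a Transformers-based implementation, the same schedule is typically obtained with `get_linear_schedule_with_warmup` (with zero warmup steps) attached to an `AdamW` optimizer.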