Robust LLM safeguarding via refusal feature adversarial training
Authors: Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods. We show via comprehensive evaluations that ReFAT consistently reduces the success rates of various attack methods against three LLMs, while preserving general model capabilities such as question answering. |
| Researcher Affiliation | Collaboration | Lei Yu (University of Toronto, Meta FAIR); Virginie Do (Meta); Karen Hambardzumyan (University College London, Meta FAIR); Nicola Cancedda (Meta FAIR) |
| Pseudocode | Yes | We provide a pseudo-code of ReFAT in Algorithm 1 (Appendix 10). |
| Open Source Code | No | The paper mentions 'We have thoroughly checked our data and code implementation' in the Reproducibility Statement, but it does not explicitly state that the code is publicly released or provide a link to a repository. |
| Open Datasets | Yes | To train models with ReFAT, we take the adversarial training dataset from (Zou et al., 2024) consisting of harmful requests... and conversational examples taken from UltraChat (Ding et al., 2023) to maintain model efficacy. We sample 5000 harmful requests and 5000 harmless ones from this dataset as our training data, and augment it with 150 examples taken from the XSTest dataset (Röttger et al., 2023)... For robustness evaluations, we take harmful requests from two harmful instruction datasets: HarmBench (Mazeika et al., 2024) and AdvBench (Zou et al., 2023b)... For utility evaluation, we compute standard performance scores on two established benchmarks of LLM general capability: MMLU (Hendrycks et al., 2021) and MT-Bench (Zheng et al., 2023). |
| Dataset Splits | Yes | We sample 5000 harmful requests and 5000 harmless ones from this dataset as our training data, and augment it with 150 examples taken from the XSTest dataset... For robustness evaluations... we only take the 200 standard behaviors from HarmBench with shorter context lengths, and randomly sample 200 AdvBench examples that do not overlap with HarmBench, resulting in a total of 400 evaluation examples. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | No | The paper mentions 'LoRA' and 'AdamW' as part of the experimental setup but does not provide specific version numbers for these or other software libraries/frameworks. |
| Experiment Setup | Yes | Table 4: Hyperparameters of ReFAT (Llama-3-8B-Instruct / Mistral-7B-Instruct / Gemma-7B-it): learning rate 2e-5 / 2e-5 / 2e-5; batch size 32 / 32 / 8; number of epochs 1 / 1 / 1; optimizer AdamW; LoRA rank 128 / 128 / 64; LoRA alpha 32; max. sequence length 512; gradient clipping 1.0; RFA layers [8,32] / [8,32] / [7,28]; \|D_harmful\| and \|D_harmless\| 32/32 per model. |
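The "RFA layers" row above refers to refusal feature ablation: during adversarial training, ReFAT ablates a refusal direction from residual-stream activations at the listed layers to simulate the effect of an adversarial attack. The core operation is a projection: subtract from each hidden state its component along the refusal direction. A minimal sketch of that projection, assuming a single activation vector and a precomputed refusal direction (function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def ablate_refusal_feature(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of activation h that lies along refusal direction r.

    Equivalent to h - (h . r_hat) r_hat, where r_hat is the unit-norm
    refusal direction; the result is orthogonal to r.
    """
    r_hat = r / np.linalg.norm(r)
    return h - np.dot(h, r_hat) * r_hat

# Toy usage: ablating the first coordinate as the "refusal direction".
h = np.array([3.0, 4.0, 0.0])
r = np.array([1.0, 0.0, 0.0])
h_ablated = ablate_refusal_feature(h, r)  # component along r is removed
```

In the full method this projection would be applied batch-wise at each RFA layer during the forward pass on harmful inputs, with the refusal direction estimated as a difference of mean activations between harmful and harmless prompts (the \|D_harmful\| / \|D_harmless\| samples in Table 4).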