Robust LLM safeguarding via refusal feature adversarial training
Authors: Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods. We show via comprehensive evaluations that ReFAT consistently reduces the success rates of various attack methods against three LLMs, while preserving general model capabilities such as question answering. |
| Researcher Affiliation | Collaboration | Lei Yu (University of Toronto, Meta FAIR); Virginie Do (Meta); Karen Hambardzumyan (University College London, Meta FAIR); Nicola Cancedda (Meta FAIR) |
| Pseudocode | Yes | We provide a pseudo-code of ReFAT in Algorithm 1 (Appendix 10). |
| Open Source Code | No | The paper mentions 'We have thoroughly checked our data and code implementation' in the Reproducibility Statement, but it does not explicitly state that the code is publicly released or provide a link to a repository. |
| Open Datasets | Yes | To train models with ReFAT, we take the adversarial training dataset from (Zou et al., 2024) consisting of harmful requests... and conversational examples taken from UltraChat (Ding et al., 2023) to maintain model efficacy. We sample 5000 harmful requests and 5000 harmless ones from this dataset as our training data, and augment it with 150 examples taken from the XSTest dataset (Röttger et al., 2023)... For robustness evaluations, we take harmful requests from two harmful instruction datasets: HarmBench (Mazeika et al., 2024) and AdvBench (Zou et al., 2023b)... For utility evaluation, we compute standard performance scores on two established benchmarks of LLM general capability: MMLU (Hendrycks et al., 2021) and MT-Bench (Zheng et al., 2023). |
| Dataset Splits | Yes | We sample 5000 harmful requests and 5000 harmless ones from this dataset as our training data, and augment it with 150 examples taken from the XSTest dataset... For robustness evaluations... we only take the 200 standard behaviors from HarmBench with shorter context lengths, and randomly sample 200 AdvBench examples that do not overlap with HarmBench, resulting in a total of 400 evaluation examples. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | No | The paper mentions 'LoRA' and 'AdamW' as part of the experimental setup but does not provide specific version numbers for these or other software libraries/frameworks. |
| Experiment Setup | Yes | Table 4: Hyperparameters of ReFAT (Llama-3-8B-Instruct / Mistral-7B-Instruct / Gemma-7B-it): learning rate 2e-5 / 2e-5 / 2e-5; batch size 32 / 32 / 8; number of epochs 1 / 1 / 1; optimizer AdamW; LoRA rank 128 / 128 / 64; LoRA alpha 32; max. sequence length 512; gradient clipping 1.0; RFA layers [8,32] / [8,32] / [7,28]; \|D_harmful\| and \|D_harmless\| 32/32 per model. |
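The "RFA layers" row above refers to refusal feature ablation: during adversarial training, ReFAT ablates a refusal direction from residual-stream activations at the listed layers to simulate the effect of an adversarial attack. The core operation is a projection: subtract from each hidden state its component along the refusal direction. A minimal sketch of that projection, assuming a single activation vector and a precomputed refusal direction (function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def ablate_refusal_feature(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of activation h that lies along refusal direction r.

    Equivalent to h - (h . r_hat) r_hat, where r_hat is the unit-norm
    refusal direction; the result is orthogonal to r.
    """
    r_hat = r / np.linalg.norm(r)
    return h - np.dot(h, r_hat) * r_hat

# Toy usage: ablating the first coordinate as the "refusal direction".
h = np.array([3.0, 4.0, 0.0])
r = np.array([1.0, 0.0, 0.0])
h_ablated = ablate_refusal_feature(h, r)  # component along r is removed
```

In the full method this projection would be applied batch-wise at each RFA layer during the forward pass on harmful inputs, with the refusal direction estimated as a difference of mean activations between harmful and harmless prompts (the \|D_harmful\| / \|D_harmless\| samples in Table 4).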