Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Authors: Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1% 50.1%) and on two future unseen rounds of human generated attacks (32.5% 43.4%, and 29.4% 40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 0.84) and a future round (0.77 0.79).
Researcher Affiliation Industry Aradhana Sinha¹, Ananth Balashankar¹, Ahmad Beirami¹, Thi Avrahami¹, Jilin Chen¹, and Alex Beutel² (¹Google Research, ²OpenAI)
Pseudocode Yes Algorithm 1 Pseudo-code for ICE Method
Open Source Code No The paper discusses third-party tools that the authors used, such as the TextAttack library (with a GitHub link) and the TF-GAN library (with a blog link). However, it provides no explicit statement or link for the authors' own implementation of the methodology described in the paper.
Open Datasets Yes We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets both collected via an iterative, adversarial human-and-model-in-the-loop procedure. A.I. Adversarial NLI: We evaluate our methods on the Adversarial NLI (ANLI) task (Nie et al., 2020). A.II. Hate Speech Detection: We also evaluate on the Dynabench Hate Speech detection dataset, an adversarial human-in-the-loop dataset generated in four rounds (Vidgen et al., 2021).
Dataset Splits Yes Table 20: Number of human adversarial examples in ANLI and Hate speech detection task, by split.
Hardware Specification No The paper reports training time as "100 GPU hours" but does not specify the GPU model, CPU, or any other hardware used for the experiments. It refers to "BERT-Large" and a "T5 encoder-decoder", which are models, not hardware specifications.
Software Dependencies No The paper mentions several software components, including the T5 model, the BERT-Large pre-trained model, the AdamW optimizer in Jax + TensorFlow, the TextAttack library, and the TF-GAN library. However, it does not provide version numbers for any of these libraries or frameworks, which a reproducible description requires.
Experiment Setup Yes The following learning rate schedule was used: there are first 3,681 warm-up steps at the initial learning rate of 3.0e-05, after which the learning rate decays linearly over the next 36,813 steps (though we fine-tune for only 40k steps in total). The checkpoint that performs best on the validation split is selected for the next stage. During attack generation, we use a high beam parameter α: 0.7 for ANLI and 0.8 for the Toxicity dataset. For additional example diversity, the weight on the reconstruction loss can be made negative; results here are presented with a reconstruction loss weight of 0 for ANLI and 1.5 for the Toxicity dataset. On each step that we update the T5 parameters, we smooth the update (final parameter = original parameter × 0.75 + new parameter × 0.25).
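The schedule and parameter-smoothing rule quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are our own, the decay is assumed to reach zero at the end of the 36,813 decay steps, and parameters are represented as a plain dict rather than a Jax pytree.

```python
def learning_rate(step, base_lr=3.0e-5, warmup_steps=3681, decay_steps=36813):
    """Reported schedule: constant warm-up, then linear decay (assumed to zero)."""
    if step < warmup_steps:
        return base_lr
    # Linear decay over decay_steps after warm-up; clamp at zero afterwards.
    frac = min((step - warmup_steps) / decay_steps, 1.0)
    return base_lr * (1.0 - frac)

def smooth_update(old_params, new_params, keep=0.75):
    """Smoothed T5 update: final = 0.75 * original + 0.25 * new, per parameter."""
    return {name: keep * old_params[name] + (1.0 - keep) * new_params[name]
            for name in old_params}
```

For example, `smooth_update({"w": 1.0}, {"w": 2.0})` yields `{"w": 1.25}`, matching the reported 0.75/0.25 blend.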