Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Authors: Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1% 50.1%) and on two future unseen rounds of human generated attacks (32.5% 43.4%, and 29.4% 40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 0.84) and a future round (0.77 0.79).
Researcher Affiliation Industry Aradhana Sinha¹, Ananth Balashankar¹, Ahmad Beirami¹, Thi Avrahami¹, Jilin Chen¹, and Alex Beutel² (¹Google Research, ²OpenAI)
Pseudocode Yes Algorithm 1 Pseudo-code for ICE Method
Open Source Code No The paper discusses third-party tools that the authors used, such as the TextAttack library (with a GitHub link) and the TF-GAN library (with a blog link). However, it provides no explicit statement or link for the authors' own implementation of the methodology described in the paper.
Open Datasets Yes We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets both collected via an iterative, adversarial human-and-model-in-the-loop procedure. A.I. Adversarial NLI: We evaluate our methods on the Adversarial NLI (ANLI) task (Nie et al., 2020). A.II. Hate Speech Detection: We also evaluate on the Dynabench Hate Speech detection dataset, an adversarial human-in-the-loop dataset generated in four rounds (Vidgen et al., 2021).
Dataset Splits Yes Table 20: Number of human adversarial examples in ANLI and Hate speech detection task, by split.
Hardware Specification No The paper reports training time as "100 GPU hours" but does not specify the GPU model, CPU, or any other hardware used for the experiments. It refers to "BERT-Large" and a "T5 encoder-decoder", which are models, not hardware specifications.
Software Dependencies No The paper mentions several software components, including the T5 model, the BERT-Large pre-trained model, the AdamW optimizer in Jax + TensorFlow, the TextAttack library, and the TF-GAN library. However, it does not provide version numbers for any of these libraries or frameworks, which a reproducible description requires.
Experiment Setup Yes The following learning rate schedule was used: there are first 3,681 warm-up steps at the initial learning rate of 3.0e-05, after which the learning rate decays linearly over the next 36,813 steps (though we fine-tune for only 40k steps in total). The checkpoint that performs best on the validation split is selected for the next stage. During attack generation, we use a high beam parameter α: 0.7 for ANLI and 0.8 for the Toxicity dataset. For additional example diversity, the weight on the reconstruction loss can be made negative; results here are presented with a reconstruction loss weight of 0 for ANLI and 1.5 for the Toxicity dataset. On each step that we update the T5 parameters, we smooth the update (final parameter = original parameter × 0.75 + new parameter × 0.25).
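The schedule and parameter-smoothing rule quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are our own, the decay is assumed to reach zero at the end of the 36,813 decay steps, and parameters are represented as a plain dict rather than a Jax pytree.

```python
def learning_rate(step, base_lr=3.0e-5, warmup_steps=3681, decay_steps=36813):
    """Reported schedule: constant warm-up, then linear decay (assumed to zero)."""
    if step < warmup_steps:
        return base_lr
    # Linear decay over decay_steps after warm-up; clamp at zero afterwards.
    frac = min((step - warmup_steps) / decay_steps, 1.0)
    return base_lr * (1.0 - frac)

def smooth_update(old_params, new_params, keep=0.75):
    """Smoothed T5 update: final = 0.75 * original + 0.25 * new, per parameter."""
    return {name: keep * old_params[name] + (1.0 - keep) * new_params[name]
            for name in old_params}
```

For example, `smooth_update({"w": 1.0}, {"w": 2.0})` yields `{"w": 1.25}`, matching the reported 0.75/0.25 blend.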