Improving Black-box Robustness with In-Context Rewriting

Authors: Kyle O'Brien, Nathan Hoyen Ng, Isha Puri, Jorge Mendez-Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, Thomas Hartvigsen

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. LLM-TTA outperforms conventional augmentation functions across sentiment, toxicity, and news classification tasks for BERT and T5 models, with BERT's OOD robustness improving by an average of 4.48 percentage points without regressing average ID performance. We explore selectively augmenting inputs based on prediction entropy to reduce the rate of expensive LLM augmentations, allowing us to maintain performance gains while reducing the average number of generated augmentations by 57.74%. Our study is composed of nine public datasets and one novel synthetic dataset across sentiment analysis, toxicity detection, and news topic classification. We experiment with two LLM-TTA methods: zero-shot paraphrasing, where we prompt the LLM to generate paraphrases of the input text, and In-Context Rewriting (ICR), where the LLM rewrites the input to be more like a set of ID exemplars provided in the prompt.
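The entropy-based selective augmentation described above (only paying for LLM augmentations when the task model is uncertain) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold value and the `augment_fn` callable (standing in for an LLM call) are hypothetical.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_tta(probs, augment_fn, threshold=0.5):
    """Request (expensive) LLM augmentations only when the model is uncertain.

    probs: task-model class probabilities for the original input.
    augment_fn: callable returning a list of augmented inputs (e.g. an LLM call).
    """
    if prediction_entropy(probs) <= threshold:
        return []  # confident prediction: keep it, skip augmentation
    return augment_fn()

# Confident prediction -> no augmentations are generated
assert selective_tta([0.97, 0.02, 0.01], lambda: ["aug"]) == []
# Uncertain prediction -> augmentations are generated
assert selective_tta([0.4, 0.35, 0.25], lambda: ["aug1", "aug2"]) == ["aug1", "aug2"]
```

Skipping confident inputs is what lets the method cut the number of generated augmentations (57.74% on average in the paper) while keeping the robustness gains.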
Researcher Affiliation: Collaboration. Kyle O'Brien (1), Nathan Ng (3,4,5), Isha Puri (5), Jorge Mendez (5), Hamid Palangi (2), Yoon Kim (5), Marzyeh Ghassemi (5), Thomas Hartvigsen (6); 1 EleutherAI, 2 Google, 3 University of Toronto, 4 Vector Institute, 5 MIT CSAIL, 6 University of Virginia.
Pseudocode: No. The paper describes the LLM-TTA method and its two prompting methods (Paraphrasing and In-Context Rewriting) using natural language descriptions and prompt templates in Figure 3. It does not include any formal pseudocode or algorithm blocks.
Open Source Code: Yes. "We share our data1, models2, and code3 for reproducibility." 3https://github.com/Kyle1668/LLM-TTA
Open Datasets: Yes. "We share our data1, models2, and code3 for reproducibility." 1https://huggingface.co/datasets/Kyle1668/LLM-TTA-Augmentation-Logs We use three ID evaluation datasets and seven OOD datasets across sentiment analysis, toxicity detection, and news topic classification. Sentiment Classification. We consider three sentiment classification datasets from the BOSS benchmark (Yuan et al., 2023). The ID dataset consists of Amazon reviews (McAuley & Leskovec, 2013), while the three OOD datasets are DynaSent (Potts et al., 2020), SST-5 (Socher et al., 2013), and SemEval (Nakov et al., 2016). Toxicity Detection. The ID dataset is Civil Comments (Borkan et al., 2019). The OOD datasets are AdvCivil, an adversarial version of Civil Comments introduced in the benchmark, as well as existing datasets ToxiGen (Hartvigsen et al., 2022) and Implicit Hate (ElSherief et al., 2021). News Topic Classification. AG News (Zhang et al., 2015) is a four-class news topic classification problem.
Dataset Splits: Yes. We use three ID evaluation datasets and seven OOD datasets across sentiment analysis, toxicity detection, and news topic classification. Each task model is optimized for the ID evaluation set either through finetuning for BERT and T5, or prompting with Falcon. The training splits for the OOD datasets are not used. We keep this benchmark's selection of ID and OOD splits. In this experiment, we study whether LLM-TTA can improve task model robustness across data scales. We train 5 BERT models on 5%, 10%, 20%, 40%, and 80% of the ID training set for each of our three tasks. The base models and hyperparameters are identical across runs and follow the training regime outlined in Appendix A.5. We build each balanced training subset via stratified random sampling across classes.
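The balanced-subset construction described above (stratified random sampling of a fixed fraction per class) can be sketched as follows; the function name and data layout are illustrative, not taken from the paper's code.

```python
import random
from collections import defaultdict

def stratified_subset(examples, labels, fraction, seed=42):
    """Sample `fraction` of the data from each class independently, so the
    subset preserves the original class balance (e.g. 5%, 10%, ... of the
    ID training set)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    subset = []
    for y, xs in by_class.items():
        k = max(1, round(len(xs) * fraction))  # at least one example per class
        subset.extend((x, y) for x in rng.sample(xs, k))
    rng.shuffle(subset)
    return subset
```

For a toy set of 100 examples per class, `stratified_subset(..., fraction=0.10)` returns 10 examples from each class, mirroring the paper's 5%-80% data-scale runs.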
Hardware Specification: No. The paper mentions general compute resources in the acknowledgements, such as "EleutherAI for permitting access to their compute resources" and "University of Virginia Research Computing team for providing access to excellent high-performance computing resources", but does not specify any particular GPU or CPU models, memory sizes, or detailed system configurations used for the experiments.
Software Dependencies: No. The paper mentions several models and tools, such as "BERT (Devlin et al., 2019)", "T5-Large (Raffel et al., 2020)", "Falcon-7b (Almazrouei et al., 2023)", "nlpaug recommended by Lu et al. (2022)", "Hugging Face Transformers library (Wolf et al., 2019)", and "GPT-3.5 Turbo (Brown et al., 2020) (6/7/23 version)". However, it does not provide specific version numbers for the `nlpaug` library or the Hugging Face Transformers library, which are key ancillary software components.
Experiment Setup: Yes. Following the suggested baseline in Mosbach et al. (2020), we used a batch size of 32 examples, weight decay of 0.01, and a linear learning rate schedule peaking at 2e-5. We use Stable Beluga 2-7B (SB2) to generate augmentations. TTA: OOD Generations. For LLM-TTA and back-translation, we generate four augmentations for each test input using temperature-based decoding with a temperature of 0.3. ICR uses 16 randomly selected unlabeled exemplars balanced across classes sourced from the ID training set. Back-translation uses the Hugging Face generation pipeline API. Text is translated from English to German (facebook/wmt19-en-de) and then back into English (facebook/wmt19-de-en). The generation parameters are four return sequences, temperature of 0.7, four beams, four beam groups, top-p of 0.95, top-k of 0, repetition penalty of 10.0, diversity penalty of 1.0, and no-repeat n-gram size of 2. LLM Classifier Inference. Each prompt contains 16 in-distribution training examples selected and ordered randomly within the prompt with a random seed of 42. Examples in the prompt are balanced across classes. Greedy decoding is used with a max of 10 new tokens.
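For reference, the decoding parameters listed above map onto the keyword arguments of the standard Hugging Face `generate()` API roughly as follows. This is a configuration sketch only: the model loading and translation calls are omitted, and the mapping of names (e.g. `no_repeat_ngram_size` for "no-repeat n-gram size") is our reading of the quoted settings, not code from the paper.

```python
# Back-translation decoding settings quoted above, expressed as kwargs
# for transformers' model.generate(...) with facebook/wmt19-en-de / -de-en.
backtranslation_generation_kwargs = {
    "num_return_sequences": 4,   # four augmentations per test input
    "temperature": 0.7,
    "num_beams": 4,
    "num_beam_groups": 4,
    "top_p": 0.95,
    "top_k": 0,
    "repetition_penalty": 10.0,
    "diversity_penalty": 1.0,
    "no_repeat_ngram_size": 2,
}

# LLM-TTA augmentation decoding: temperature-based sampling with T=0.3,
# four augmentations per input.
llm_tta_generation_kwargs = {
    "do_sample": True,
    "temperature": 0.3,
    "num_return_sequences": 4,
}
```

Keeping the two configurations side by side highlights that back-translation relies on diverse beam search (beam groups plus a diversity penalty) while LLM-TTA uses plain low-temperature sampling.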