Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data

Authors: Alon Albalak, Colin A. Raffel, William Yang Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3."
Researcher Affiliation | Academia | "Alon Albalak, University of California, Santa Barbara; Colin Raffel, University of Toronto, Vector Institute; William Yang Wang, University of California, Santa Barbara"
Pseudocode | Yes | "We include here pseudo-code for our 2 proposed algorithms. Algorithm 1 contains the pseudo-code for EXP3-FLAD, and Algorithm 2 contains the pseudo-code for UCB1-FLAD."
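The paper's Algorithms 1 and 2 are not reproduced in this report. As a rough illustration of the bandit machinery they build on, the Python sketch below shows a generic EXP3 weight update and a generic UCB1 index for choosing among K auxiliary datasets; the `reward_fn` placeholder and the assumption that rewards lie in [0, 1] are ours, not the paper's.

```python
import math
import random

# Minimal sketch (not the authors' exact pseudocode) of the two bandit
# strategies named above, applied to choosing among K auxiliary datasets.
# `reward_fn` is a hypothetical placeholder for the per-step reward the
# paper derives from gradients; values are assumed to lie in [0, 1].

def exp3_step(weights, gamma, reward_fn):
    """One EXP3 round: sample a dataset, observe a reward, update its weight."""
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    i = random.choices(range(k), weights=probs)[0]
    r = reward_fn(i)                      # reward in [0, 1] for dataset i
    estimate = r / probs[i]               # importance-weighted reward estimate
    weights[i] *= math.exp(gamma * estimate / k)
    return i

def ucb1_choose(means, counts, t):
    """UCB1: pick the dataset maximizing mean reward plus an exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:                        # play every arm once before using the index
            return i
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
```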
Open Source Code | Yes | "All of our code is available at github.com/alon-albalak/FLAD."
Open Datasets | Yes | "We obtain all datasets from Hugging Face Datasets, and cast them to the text-to-text format by applying prompt templates from the Public Pool of Prompts (P3) [23] that was used to train T0."
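As a hypothetical illustration of this step, the sketch below loads one dataset from Hugging Face Datasets and applies a P3 prompt template via the promptsource library; the choice of SuperGLUE RTE and of the first available template are assumptions for the example, not details taken from the paper's code.

```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Illustrative only: dataset and template selection are assumptions.
dataset = load_dataset("super_glue", "rte", split="validation")
templates = DatasetTemplates("super_glue", "rte")
template = templates[templates.all_template_names[0]]

example = dataset[0]
input_text, target_text = template.apply(example)  # text-to-text (input, target) pair
print(input_text, "->", target_text)
```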
Dataset Splits | Yes | "For each dataset, we randomly sample five few-shot splits from their training data, containing the same number of training examples as previous works, between 20 and 70 [55, 56]. We further divide each split into equal training and validation partitions for true few-shot learning [57] (e.g. 10 train and 10 validation samples for HellaSwag)."
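A minimal sketch of such a split procedure, assuming a 20-example budget and arbitrary seeds (both are illustrative values, not the paper's exact choices):

```python
from datasets import load_dataset

# Hypothetical illustration of the split procedure described above.
full_train = load_dataset("hellaswag", split="train")
num_shots = 20
few_shot_splits = []
for seed in range(5):                                # five random few-shot splits
    sample = full_train.shuffle(seed=seed).select(range(num_shots))
    train = sample.select(range(num_shots // 2))                # e.g. 10 train examples
    valid = sample.select(range(num_shots // 2, num_shots))     # e.g. 10 validation examples
    few_shot_splits.append({"train": train, "validation": valid})
```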
Hardware Specification | Yes | "We train all models (FLAD and non-FLAD) on 40GB A100s."
Software Dependencies | No | "We used model checkpoints from Hugging Face Transformers [45]. For all experiments we use the Adafactor optimizer [58]."
Experiment Setup | Yes | "For the target-only baseline, we use learning rates in {1e-4, 3e-4}. For all other methods, we always use a learning rate of 1e-4. For target-, explore-, and exploit-only baselines we use batch sizes in {32, 128}. For loss-scaling, EXP3-FLAD, and UCB1-FLAD we use mini-batches of 8 samples and let G be in {4, 16} to match the batch size of all methods. For explore- and exploit-only, we use a target dataset mixing ratio of M ∈ {1, 5, 10}. For all experiments we use the Adafactor optimizer [58] and validation-based early stopping for model checkpoint selection."
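The sketch below enumerates the bandit-method portion of that grid (learning rate 1e-4, mini-batches of 8, G in {4, 16}, giving effective batches of 32 or 128) and instantiates Adafactor from Hugging Face Transformers; the dummy model and the specific Adafactor flags are assumptions made so the snippet runs standalone.

```python
from itertools import product
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(8, 8)  # stand-in for the actual T5/T0 checkpoint
grid = {"learning_rate": [1e-4], "mini_batch_size": [8], "G": [4, 16]}

for lr, mb, g in product(*grid.values()):
    # With a fixed learning rate, relative_step/scale_parameter are disabled
    # (an assumed configuration, not taken from the paper).
    optimizer = Adafactor(model.parameters(), lr=lr,
                          scale_parameter=False, relative_step=False,
                          warmup_init=False)
    # G mini-batches of `mb` samples give an effective batch of 32 or 128,
    # matching the batch sizes used by the non-bandit methods.
    print(f"lr={lr}, effective batch size={mb * g}")
```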