Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data

Authors: Alon Albalak, Colin A. Raffel, William Yang Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3."
Researcher Affiliation | Academia | "Alon Albalak, University of California, Santa Barbara; Colin Raffel, University of Toronto, Vector Institute; William Yang Wang, University of California, Santa Barbara"
Pseudocode | Yes | "We include here pseudo-code for our 2 proposed algorithms. Algorithm 1 contains the pseudo-code for EXP3-FLAD, and Algorithm 2 contains the pseudo-code for UCB1-FLAD."
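The paper's Algorithms 1 and 2 are not reproduced in this report. As a rough illustration of the bandit machinery they build on, the Python sketch below shows a generic EXP3 weight update and a generic UCB1 index for choosing among K auxiliary datasets; the `reward_fn` placeholder and the assumption that rewards lie in [0, 1] are ours, not the paper's.

```python
import math
import random

# Minimal sketch (not the authors' exact pseudocode) of the two bandit
# strategies named above, applied to choosing among K auxiliary datasets.
# `reward_fn` is a hypothetical placeholder for the per-step reward the
# paper derives from gradients; values are assumed to lie in [0, 1].

def exp3_step(weights, gamma, reward_fn):
    """One EXP3 round: sample a dataset, observe a reward, update its weight."""
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    i = random.choices(range(k), weights=probs)[0]
    r = reward_fn(i)                      # reward in [0, 1] for dataset i
    estimate = r / probs[i]               # importance-weighted reward estimate
    weights[i] *= math.exp(gamma * estimate / k)
    return i

def ucb1_choose(means, counts, t):
    """UCB1: pick the dataset maximizing mean reward plus an exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:                        # play every arm once before using the index
            return i
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
```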
Open Source Code | Yes | "All of our code is available at github.com/alon-albalak/FLAD."
Open Datasets | Yes | "We obtain all datasets from Hugging Face Datasets, and cast them to the text-to-text format by applying prompt templates from the Public Pool of Prompts (P3) [23] that was used to train T0."
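As a hypothetical illustration of this step, the sketch below loads one dataset from Hugging Face Datasets and applies a P3 prompt template via the promptsource library; the choice of SuperGLUE RTE and of the first available template are assumptions for the example, not details taken from the paper's code.

```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Illustrative only: dataset and template selection are assumptions.
dataset = load_dataset("super_glue", "rte", split="validation")
templates = DatasetTemplates("super_glue", "rte")
template = templates[templates.all_template_names[0]]

example = dataset[0]
input_text, target_text = template.apply(example)  # text-to-text (input, target) pair
print(input_text, "->", target_text)
```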
Dataset Splits | Yes | "For each dataset, we randomly sample five few-shot splits from their training data, containing the same number of training examples as previous works, between 20 and 70 [55, 56]. We further divide each split into equal training and validation partitions for true few-shot learning [57] (e.g. 10 train and 10 validation samples for HellaSwag)."
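A minimal sketch of such a split procedure, assuming a 20-example budget and arbitrary seeds (both are illustrative values, not the paper's exact choices):

```python
from datasets import load_dataset

# Hypothetical illustration of the split procedure described above.
full_train = load_dataset("hellaswag", split="train")
num_shots = 20
few_shot_splits = []
for seed in range(5):                                # five random few-shot splits
    sample = full_train.shuffle(seed=seed).select(range(num_shots))
    train = sample.select(range(num_shots // 2))                # e.g. 10 train examples
    valid = sample.select(range(num_shots // 2, num_shots))     # e.g. 10 validation examples
    few_shot_splits.append({"train": train, "validation": valid})
```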
Hardware Specification | Yes | "We train all models (FLAD and non-FLAD) on 40GB A100s."
Software Dependencies | No | "We used model checkpoints from Hugging Face Transformers [45]. For all experiments we use the Adafactor optimizer [58]."
Experiment Setup | Yes | "For the target-only baseline, we use learning rates in {1e-4, 3e-4}. For all other methods, we always use a learning rate of 1e-4. For target-, explore-, and exploit-only baselines we use batch sizes in {32, 128}. For loss-scaling, EXP3-FLAD, and UCB1-FLAD we use mini-batches of 8 samples and let G be in {4, 16} to match the batch size of all methods. For explore- and exploit-only, we use a target dataset mixing ratio of M ∈ {1, 5, 10}. For all experiments we use the Adafactor optimizer [58] and validation-based early stopping for model checkpoint selection."
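The sketch below enumerates the bandit-method portion of that grid (learning rate 1e-4, mini-batches of 8, G in {4, 16}, giving effective batches of 32 or 128) and instantiates Adafactor from Hugging Face Transformers; the dummy model and the specific Adafactor flags are assumptions made so the snippet runs standalone.

```python
from itertools import product
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(8, 8)  # stand-in for the actual T5/T0 checkpoint
grid = {"learning_rate": [1e-4], "mini_batch_size": [8], "G": [4, 16]}

for lr, mb, g in product(*grid.values()):
    # With a fixed learning rate, relative_step/scale_parameter are disabled
    # (an assumed configuration, not taken from the paper).
    optimizer = Adafactor(model.parameters(), lr=lr,
                          scale_parameter=False, relative_step=False,
                          warmup_init=False)
    # G mini-batches of `mb` samples give an effective batch of 32 or 128,
    # matching the batch sizes used by the non-bandit methods.
    print(f"lr={lr}, effective batch size={mb * g}")
```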