Teaching Smaller Language Models To Generalise To Unseen Compositional Questions

Authors: Tim Hartill, Neset TAN, Michael Witbrock, Patricia J. Riddle

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We establish strong baselines in this setting for diverse evaluation datasets (Strategy QA, Commonsense QA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context."
Researcher Affiliation | Academia | Tim Hartill (EMAIL), School of Computer Science, University of Auckland; Neset TAN (EMAIL), School of Computer Science, University of Auckland; Michael Witbrock (EMAIL), School of Computer Science, University of Auckland; Patricia J. Riddle (EMAIL), School of Computer Science, University of Auckland
Pseudocode | No | The paper describes methodologies and system components using descriptive text and a diagram (Figure 1), but it does not contain any explicitly labelled pseudocode or algorithm blocks.
Open Source Code | No | "Code, models and datasets will be released at https://github.com/timhartill/unseen_questions"
Open Datasets | Yes | "This criteria leads us to select six evaluation datasets: Strategy QA (Geva et al., 2021) contains commonsense samples requiring diverse multi-hop reasoning strategies. ... Musique (Trivedi et al., 2022a) is a multi-hop dataset focused on factual questions... IIRC (Ferguson et al., 2020) contains questions... ARC-DA (Bhakthavatsalam et al., 2021) is a question-only subset of ARC (Clark et al., 2018)... DROP (Dua et al., 2019) is a RC dataset... Commonsense QA (Talmor et al., 2019) contains samples..."
Dataset Splits | Yes | "For each evaluation dataset, where possible we report our results against other zero/few-shot work. ... To facilitate comparison with other zero-shot approaches we use the full training set for evaluation as per BIG-bench (Srivastava et al., 2022) (denoted SQA for question-only and SQAR for question plus our retrieval). ... Table 4: Commonsense QA development set performance comparison (Accuracy). ... Table 5: DROP development set performance comparison (F1). ... Table 7: IIRC test set evaluation (F1). ... Table 9: ARC-DA (test accuracy) and Musique (development F1) comparisons."
Hardware Specification | Yes | "All models are trained on one GPU (either an Nvidia RTX8000 or A100) except for the Retriever models which are trained on six 80GB A100 GPUs."
Software Dependencies | No | The paper mentions several software components, such as RoBERTa-base (Liu et al., 2019), ELECTRA-large (Clark et al., 2020), BART (Lewis et al., 2020), and Huggingface (Wolf et al., 2020) implementations, but it does not specify exact version numbers for these libraries or frameworks.
Experiment Setup | Yes | "All models are trained on one GPU (either an Nvidia RTX8000 or A100) except for the Retriever models which are trained on six 80GB A100 GPUs. All models are trained using mixed precision using a linear learning rate decay schedule. Initial learning rates and other hyperparameters are shown in Table 10. The optimiser used for the Retriever, Reranker and Evidence Set Scorer is Adam. All other models use AdamW. A maximum sequence length of 512 tokens was used for all models."
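The linear learning-rate decay schedule the paper reports can be sketched as a small standalone function. This is an illustrative sketch only, not the authors' code: the initial learning rate, total step count, and warmup steps below are placeholder assumptions, since the paper's actual hyperparameter values are given in its Table 10, which is not reproduced here.

```python
def linear_decay_lr(step, total_steps, init_lr, warmup_steps=0):
    """Learning rate at a given step under optional linear warmup
    followed by linear decay to zero.

    NOTE: init_lr, total_steps and warmup_steps are hypothetical
    placeholders; the paper's real values are in its Table 10.
    """
    if warmup_steps and step < warmup_steps:
        # Ramp up linearly from 0 to init_lr over the warmup phase.
        return init_lr * step / warmup_steps
    # Decay linearly from init_lr (at end of warmup) to 0 (at total_steps).
    remaining = max(total_steps - step, 0)
    decay_span = max(total_steps - warmup_steps, 1)
    return init_lr * remaining / decay_span

# Example with assumed values: halfway through training (no warmup),
# the learning rate is half the initial value.
lrs = [linear_decay_lr(s, total_steps=100, init_lr=2e-5) for s in (0, 50, 100)]
# -> [2e-05, 1e-05, 0.0]
```

In a real training loop this function would feed a framework scheduler (e.g. a per-step LR update on the optimiser), alongside the paper's other reported settings: mixed-precision training and a 512-token maximum sequence length.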