Teaching Smaller Language Models To Generalise To Unseen Compositional Questions

Authors: Tim Hartill, Neset TAN, Michael Witbrock, Patricia J. Riddle

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We establish strong baselines in this setting for diverse evaluation datasets (Strategy QA, Commonsense QA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context."
Researcher Affiliation | Academia | Tim Hartill (EMAIL), School of Computer Science, University of Auckland; Neset TAN (EMAIL), School of Computer Science, University of Auckland; Michael Witbrock (EMAIL), School of Computer Science, University of Auckland; Patricia J. Riddle (EMAIL), School of Computer Science, University of Auckland
Pseudocode | No | The paper describes methodologies and system components using descriptive text and a diagram (Figure 1), but it does not contain any explicitly labelled pseudocode or algorithm blocks.
Open Source Code | No | "Code, models and datasets will be released at https://github.com/timhartill/unseen_questions"
Open Datasets | Yes | "This criteria leads us to select six evaluation datasets: Strategy QA (Geva et al., 2021) contains commonsense samples requiring diverse multi-hop reasoning strategies. ... Musique (Trivedi et al., 2022a) is a multi-hop dataset focused on factual questions... IIRC (Ferguson et al., 2020) contains questions... ARC-DA (Bhakthavatsalam et al., 2021) is a question-only subset of ARC (Clark et al., 2018)... DROP (Dua et al., 2019) is a RC dataset... Commonsense QA (Talmor et al., 2019) contains samples..."
Dataset Splits | Yes | "For each evaluation dataset, where possible we report our results against other zero/few-shot work. ... To facilitate comparison with other zero-shot approaches we use the full training set for evaluation as per BIG-bench (Srivastava et al., 2022) (denoted SQA for question-only and SQAR for question plus our retrieval). ... Table 4: Commonsense QA development set performance comparison (Accuracy). ... Table 5: DROP development set performance comparison (F1). ... Table 7: IIRC test set evaluation (F1). ... Table 9: ARC-DA (test accuracy) and Musique (development F1) comparisons."
Hardware Specification | Yes | "All models are trained on one GPU (either an Nvidia RTX8000 or A100) except for the Retriever models which are trained on six 80GB A100 GPUs."
Software Dependencies | No | The paper mentions several software components, such as RoBERTa-base (Liu et al., 2019), ELECTRA-large (Clark et al., 2020), BART (Lewis et al., 2020), and Huggingface (Wolf et al., 2020) implementations, but it does not specify exact version numbers for these libraries or frameworks.
Experiment Setup | Yes | "All models are trained on one GPU (either an Nvidia RTX8000 or A100) except for the Retriever models which are trained on six 80GB A100 GPUs. All models are trained using mixed precision using a linear learning rate decay schedule. Initial learning rates and other hyperparameters are shown in Table 10. The optimiser used for the Retriever, Reranker and Evidence Set Scorer is Adam. All other models use AdamW. A maximum sequence length of 512 tokens was used for all models."
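The linear learning-rate decay schedule the paper reports can be sketched as a small standalone function. This is an illustrative sketch only, not the authors' code: the initial learning rate, total step count, and warmup steps below are placeholder assumptions, since the paper's actual hyperparameter values are given in its Table 10, which is not reproduced here.

```python
def linear_decay_lr(step, total_steps, init_lr, warmup_steps=0):
    """Learning rate at a given step under optional linear warmup
    followed by linear decay to zero.

    NOTE: init_lr, total_steps and warmup_steps are hypothetical
    placeholders; the paper's real values are in its Table 10.
    """
    if warmup_steps and step < warmup_steps:
        # Ramp up linearly from 0 to init_lr over the warmup phase.
        return init_lr * step / warmup_steps
    # Decay linearly from init_lr (at end of warmup) to 0 (at total_steps).
    remaining = max(total_steps - step, 0)
    decay_span = max(total_steps - warmup_steps, 1)
    return init_lr * remaining / decay_span

# Example with assumed values: halfway through training (no warmup),
# the learning rate is half the initial value.
lrs = [linear_decay_lr(s, total_steps=100, init_lr=2e-5) for s in (0, 50, 100)]
# -> [2e-05, 1e-05, 0.0]
```

In a real training loop this function would feed a framework scheduler (e.g. a per-step LR update on the optimiser), alongside the paper's other reported settings: mixed-precision training and a 512-token maximum sequence length.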