Compositionality Decomposed: How do Neural Networks Generalise?

Authors: Dieuwke Hupkes, Verna Dankers, Mathijs Mul, Elia Bruni

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To demonstrate the usefulness of this evaluation paradigm, we instantiate these five tests on a highly compositional data set which we dub PCFG SET and apply the resulting tests to three popular sequence-to-sequence models: a recurrent, a convolution-based and a transformer model. We provide an in-depth analysis of the results, which uncover the strengths and weaknesses of these three architectures and point to potential areas of improvement."
Researcher Affiliation | Academia | Dieuwke Hupkes, Institute for Logic, Language and Computation, University of Amsterdam, Science Park 107, 1098 XG Amsterdam; Verna Dankers and Mathijs Mul, University of Amsterdam, Science Park, 1098 XH Amsterdam; Elia Bruni, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona
Pseudocode | No | The paper describes its methodology and procedures in narrative text and figures (e.g., the grammar rules and interpretation functions in Figure 2 and Figure 3) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "The data and scripts to run these experiments as well as the trained models are all available online." (footnote 13: https://github.com/i-machine-think/am-i-compositional)
Open Datasets | Yes | "The data and scripts to run these experiments as well as the trained models are all available online." ... "To obtain these statistics, we use the English side of a large machine translation corpus: WMT 2017 (Bojar et al., 2017)."
Dataset Splits | Yes | "We use 85% of this corpus for training, 5% for validation and 10% for testing." ... "The training set contains 82 thousand input-output pairs, while the test set contains 10 thousand examples." ... "Sequences containing up to eight functions are collected in the training set, consisting of 81 thousand sequences, while input sequences containing at least nine functions are used for evaluation and collected in a test set containing 11 thousand sequences."
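The quoted 85%/5%/10% split could be reproduced with a short helper like the sketch below. The shuffling, seed, and toy corpus are illustrative assumptions, not the authors' exact procedure:

```python
import random

def split_corpus(pairs, train=0.85, val=0.05, seed=0):
    """Split input-output pairs into train/validation/test sets.

    Proportions follow the paper's reported 85/5/10 split; the
    shuffle and seed are assumptions made for illustration.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(n * train)
    n_val = int(n * val)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Toy corpus of 1000 placeholder input-output pairs.
corpus = [(f"src{i}", f"tgt{i}") for i in range(1000)]
train_set, val_set, test_set = split_corpus(corpus)
print(len(train_set), len(val_set), len(test_set))  # 850 50 100
```

Note that the paper's compositionality splits are not random: the length-generalisation test quoted above holds out all sequences with at least nine functions, so a sketch like this only covers the plain i.i.d. split.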
Hardware Specification | No | The paper describes the models and their training parameters but does not specify the hardware (e.g., CPU or GPU models) used to run the experiments.
Software Dependencies | No | "We use the LSTMS2S implementation of the OpenNMT-py framework (Klein et al., 2017)." ... "We train the network with the fairseq Python toolkit, using the predefined fconv_wmt_en_de architecture." ... "We use OpenNMT-py (Klein et al., 2017) to train the model..." The versions of these toolkits are not specified.
Experiment Setup | Yes | "We set the hidden layer size to 512, number of layers to 2 and the word embedding dimensionality to 512, matching their best setup for translation from English to German with the WMT 2017 corpus, which we used to shape the distribution of the PCFG SET data. We use mini-batches of 64 sequences and stochastic gradient descent with an initial learning rate of 0.1."
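The reported optimisation setup (mini-batches of 64, SGD with initial learning rate 0.1) can be sketched in plain Python. The scalar toy model and synthetic gradients below are assumptions for illustration only; the paper trains full seq2seq networks via OpenNMT-py and fairseq, not this code:

```python
import random

# Hyperparameters as reported in the paper.
HIDDEN_SIZE = 512      # hidden layer size
NUM_LAYERS = 2         # number of layers
EMBEDDING_DIM = 512    # word embedding dimensionality
BATCH_SIZE = 64        # mini-batch size
LEARNING_RATE = 0.1    # initial SGD learning rate

def sgd_step(w, grads, lr=LEARNING_RATE):
    """One SGD update: move w against the mean mini-batch gradient."""
    mean_grad = sum(grads) / len(grads)
    return w - lr * mean_grad

# Toy example: a single scalar parameter and synthetic per-example
# gradients drawn around 0.5 (both are illustrative assumptions).
random.seed(0)
w = 1.0
batch_grads = [random.gauss(0.5, 0.1) for _ in range(BATCH_SIZE)]
w = sgd_step(w, batch_grads)
print(w)  # roughly 0.95: the parameter moved by lr * mean gradient
```

The point of the sketch is only to make the quoted hyperparameters concrete; in practice these values would be passed to the toolkit's training command rather than hand-coded.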