Compositionality Decomposed: How do Neural Networks Generalise?

Authors: Dieuwke Hupkes, Verna Dankers, Mathijs Mul, Elia Bruni

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To demonstrate the usefulness of this evaluation paradigm, we instantiate these five tests on a highly compositional data set which we dub PCFG SET and apply the resulting tests to three popular sequence-to-sequence models: a recurrent, a convolution-based and a transformer model. We provide an in-depth analysis of the results, which uncover the strengths and weaknesses of these three architectures and point to potential areas of improvement."
Researcher Affiliation | Academia | Dieuwke Hupkes, Institute for Logic, Language and Computation, University of Amsterdam, Science Park 107, 1098 XG Amsterdam; Verna Dankers and Mathijs Mul, University of Amsterdam, Science Park, 1098 XH Amsterdam; Elia Bruni, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona
Pseudocode | No | The paper describes its methodology and procedures in narrative text and figures (e.g., the grammar rules and interpretation functions in Figure 2 and Figure 3) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "The data and scripts to run these experiments as well as the trained models are all available online." (footnote 13: https://github.com/i-machine-think/am-i-compositional)
Open Datasets | Yes | "The data and scripts to run these experiments as well as the trained models are all available online." ... "To obtain these statistics, we use the English side of a large machine translation corpus: WMT 2017 (Bojar et al., 2017)."
Dataset Splits | Yes | "We use 85% of this corpus for training, 5% for validation and 10% for testing." ... "The training set contains 82 thousand input-output pairs, while the test set contains 10 thousand examples." ... "Sequences containing up to eight functions are collected in the training set, consisting of 81 thousand sequences, while input sequences containing at least nine functions are used for evaluation and collected in a test set containing 11 thousand sequences."
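The quoted 85%/5%/10% split could be reproduced with a short helper like the sketch below. The shuffling, seed, and toy corpus are illustrative assumptions, not the authors' exact procedure:

```python
import random

def split_corpus(pairs, train=0.85, val=0.05, seed=0):
    """Split input-output pairs into train/validation/test sets.

    Proportions follow the paper's reported 85/5/10 split; the
    shuffle and seed are assumptions made for illustration.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(n * train)
    n_val = int(n * val)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Toy corpus of 1000 placeholder input-output pairs.
corpus = [(f"src{i}", f"tgt{i}") for i in range(1000)]
train_set, val_set, test_set = split_corpus(corpus)
print(len(train_set), len(val_set), len(test_set))  # 850 50 100
```

Note that the paper's compositionality splits are not random: the length-generalisation test quoted above holds out all sequences with at least nine functions, so a sketch like this only covers the plain i.i.d. split.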
Hardware Specification | No | The paper describes the models and their training parameters but does not specify the hardware (e.g., CPU or GPU models) used to run the experiments.
Software Dependencies | No | "We use the LSTMS2S implementation of the OpenNMT-py framework (Klein et al., 2017)." ... "We train the network with the fairseq Python toolkit, using the predefined fconv_wmt_en_de architecture." ... "We use OpenNMT-py (Klein et al., 2017) to train the model..." The versions of these toolkits are not specified.
Experiment Setup | Yes | "We set the hidden layer size to 512, number of layers to 2 and the word embedding dimensionality to 512, matching their best setup for translation from English to German with the WMT 2017 corpus, which we used to shape the distribution of the PCFG SET data. We use mini-batches of 64 sequences and stochastic gradient descent with an initial learning rate of 0.1."
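The reported optimisation setup (mini-batches of 64, SGD with initial learning rate 0.1) can be sketched in plain Python. The scalar toy model and synthetic gradients below are assumptions for illustration only; the paper trains full seq2seq networks via OpenNMT-py and fairseq, not this code:

```python
import random

# Hyperparameters as reported in the paper.
HIDDEN_SIZE = 512      # hidden layer size
NUM_LAYERS = 2         # number of layers
EMBEDDING_DIM = 512    # word embedding dimensionality
BATCH_SIZE = 64        # mini-batch size
LEARNING_RATE = 0.1    # initial SGD learning rate

def sgd_step(w, grads, lr=LEARNING_RATE):
    """One SGD update: move w against the mean mini-batch gradient."""
    mean_grad = sum(grads) / len(grads)
    return w - lr * mean_grad

# Toy example: a single scalar parameter and synthetic per-example
# gradients drawn around 0.5 (both are illustrative assumptions).
random.seed(0)
w = 1.0
batch_grads = [random.gauss(0.5, 0.1) for _ in range(BATCH_SIZE)]
w = sgd_step(w, batch_grads)
print(w)  # roughly 0.95: the parameter moved by lr * mean gradient
```

The point of the sketch is only to make the quoted hyperparameters concrete; in practice these values would be passed to the toolkit's training command rather than hand-coded.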