Learning Explanatory Rules from Noisy Data

Authors: Richard Evans, Edward Grefenstette

JAIR 2018

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We implemented our model in TensorFlow (Abadi et al., 2016) and tested it with three types of experiment. First, we used standard symbolic ILP tasks, where ∂ILP is given discrete, error-free input. Second, we modified the standard symbolic ILP tasks so that a certain proportion of the positive and negative examples are wilfully mislabelled. Third, we tested it with fuzzy, ambiguous data, connecting ∂ILP to the output of a pretrained convolutional neural network that classifies MNIST digits.
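The second condition above (wilfully mislabelling a fixed proportion of the positive and negative examples) could be sketched as follows. The function name and the swap-based corruption scheme are illustrative assumptions, not the paper's exact procedure:

```python
import random

def corrupt_labels(pos, neg, proportion, seed=0):
    """Mislabel `proportion` of the positive and negative examples by
    swapping them between the two sets (illustrative sketch only)."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    flip_pos = set(rng.sample(range(len(pos)), int(len(pos) * proportion)))
    flip_neg = set(rng.sample(range(len(neg)), int(len(neg) * proportion)))
    new_pos = ([e for i, e in enumerate(pos) if i not in flip_pos]
               + [e for i, e in enumerate(neg) if i in flip_neg])
    new_neg = ([e for i, e in enumerate(neg) if i not in flip_neg]
               + [e for i, e in enumerate(pos) if i in flip_pos])
    return new_pos, new_neg
```

With `proportion=0.2`, exactly 20% of each set ends up carrying the wrong label while the overall sizes of the positive and negative sets are preserved.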
Researcher Affiliation: Industry. Richard Evans (EMAIL), Edward Grefenstette (EMAIL); DeepMind, London, UK.
Pseudocode: No. The paper describes the methodology using narrative text, mathematical formulations, and architecture diagrams (e.g., Figure 1). There are no explicitly labelled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code: No. The paper states: "We implemented our model in TensorFlow (Abadi et al., 2016)". This only indicates that TensorFlow was used for the implementation, not that the authors' code for this work is openly available. There is no explicit statement about releasing the code, and no link to a repository.
Open Datasets: Yes. We tested ∂ILP on 20 ILP tasks, taken from four domains: arithmetic, lists, group theory, and family-tree relations. Some of the arithmetic examples appeared in the work of Cropper and Muggleton (2016). The list examples are used by Feser, Chaudhuri, and Dillig (2015). The family tree dataset comes from Wang, Mazaitis, and Cohen (2015) and is also used by Yang, Yang, and Cohen (2016). [...] Unlike symbolic ILP systems, ∂ILP is also able to handle ambiguous or fuzzy data. We tested ∂ILP by connecting it to a convolutional net trained on MNIST digits, and it was still able to learn effectively (see Section 5.5).
Dataset Splits: Yes. For validation and test, we use positive and negative examples of the even predicate on numbers greater than 10. [...] The training data was integers from 100 to 1024. The integers below 100 were held out as test data. [...] We ran the less-than experiment while holding out certain pairs of integers. Please note that we are not just holding out pairs of images. Rather, we are holding out pairs of integers, and removing from training every pair of images whose labels match that pair. [...] its performance is robust when holding out 70% of the data.
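The held-out-pairs split quoted above (holding out pairs of integers, not merely pairs of images) could be constructed along these lines. The function name and the random sampling of label pairs are assumptions for illustration, not the paper's exact code:

```python
import itertools
import random

def split_image_pairs(images, held_out_fraction, max_int=9, seed=0):
    """Hold out a fraction of (integer, integer) label pairs, then drop
    from training every *image* pair whose labels match a held-out pair.
    `images` is a list of (image, label) tuples (illustrative sketch)."""
    rng = random.Random(seed)
    label_pairs = list(itertools.product(range(max_int + 1), repeat=2))
    rng.shuffle(label_pairs)
    held_out = set(label_pairs[:int(len(label_pairs) * held_out_fraction)])
    train, test = [], []
    for (a, la), (b, lb) in itertools.product(images, repeat=2):
        # Route by the *label* pair: no image pair whose labels form a
        # held-out integer pair ever reaches the training set.
        (test if (la, lb) in held_out else train).append(((a, la), (b, lb)))
    return train, test, held_out
```

This is what distinguishes the split from an ordinary random split over image pairs: every image realisation of a held-out integer pair is excluded from training, so generalisation to held-out pairs cannot come from having seen other images with the same labels.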
Hardware Specification: No. We gave Metagol and ∂ILP the same fixed time limit (24 hours running on a standard workstation). This statement mentions a "standard workstation" but lacks specific details regarding CPU model, GPU type, or memory, which are necessary for hardware reproducibility.
Software Dependencies: No. We implemented our model in TensorFlow (Abadi et al., 2016). The paper mentions the use of TensorFlow but does not specify a version number for it or any other software component, which is necessary for reproducible dependency information.
Experiment Setup: Yes. We tried a range of optimisation algorithms: Stochastic Gradient Descent, Adam, AdaDelta, and RMSProp. We searched across a range of learning rates in {0.5, 0.2, 0.1, 0.05, 0.01, 0.001}. Weights were initialised randomly from a normal distribution with mean 0 and a standard deviation that ranged between 0 and 2 (the standard deviation was a hyperparameter; the mean was fixed). [...] we used RMSProp with a learning rate of 0.5, and initialised clause weights by sampling from a N(0, 1) distribution. [...] We train for 6000 steps, adjusting rule weights to minimise cross-entropy loss as described above. [...] Each step we sample a mini-batch from the positive and negative examples.
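The quoted hyperparameters (RMSProp at learning rate 0.5, weights drawn from N(0, 1), cross-entropy loss, a fresh mini-batch each of 6000 steps) can be sketched as a runnable loop. The logistic model and synthetic data below are stand-ins for ∂ILP's differentiable forward chaining, and the RMSProp decay and epsilon values are assumed defaults, not values reported in the paper:

```python
import numpy as np

# Toy, runnable sketch of the quoted setup: N(0, 1) weight
# initialisation, hand-rolled RMSProp with learning rate 0.5, and
# cross-entropy loss minimised over 6000 mini-batch steps.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # synthetic features (assumed)
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic labels (assumed)

w = rng.normal(0.0, 1.0, size=4)            # N(0, 1) initialisation
lr, decay, eps = 0.5, 0.9, 1e-8             # RMSProp hyperparameters
ms = np.zeros_like(w)                       # running mean of squared grads

for step in range(6000):
    idx = rng.choice(len(X), size=32)       # sample a mini-batch each step
    xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-np.clip(xb @ w, -30.0, 30.0)))
    grad = xb.T @ (p - yb) / len(xb)        # gradient of cross-entropy loss
    ms = decay * ms + (1 - decay) * grad ** 2
    w -= lr * grad / (np.sqrt(ms) + eps)    # RMSProp update
```

A learning rate of 0.5 would be very large for plain SGD, but RMSProp normalises each coordinate by the root of its running squared-gradient average, so per-step updates stay on the order of the learning rate itself.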