DeepMath - Deep Sequence Models for Premise Selection
Authors: Geoffrey Irving, Christian Szegedy, Alexander A. Alemi, Niklas Een, François Chollet, Josef Urban
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the effectiveness of neural sequence models for premise selection in automated theorem proving, one of the main bottlenecks in the formalization of mathematics. We propose a two-stage approach for this task that yields good results for the premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing state-of-the-art models. To our knowledge, this is the first time deep learning has been applied to theorem proving on a large scale. (Abstract) ... 6 Experiments (Section Title) |
| Researcher Affiliation | Collaboration | Alexander A. Alemi Google Inc. EMAIL François Chollet Google Inc. EMAIL Niklas Een Google Inc. EMAIL Geoffrey Irving Google Inc. EMAIL Christian Szegedy Google Inc. EMAIL Josef Urban Czech Technical University in Prague EMAIL |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. It provides architectural diagrams but not algorithmic steps in pseudocode format. |
| Open Source Code | No | The paper mentions using the 'TensorFlow framework [1] and the Keras library [5]' (Section 6.4), but it does not state that the authors are releasing their own source code for the methodology described in the paper. |
| Open Datasets | Yes | We use the Mizar Mathematical Library (MML) version 4.181.11473 as the formal corpus and E prover [32] version 1.9 as the underlying ATP system. (Section 3) ... For training and evaluation we use a subset of 32,524 out of 57,917 theorems that are known to be provable by an ATP given the right set of premises. (Section 6.1) |
| Dataset Splits | Yes | We split off a random 10% of these (3,124 statements) for testing and validation. Also, we held out 400 statements from the 3,124 for monitoring training progress, as well as for model and checkpoint selection. (Section 6.1) |
| Hardware Specification | Yes | The neural networks were trained using asynchronous distributed stochastic gradient descent using the Adam optimizer [24] with up to 20 parallel NVIDIA K-80 GPU workers per model. (Section 6.4) |
| Software Dependencies | No | The paper mentions using the 'TensorFlow framework [1] and the Keras library [5]' (Section 6.4), but it does not specify version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | All our neural network models use the general architecture from Fig 3: a classifier on top of the concatenated embeddings of an axiom and a conjecture. The same classifier architecture was used for all models: a fully-connected neural network with one hidden layer of size 1024. (Section 6.3) ... The character level models were trained with maximum sequence length 2048 characters, where the word-level (and definition embedding) based models had a maximum sequence length of 500 words. (Section 6.4) ... For E prover, we used auto strategy with a soft time limit of 90 seconds, a hard time limit of 120 seconds, a memory limit of 4 GB, and a processed clauses limit of 500,000. (Section 6.5) |
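The classifier described in Section 6.3 (a fully-connected network with one hidden layer of size 1024 applied to the concatenated axiom and conjecture embeddings) can be sketched as follows. This is a minimal illustration, not the authors' released code: the embedding dimension, initialization, ReLU nonlinearity, and sigmoid output are assumptions, since the paper fixes only the hidden-layer size and the concatenation structure.

```python
import numpy as np

EMBED_DIM = 256   # assumed embedding size; the paper does not fix this here
HIDDEN = 1024     # hidden layer size stated in Section 6.3

rng = np.random.default_rng(0)

# Randomly initialized weights stand in for trained parameters.
W1 = rng.standard_normal((2 * EMBED_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, 1)) * 0.01
b2 = np.zeros(1)

def relevance_score(axiom_emb, conjecture_emb):
    """Classifier on top of concatenated embeddings (cf. Fig 3):
    one hidden layer of size 1024, scalar relevance output."""
    x = np.concatenate([axiom_emb, conjecture_emb])
    h = np.maximum(0.0, x @ W1 + b1)        # assumed ReLU hidden layer
    logit = h @ W2 + b2
    return float(1.0 / (1.0 + np.exp(-logit)))  # assumed sigmoid output

score = relevance_score(rng.standard_normal(EMBED_DIM),
                        rng.standard_normal(EMBED_DIM))
```

With trained embedding models feeding the two inputs, thresholding this score would rank candidate premises for a given conjecture, which is the premise-selection step the table's setup quotes describe.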