Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

Authors: Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
LLM Response: "Our claims are justified by both theoretical and experimental results. ... numerical experiments on classification using neural networks (Section 5) and bandit problems (Section 6). For experiments on classification we consider the synthetic data generated by Neural Testbed (Osband et al., 2022a), and CIFAR10 (Krizhevsky, 2009), a widely used benchmark image dataset."
Researcher Affiliation: Industry
LLM Response: "Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy; Efficient Agent Team, DeepMind, Mountain View, CA"
Pseudocode: Yes
LLM Response: "Algorithm 1: Evaluation of ensemble agents on bandit problems"
Open Source Code: No
LLM Response: The paper states: "We modify the code from enn library at https://github.com/deepmind/enn and Neural Testbed library at https://github.com/deepmind/neural_testbed to model our agents and run experiments on Neural Testbed." This indicates that the authors used and modified existing open-source codebases, but the paper does not explicitly state that their specific implementations or modifications are released or available.
Open Datasets: Yes
LLM Response: "For experiments on classification we consider the synthetic data generated by Neural Testbed (Osband et al., 2022a), and CIFAR10 (Krizhevsky, 2009), a widely used benchmark image dataset."
Dataset Splits: No
LLM Response: The paper mentions varying training dataset sizes for CIFAR10 (e.g., "{10, 100, 1000, 50000}") and states that agents are "evaluated on the same (full) test dataset." For Neural Testbed, it mentions "training set sizes T = dr". These describe the sizes of the training data used, but the paper provides no train/validation/test split percentages, no absolute sample counts per split, and no citations to predefined splits, so the data partitioning cannot be reproduced beyond using the full dataset for training or testing.
Hardware Specification: Yes
LLM Response: "We run our experiments using 8-core CPU and 4 GB RAM instances on Google Cloud Compute. ... Each model is trained on a 2x2 TPU with a per-device batch size of 32."
Software Dependencies: No
LLM Response: The paper mentions modifying code from the "enn library" and the "Neural Testbed library" and provides URLs, but it does not specify version numbers for these libraries or for any other software components (e.g., Python, TensorFlow, or PyTorch versions) needed for reproduction.
Experiment Setup: Yes
LLM Response: "For mlp and ensemble-N agents, for a problem with input dimension D and temperature ρ, we choose the weight decay term (λ in Equation 9) from λ ∈ {0.1, 0.3, 1, 3, 10} · d/√ρ. For the ensemble-P agent, in addition to sweeping over the weight decay term, we also sweep over the prior scale of the additive prior functions. Specifically, for a problem with temperature ρ, we choose values from {0.3/√ρ, 0.3/ρ, 1/√ρ, 1/ρ, 3/√ρ, 3/ρ}. ... Each model is trained for 400 epochs. For training, we use an SGD optimizer with a learning rate schedule, with initial learning rates {0.0001, 0.001, 0.01, 0.025} for training dataset sizes {10, 100, 1000, 50000}; the learning rate is reduced to one-tenth after 200 epochs, one-hundredth after 300 epochs, and one-thousandth after 350 epochs.
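The quoted learning-rate schedule is concrete enough to sketch. Below is a minimal, illustrative Python rendering, assuming that epochs are indexed from 0 and that "reduced after N epochs" means the lower rate applies from epoch N onward; the function and constant names are our own, not the authors'.

```python
# Illustrative sketch of the CIFAR10 learning-rate schedule described in the
# paper's experiment setup. Assumptions: 0-indexed epochs; breakpoints at
# epochs 200, 300, and 350 as quoted. Names here are hypothetical.

# Initial learning rate paired with each training-dataset size, per the paper.
INITIAL_LR = {10: 0.0001, 100: 0.001, 1000: 0.01, 50000: 0.025}

def learning_rate(epoch: int, train_size: int) -> float:
    """Piecewise-constant schedule over the 400 training epochs."""
    base = INITIAL_LR[train_size]
    if epoch < 200:
        return base            # initial rate
    if epoch < 300:
        return base / 10       # one-tenth after 200 epochs
    if epoch < 350:
        return base / 100      # one-hundredth after 300 epochs
    return base / 1000         # one-thousandth after 350 epochs
```

For example, with the full 50,000-example training set the rate starts at 0.025 and drops to 0.0025 at epoch 200.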