Theoretical Limitations of Ensembles in the Age of Overparameterization

Authors: Niclas Dern, John Patrick Cunningham, Geoff Pleiss

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this divergence and verify recent empirical findings, we develop a theoretical characterization of ensembles in the overparameterized regime, with the goal of contrasting against (traditional) underparameterized ensembles. ... We verify and contextualize our theory with experiments on RF and neural network ensembles.
Researcher Affiliation | Academia | (1) School of Computation, Information and Technology, Technical University of Munich, Munich, Germany; (2) Department of Statistics, Columbia University, Zuckerman Institute, New York, USA; (3) Department of Statistics, University of British Columbia, Vancouver, Canada; (4) Vector Institute, Toronto, Canada. Correspondence to: Niclas Dern <EMAIL>.
Pseudocode | No | The paper describes methods verbally and mathematically but does not present a structured pseudocode block or algorithm.
Open Source Code | Yes | The code to run all our experiments can be found on GitHub: https://github.com/nic-dern/theoretical-limitations-overparameterized-ensembles. It contains a README.md file that explains how to set up and run the experiments.
Open Datasets | Yes | We validate these theoretical results with supporting experiments on RF and neural network ensembles, using synthetic data and the California Housing dataset (Kelley Pace & Barry, 1997) with various activation functions (detailed in Appx. A.1 and Appx. B).
Dataset Splits | Yes | We use the California Housing (Kelley Pace & Barry, 1997) dataset and sample distinct training and test points from it (randomly permuting the dataset initially). In this setting, we use N = 12, D = 200 if not specified otherwise. ... Training was performed on the same set of 12,000 samples from the California Housing dataset, with a validation set of 3,000 samples and a test set of 5,000 samples.
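The split procedure quoted above (permute once, then carve out disjoint train/validation/test sets) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the 12,000 / 3,000 / 5,000 sizes come from the quote, while the California Housing total of 20,640 samples and the seed are assumptions.

```python
import numpy as np

# Permute the full dataset once, then take disjoint index slices.
# Sizes (12,000 / 3,000 / 5,000) follow the quoted setup; the total
# sample count (20,640) and seed are assumed for illustration.
rng = np.random.default_rng(0)
perm = rng.permutation(20640)

train_idx = perm[:12000]
val_idx = perm[12000:15000]
test_idx = perm[15000:20000]

# Disjoint by construction, since the slices do not overlap.
print(len(train_idx), len(val_idx), len(test_idx))
```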
Hardware Specification | No | The paper mentions funding and general research institute affiliations, but does not specify any particular GPU, CPU, or other hardware models used for the experiments.
Software Dependencies | No | We used double precision for all computations and used the torch.linalg.lstsq function with the driver gelsd (for not-well-conditioned matrices) to solve linear systems. (This specifies 'torch' but not its version or other key software components with versions.)
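The quoted solver setup can be sketched as follows. This is an illustrative stand-in, not the paper's code: the sizes N = 12, D = 200 match the quoted small-sample setting, but the random design matrix and targets are assumptions.

```python
import torch

# Double precision plus torch.linalg.lstsq with driver="gelsd",
# an SVD-based driver suited to ill-conditioned matrices.
torch.manual_seed(0)
N, D = 12, 200
Phi = torch.randn(N, D, dtype=torch.float64)  # overparameterized design matrix
y = torch.randn(N, 1, dtype=torch.float64)

# With D > N, gelsd returns the minimum-norm least-squares solution.
w = torch.linalg.lstsq(Phi, y, driver="gelsd").solution

residual = torch.linalg.norm(Phi @ w - y)
print(residual.item())  # near zero: the training data is interpolated exactly
```

In the overparameterized regime (D > N) the system is underdetermined, so the choice of driver matters: gelsd picks the minimum-norm interpolating solution, which is the quantity of interest in random-feature analyses.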
Experiment Setup | Yes | For all our experiments with neural networks, we used a three-layer MLP with hidden layers of equal width and ReLU activations. Models were trained for 1000 epochs using SGD with momentum, a learning rate of 0.01, and a momentum decay of 0.9.
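The quoted architecture and optimizer translate directly into a short training sketch. This is an assumption-laden illustration, not the paper's code: "three-layer MLP" is read as two equal-width ReLU hidden layers plus an output layer, and the input/output dimensions, width, synthetic data, and shortened loop (versus the quoted 1000 epochs) are all placeholders.

```python
import torch
from torch import nn

# Three Linear layers: two equal-width ReLU hidden layers + linear output.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# SGD with momentum as quoted: lr = 0.01, momentum = 0.9.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

X = torch.randn(128, 8)   # placeholder data; the paper uses California Housing
y = torch.randn(128, 1)

losses = []
for _ in range(20):       # the paper trains for 1000 epochs; 20 steps shown
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```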