(Implicit) Ensembles of Ensembles: Epistemic Uncertainty Collapse in Large Models
Authors: Andreas Kirsch
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To investigate this hypothesis, we provide theoretical analysis and experiments that demonstrate uncertainty collapse in explicit ensembles of ensembles and show experimental evidence of similar collapse in wider models across various architectures, from simple MLPs to state-of-the-art vision models including ResNets and Vision Transformers. |
| Researcher Affiliation | Academia | Published in Transactions on Machine Learning Research (05/2025) |
| Pseudocode | No | No explicit pseudocode or algorithm block is present in the main text of the paper. |
| Open Source Code | No | The paper does not contain any explicit statement providing concrete access to the source code for the methodology described, nor does it provide any specific repository links. |
| Open Datasets | Yes | Datasets used: MNIST (LeCun & Cortes, 1998): standard handwritten digit dataset (in-distribution); Fashion-MNIST (Xiao et al., 2017): clothing item dataset (out-of-distribution); Dirty-MNIST (Mukhoti et al., 2023): MNIST with added noise (high aleatoric uncertainty) [...] CIFAR-10 (Krizhevsky et al., 2009): 10-class image classification dataset (in-distribution); SVHN (Netzer et al., 2011): Street View House Numbers dataset (out-of-distribution) [...] ImageNet-v2 dataset (Recht et al., 2019)... ImageNet-trained models (Russakovsky et al., 2015). |
| Dataset Splits | No | The paper mentions using well-known datasets like MNIST, CIFAR-10, and ImageNet-v2 test set, which typically have standard splits. However, it does not explicitly state the training, validation, or test split percentages or sample counts for any of these datasets, nor does it cite a specific source for the splits used. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, or cloud computing instance types) used to run the experiments. |
| Software Dependencies | No | The paper mentions using pre-trained models from PyTorch's torchvision and timm libraries, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For the MNIST experiments: Optimizer: SGD; Learning rate: 0.01; Batch size: 128; Epochs: 100; Loss function: cross-entropy; Dropout layers: applied after each hidden layer with p=0.1. For the CIFAR-10 experiments: Optimizer: SGD with momentum (0.9); Learning rate: 0.1, decayed by a factor of 10 at epochs 150 and 250; Weight decay: 5e-4; Batch size: 128; Epochs: 350; Loss function: cross-entropy. |
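The CIFAR-10 learning-rate schedule reported in the row above (0.1, decayed by a factor of 10 at epochs 150 and 250) can be sketched as a step-decay function. This is a minimal illustration in plain Python, not the authors' code, which is not released; the function name and defaults are ours, inferred from the hyperparameters in the table.

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(150, 250), gamma=0.1):
    """Step-decay schedule matching the reported CIFAR-10 setup:
    start at base_lr and multiply by gamma at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Learning rate over the reported 350-epoch run:
# epochs 0-149 -> 0.1, epochs 150-249 -> 0.01, epochs 250-349 -> 0.001
schedule = [lr_at_epoch(e) for e in (0, 149, 150, 249, 250, 349)]
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[150, 250]` and `gamma=0.1` wrapped around an `SGD` optimizer with `momentum=0.9` and `weight_decay=5e-4`, matching the other hyperparameters the paper reports.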