Identifying Drivers of Predictive Aleatoric Uncertainty
Authors: Pascal Iversen, Simon Witzke, Katharina Baum, Bernhard Y. Renard
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We substantiate our findings with a nuanced, quantitative benchmark including synthetic and real, tabular and image datasets. For this, we adapt metrics from conventional XAI research to uncertainty explanations. Overall, the proposed method explains uncertainty estimates with minimal modifications to the model architecture and outperforms more intricate methods in most settings. |
| Researcher Affiliation | Academia | 1Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany 2Freie Universität Berlin, Department of Mathematics and Computer Science, Berlin, Germany 3Windreich Department of Artificial Intelligence and Human Health & Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, USA EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and pipelines but does not include explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code for all experiments and to create the MNIST+U dataset is available online on GitHub: https://github.com/DILiS-lab/DroPAU |
| Open Datasets | Yes | We introduce the MNIST+U dataset that extends the original MNIST dataset [Deng, 2012] with an uncertainty component. We further make the MNIST+U dataset available separately on Zenodo: https://doi.org/10.5281/zenodo.15373739. In addition, we incorporate three standard regression benchmark datasets into our evaluation: UCI Wine Quality [Cortez et al., 2009], Ailerons [Torgo, 1999], and LSAT academic performance [Wightman, 1998]. |
| Dataset Splits | Yes | We sample n = 41,500 data points and concatenate both design matrices to attain the input X^(n×75) = [U^(n×5), V^(n×70)], which we split into 32,000 train, 8,000 validation, and 1,500 test instances. All datasets are split into 70% training, 10% validation, and 20% testing. We split the generated data into train, validation, and test sets consisting of 70%, 10%, and 20% of the data, respectively. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions software like 'Adam optimizer' and 'XGBoost' but does not specify version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | For the image benchmark, we apply a CNN with two parallel encoders where one predicts the mean and the other estimates the variance. Each encoder has two convolutional layers (16 and 32 filters), max-pooling, and fully connected hidden layers with dropout and 128, 64, and 32 nodes. We train the model with MSE for 16 epochs, then switch to GNLL loss until the validation loss converges. We use the Adam optimizer and a batch size of 256. We fit a deep neural network of four hidden layers with 64, 64, 64, and 32 units and two outputs for the mean and variance prediction. We train using dropout on the first two layers, Adam optimizer and a batch size of 64. We pre-train using the MSE and fine-tune the model using the GNLL as the loss function, selecting weights with the lowest validation loss. |
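The Experiment Setup cell describes pre-training with MSE and fine-tuning with the Gaussian negative log-likelihood (GNLL), where one output head predicts the mean and the other the variance. As a minimal numpy sketch (not the authors' implementation; the function name and the variance floor `eps` are illustrative assumptions), the per-sample GNLL objective can be written as:

```python
import numpy as np

def gaussian_nll(y_true, mu, var, eps=1e-6):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).

    mu is the mean-head output, var the variance-head output; var is
    clipped from below to keep the log and the division numerically stable.
    """
    var = np.clip(var, eps, None)
    return 0.5 * (np.log(var) + (y_true - mu) ** 2 / var)

# For a fixed residual, the loss is minimized when the predicted variance
# equals the squared error: with y - mu = 1, var = 1.0 scores lowest.
losses = [gaussian_nll(1.0, 0.0, v) for v in (0.5, 1.0, 2.0)]
```

This is why fine-tuning with the GNLL, after MSE pre-training has stabilized the mean head, drives the variance head toward a calibrated estimate of the aleatoric noise rather than a trivially large or small value.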