Identifying Drivers of Predictive Aleatoric Uncertainty
Authors: Pascal Iversen, Simon Witzke, Katharina Baum, Bernhard Y. Renard
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We substantiate our findings with a nuanced, quantitative benchmark including synthetic and real, tabular and image datasets. For this, we adapt metrics from conventional XAI research to uncertainty explanations. Overall, the proposed method explains uncertainty estimates with minimal modifications to the model architecture and outperforms more intricate methods in most settings. |
| Researcher Affiliation | Academia | 1Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany 2Freie Universität Berlin, Department of Mathematics and Computer Science, Berlin, Germany 3Windreich Department of Artificial Intelligence and Human Health & Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, USA EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and pipelines but does not include explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code for all experiments and to create the MNIST+U dataset is available online on GitHub: https://github.com/DILiS-lab/DroPAU |
| Open Datasets | Yes | We introduce the MNIST+U dataset that extends the original MNIST dataset [Deng, 2012] with an uncertainty component. We further make the MNIST+U dataset available separately on Zenodo: https://doi.org/10.5281/zenodo.15373739. In addition, we incorporate three standard regression benchmark datasets into our evaluation: UCI Wine Quality [Cortez et al., 2009], Ailerons [Torgo, 1999], and LSAT academic performance [Wightman, 1998]. |
| Dataset Splits | Yes | We sample n = 41,500 data points and concatenate both design matrices to attain the input X^(n×75) = [U^(n×5), V^(n×70)], which we split into 32,000 train, 8,000 validation, and 1,500 test instances. All datasets are split into 70% training, 10% validation, and 20% testing. We split the generated data into train, validation, and test sets consisting of 70%, 10%, and 20% of the data, respectively. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions software like 'Adam optimizer' and 'XGBoost' but does not specify version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | For the image benchmark, we apply a CNN with two parallel encoders where one predicts the mean and the other estimates the variance. Each encoder has two convolutional layers (16 and 32 filters), max-pooling, and fully connected hidden layers with dropout and 128, 64, and 32 nodes. We train the model with MSE for 16 epochs, then switch to GNLL loss until the validation loss converges. We use the Adam optimizer and a batch size of 256. We fit a deep neural network of four hidden layers with 64, 64, 64, and 32 units and two outputs for the mean and variance prediction. We train using dropout on the first two layers, Adam optimizer and a batch size of 64. We pre-train using the MSE and fine-tune the model using the GNLL as the loss function, selecting weights with the lowest validation loss. |
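The Experiment Setup cell describes pre-training with MSE and fine-tuning with the Gaussian negative log-likelihood (GNLL), where one output head predicts the mean and the other the variance. As a minimal numpy sketch (not the authors' implementation; the function name and the variance floor `eps` are illustrative assumptions), the per-sample GNLL objective can be written as:

```python
import numpy as np

def gaussian_nll(y_true, mu, var, eps=1e-6):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).

    mu is the mean-head output, var the variance-head output; var is
    clipped from below to keep the log and the division numerically stable.
    """
    var = np.clip(var, eps, None)
    return 0.5 * (np.log(var) + (y_true - mu) ** 2 / var)

# For a fixed residual, the loss is minimized when the predicted variance
# equals the squared error: with y - mu = 1, var = 1.0 scores lowest.
losses = [gaussian_nll(1.0, 0.0, v) for v in (0.5, 1.0, 2.0)]
```

This is why fine-tuning with the GNLL, after MSE pre-training has stabilized the mean head, drives the variance head toward a calibrated estimate of the aleatoric noise rather than a trivially large or small value.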