Streamlining Prediction in Bayesian Deep Learning
Authors: Rui Li, Marcus Klasson, Arno Solin, Martin Trapp
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase our approach for both MLPs and transformers, such as ViT and GPT-2, and assess its performance on regression and classification tasks. ... Contributions: ... (iii) Finally, we present an empirical assessment of our approach on regression and classification tasks, and showcase its utility for uncertainty quantification, out-of-domain detection, and sensitivity analysis (Sec. 4). ... 4 EXPERIMENTS We demonstrate the practical applicability of our approach on classification/regression tasks (Sec. 4.1), large-scale classification results with ViT/GPT models (Sec. 4.2), and sensitivity estimation (Sec. 4.3). Additional experiments and experimental results can be found in App. B. |
| Researcher Affiliation | Academia | Rui Li Marcus Klasson Arno Solin Martin Trapp Department of Computer Science, Aalto University, Finland {firstname.lastname}@aalto.fi |
| Pseudocode | No | The paper describes its methodology using mathematical equations and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Open-source library: https://github.com/AaltoML/SUQ. |
| Open Datasets | Yes | Data sets We use a selection of data sets from the UCI repository (Kelly et al., 2023) for the regression experiments. For classification, we experiment on MNIST (LeCun et al., 1998), FMNIST (Xiao et al., 2017), as well as the 11-class data sets OrganCMNIST and OrganSMNIST from MedMNIST (Yang et al., 2023). ... We experiment with CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), DTD (Cimpoi et al., 2014), RESISC (Cheng et al., 2017) and a subsampled version of ImageNet-R (Hendrycks et al., 2021) ... For the GPT model, we used the BOOLQ, WIC, and MRPC tasks from the GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks. |
| Dataset Splits | Yes | Regression We experiment on a selection of data sets from the UCI repository and run a 5-fold cross validation to report results for each data set. ... For our method, we fit an additional scaling factor on the predictive variance by minimising the NLPD on a validation set... |
| Hardware Specification | Yes | We acknowledge CSC IT Center for Science, Finland, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through CSC. ... B.7 RUNTIME EXPERIMENT ... We ran experiments on an NVIDIA H100 80GB GPU for 400 data points, a batch size of one, and for each data point we repeated the measurement ten times. |
| Software Dependencies | No | The paper mentions several software components like Hugging Face Transformers, torch-laplace library, and IVON, but it does not specify exact version numbers for these dependencies. |
| Experiment Setup | Yes | Posterior approximations ... For the MFVI and LA sampling baselines, we used 1,000 MC samples in the regression and classification experiments in Sec. 4.1, and 50 MC samples for the ViT and GPT-2 in Sec. 4.2. ... B.3 IMAGE PIXEL SENSITIVITY We trained a 4-layer MLP classifier on MNIST digits zero and eight using a batch size of 64, a learning rate of 1e-3, weight decay set to 1e-5, and for 50 epochs. ... The optimisation was performed for each image independently and using Adam with a learning rate of 5e-3 until the validation loss dropped below a divergence to the initial loss of 1e-2. |
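The "Dataset Splits" row quotes the paper's calibration step: fitting an additional scaling factor on the predictive variance by minimising the NLPD on a validation set. A minimal sketch of that idea, assuming a Gaussian predictive distribution with per-point mean `mu` and variance `var` (the function names and the closed-form minimiser are illustrative, not taken from the paper's SUQ library):

```python
import numpy as np

def fit_variance_scale(mu, var, y):
    """Fit a scalar s > 0 so that s * var minimises the average Gaussian
    NLPD on validation targets y. For a Gaussian predictive density the
    minimiser has the closed form s = mean((y - mu)^2 / var)."""
    return float(np.mean((y - mu) ** 2 / var))

def gaussian_nlpd(mu, var, y):
    """Average Gaussian negative log predictive density."""
    return float(np.mean(0.5 * np.log(2 * np.pi * var)
                         + (y - mu) ** 2 / (2 * var)))

# Toy check: an over-confident model (variance too small) improves
# after rescaling on held-out data.
rng = np.random.default_rng(0)
mu = np.zeros(400)
y = rng.normal(0.0, 2.0, size=400)   # true noise std is 2.0
var = np.full(400, 0.5)              # under-estimated variance
s = fit_variance_scale(mu, var, y)
assert gaussian_nlpd(mu, s * var, y) <= gaussian_nlpd(mu, var, y)
```

A single multiplicative factor keeps the ranking of predictive uncertainties intact and only adjusts their overall magnitude, which is why it can be fit on a small validation set without risk of overfitting.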