Debiasing Mini-Batch Quadratics for Applications in Deep Learning

Authors: Lukas Nicola Tatzel, Bálint Mucsányi, Osane Hackel, Philipp Hennig

ICLR 2025

Reproducibility Assessment (variable, result, and supporting excerpt from the paper)
Research Type: Experimental
Evidence: "In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies. [...] 6 EXPERIMENTS In this section, we evaluate the effectiveness of the debiasing strategies from Section 4.2."
Researcher Affiliation: Academia
Evidence: "Lukas Tatzel, Bálint Mucsányi, Osane Hackel & Philipp Hennig, Tübingen AI Center, University of Tübingen, Tübingen, Germany, EMAIL"
Pseudocode: Yes
Evidence: "Algorithm 1: Method of conjugate gradients (CG), based on (Nocedal & Wright, 2006, Alg. 5.2)"
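The cited Algorithm 1 is the standard conjugate-gradient iteration for solving Ax = b with a symmetric positive-definite A. A minimal pure-Python sketch is given below; this is an illustration of the textbook method, not the authors' implementation. The matrix is passed as a matrix-vector-product callable, which matches the deep-learning setting where the (mini-batch) Hessian is only accessible through Hessian-vector products:

```python
def cg(matvec, b, tol=1e-10, max_iter=None):
    """Conjugate gradients for A x = b, with A given only via `matvec(v) = A @ v`.

    Assumes A is symmetric positive definite. Returns the approximate solution
    once the residual norm falls below `tol` or after `max_iter` iterations
    (defaults to the problem dimension, the exact-arithmetic worst case).
    """
    n = len(b)
    if max_iter is None:
        max_iter = n
    x = [0.0] * n          # start from the zero vector
    r = list(b)            # residual r = b - A x = b for x = 0
    p = list(r)            # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        # New direction: residual plus a correction keeping directions A-conjugate.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

For a 2x2 system the method terminates in at most two iterations, which makes the sketch easy to check against a direct solve.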
Open Source Code: No
Evidence: The paper discusses using existing open-source tools like DEEPOBS (Schneider et al., 2019), PYTORCH (Paszke et al., 2019), and BACKPACK (Dangel et al., 2020), but it does not provide an explicit statement or link for the authors' own implementation code for the methodology described.
Open Datasets: Yes
Evidence: "We use the datasets CIFAR-10 (with C = 10 classes) and CIFAR-100 (with C = 100 classes) (Krizhevsky, 2009). [...] We also use the IMAGENET dataset (Deng et al., 2009) which contains images from C = 1000 different classes."
Dataset Splits: Yes
Evidence: "Each dataset contains 60,000 data points that are split into 40,000 training samples, 10,000 validation samples and 10,000 test samples. For the experiments on out-of-distribution (OOD) data, we create the datasets CIFAR-10-C and CIFAR-100-C, each containing 10,000 images, as described in (Hendrycks & Dietterich, 2019)."
Hardware Specification: No
Evidence: The paper describes various experimental setups, models, and training procedures, but it does not specify any particular hardware components such as GPU models, CPU types, or memory amounts used for running the experiments.
Software Dependencies: No
Evidence: "We use DEEPOBS (Schneider et al., 2019) on top of PYTORCH (Paszke et al., 2019) as our general benchmarking framework as it provides easy access to a variety of datasets and model architectures. [...] Our implementation uses BACKPACK (Dangel et al., 2020) that provides access to products with the Hessian..."
Experiment Setup: Yes
Evidence: "(A) ALL-CNN-C on CIFAR-100. [...] We train the model with SGD (learning rate 0.171234) with batch size 256 for 350 epochs. Weight decay β = 0.0005 is used on the weights but not the biases of the model. (B) ALL-CNN-C on CIFAR-10. [...] The model is trained with SGD (learning rate 0.025, momentum 0.9) with batch size 256 for 350 epochs. The learning rate is reduced by a factor of 10 at epochs 200, 250 and 300, as suggested in (Springenberg et al., 2015). Weight decay β = 0.001 is used on the weights but not the biases of the model."
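The step schedule in setup (B) — base learning rate 0.025, reduced by a factor of 10 at epochs 200, 250, and 300 — can be sketched as a small helper function. This is an illustrative reconstruction of the stated schedule (the function name and interface are not from the paper; in a PyTorch pipeline the equivalent would typically be `torch.optim.lr_scheduler.MultiStepLR`):

```python
def lr_at_epoch(epoch, base_lr=0.025, milestones=(200, 250, 300), factor=0.1):
    """Step schedule: multiply the learning rate by `factor` at each milestone epoch.

    Defaults reproduce setup (B): 0.025 for epochs 0-199, then 0.0025,
    0.00025, and 0.000025 from epochs 200, 250, and 300 onward.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```

Note that the quoted setup applies weight decay to the weights but not the biases; in PyTorch this is usually realized by passing two parameter groups to the optimizer, one with `weight_decay` set and one with it zeroed.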