MC Layer Normalization for calibrated uncertainty in Deep Learning
Authors: Thomas Frick, Diego Antognini, Ioana Giurgiu, Benjamin F Grewe, Cristiano Malossi, Rong J.B. Zhu, Mattia Rigotti
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of our module, we conduct experiments in two distinct settings. First, we investigate its potential to replace existing methods such as MC-Dropout and Prediction-Time Batch Normalization. Second, we explore its suitability for use cases where such conventional modules are either unsuitable or sub-optimal for certain tasks... We empirically demonstrate the competitiveness of our module in terms of prediction accuracy and uncertainty calibration on established out-of-distribution image classification benchmarks, as well as its flexibility by applying it on tasks and architectures where previous methods are unsuitable. |
| Researcher Affiliation | Collaboration | Thomas Frick (IBM Research, ETH Zurich); Diego Antognini (IBM Research); Ioana Giurgiu (IBM Research); Benjamin Grewe (ETH Zurich); Cristiano Malossi (IBM Research); Rong J.B. Zhu (Fudan University); Mattia Rigotti (IBM Research) |
| Pseudocode | Yes | We summarize the previous results in pseudocode snippets detailing the use of MC-Layer Norm in practice for training neural networks with SGD with backpropagation (algorithm 1), and at prediction time (algorithm 2). Algorithm 1 MC-Layer Norm module (training mode) Algorithm 2 MC-Layer Norm module (eval mode) |
| Open Source Code | Yes | Code implementing our MC-Layer Norm module can be found here https://github.com/IBM/mc-layernorm. |
| Open Datasets | Yes | model calibration metrics (accuracy, ECE, and Brier score) are evaluated on CIFAR-10-C, Tiny ImageNet-C, and ImageNet-C introduced in Hendrycks & Dietterich (2019) (CC BY 4.0 license). These datasets are corrupted versions of their original test sets (CIFAR-10 (Krizhevsky, 2009), Tiny ImageNet (Le & Yang, 2017), and ImageNet (Deng et al., 2009))... For the experiments, we train models on the Criteo Display Advertising Challenge (Criteo Labs, 2014). |
| Dataset Splits | Yes | We assess model calibration for three different scenarios: 1. In-distribution (Test): The original test set of the corresponding dataset is used as a baseline for all metrics. 2. Out-of-distribution (OOD): The metrics are evaluated on the corrupted C-variant test sets (severity 5). 3. Zero-shot prediction-time domain adaptation (Mix): ...create batches of size N consisting of N − 1 in-distribution samples and a single out-of-distribution sample (we use the same batch size as in the OOD setting, N = 128). ...For each dataset, a fixed set of samples from the training set is set aside as a calibration set. |
| Hardware Specification | Yes | All models were trained on an internal cluster consisting of NVIDIA A100 GPUs. |
| Software Dependencies | Yes | All models are trained with the AdamW optimizer provided by the timm library Wightman et al. (2023)... We leverage the current state-of-the-art model, MaskNet (Wang et al., 2021), relying on the implementation from (Zhu (2023), Apache-2.0 license). For Temperature scaling (Guo et al., 2017) we use parameters proposed by Pleiss (2024): NLL loss, LBFGS as the optimizer with a learning rate of 0.01, running for a maximum of 50 iterations. |
| Experiment Setup | Yes | All models are trained with the AdamW optimizer provided by the timm library Wightman et al. (2023) using default parameters for β1, β2 and ϵ. We use grid-search to tune the learning rate from values between [1e-3, 1e-5] and the weight decay from values between [0.1, 1e-4]. For CIFAR-10 and Tiny ImageNet, we run fine-tuning for 20 epochs with a batch size of 512 and a constant learning rate. Meanwhile, for ImageNet, we run fine-tuning for 5 epochs with a batch size of 256 and a constant learning rate. For Masksembles models we extend training to 20 epochs, as the required changes to the network are much bigger than for the other methods. Finally, for Criteo we run training for 100 epochs with a batch size of 1000, reducing the learning rate by a factor of 0.1 on a plateau with patience 2, checked on the validation set. |
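The pseudocode row above refers to a stochastic normalization module used in training and eval modes. The paper's exact MC-Layer Norm algorithm is not reproduced in this table, but the prediction-time pattern it shares with MC-Dropout and Prediction-Time Batch Normalization can be sketched generically: keep the stochastic module active at eval time and average softmax outputs over several forward passes. A minimal numpy sketch of that generic Monte Carlo prediction loop (the toy dropout layer, shapes, and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def stochastic_forward(x, W, b, drop_p=0.1):
    """One forward pass with the stochastic module (here: a toy dropout
    layer) left ON at eval time, as in MC-Dropout-style sampling."""
    mask = rng.random(x.shape) >= drop_p
    h = np.where(mask, x, 0.0) / (1.0 - drop_p)  # inverted-dropout scaling
    return softmax(h @ W + b)

def mc_predict(x, W, b, n_samples=8):
    """Average class probabilities over n_samples stochastic passes;
    the spread across passes can serve as an uncertainty signal."""
    probs = np.stack([stochastic_forward(x, W, b) for _ in range(n_samples)])
    return probs.mean(axis=0)

# Toy example: batch of 4 inputs with 5 features, 3 classes.
x = rng.standard_normal((4, 5))
W = rng.standard_normal((5, 3))
b = np.zeros(3)
p = mc_predict(x, W, b)
```

The averaged `p` remains a valid probability distribution per sample, which is what the calibration metrics in the table are computed on.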
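The evaluation rows cite accuracy, ECE, and Brier score as calibration metrics. These are standard and easy to state precisely; a self-contained numpy sketch using the common equal-width-bin formulation of ECE (bin count and function names are our choices, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then take the weighted average
    of |accuracy - confidence| over the bins."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def brier_score(probs, labels):
    """Multi-class Brier score: mean squared error against one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

# Sanity check: perfectly confident, perfectly correct predictions
# give ECE = 0 and Brier score = 0.
probs = np.eye(3)[[0, 1, 2]]
labels = np.array([0, 1, 2])
```

Both metrics are computed on the averaged Monte Carlo probabilities, which is why calibration (not just accuracy) can improve under multi-sample prediction.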
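The software-dependencies row describes temperature scaling fitted with NLL loss and LBFGS (learning rate 0.01, up to 50 iterations) on a held-out calibration set. A sketch of the same idea using scipy's L-BFGS-B in place of the torch LBFGS optimizer the paper mentions; the synthetic overconfident logits and helper names are our illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, max_iter=50):
    """Fit a single scalar T > 0 minimizing NLL on the calibration set."""
    res = minimize(lambda t: nll(logits, labels, t[0]), x0=[1.0],
                   bounds=[(1e-3, None)], method="L-BFGS-B",
                   options={"maxiter": max_iter})
    return res.x[0]

# Synthetic overconfident model: ~70% accurate but near-certain logits,
# so the fitted temperature should come out above 1 (softening predictions).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
pred = labels.copy()
flip = rng.random(200) < 0.3
pred[flip] = (pred[flip] + 1) % 3
logits = np.eye(3)[pred] * 5.0
T = fit_temperature(logits, labels)
```

Because temperature scaling only rescales logits, it changes confidence (and hence ECE and NLL) without affecting the argmax, so accuracy is untouched.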