ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference

Authors: Nikita Durasov, Nik Dorndorf, Hieu Le, Pascal Fua

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our approach on several classification and regression tasks. We show that it delivers results on par with those of ensembles but at a much lower computational cost. (Section 4, Experiments) We first introduce our metrics and baselines. We then use simple synthetic data to illustrate the behavior of ZigZag. Next, we turn to image datasets often used to test uncertainty-estimation algorithms. Finally, we present real-world applications. Implementation details about the baselines, metrics, training setups, and hyper-parameters can be found in the appendix.
Researcher Affiliation | Academia | Nikita Durasov (Computer Vision Laboratory, EPFL); Nik Dorndorf (RWTH Aachen); Hieu Le (Computer Vision Laboratory, EPFL); Pascal Fua (Computer Vision Laboratory, EPFL)
Pseudocode | No | The paper describes the methodology in prose, in sections such as '3.3 Inference', but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | No explicit statement of, or link to, open-source code for the described methodology is present in the paper. The 'Reviewed on OpenReview' link points to a review forum, not a code repository.
Open Datasets | Yes | We now compare ZigZag against the baselines on the widely used benchmark datasets MNIST vs FMNIST, CIFAR vs SVHN, and ImageNet vs ImageNet-O. We train the networks on MNIST (LeCun et al., 1998) and compute the accuracy and calibration metrics. We then use the uncertainty measure they produce to classify images from the test sets of MNIST and Fashion MNIST (Xiao et al., 2017) as being within the MNIST distribution or not, to compute the OOD metrics introduced above. We ran a similar experiment with the CIFAR10 (Krizhevsky et al., 2014) and SVHN (Netzer et al., 2011) datasets. We experimented with the ImageNet dataset (Russakovsky et al., 2015) and its counterpart, ImageNet-O (Hendrycks et al., 2021). First, we consider image-based age prediction from face images. To this end, we use UTKFace (Zhang et al., 2017), a large-scale dataset. As in the classification experiments described above, we use the iCartoonFace (Zheng et al., 2020) dataset as out-of-distribution data. We collected a dataset of 2k wing profiles such as those of Fig. 8 by sampling the widely used NACA parameters (Jacobs & Sherman, 1937). We performed a similar experiment on 3D car models from a subset of the ShapeNet dataset (Chang et al., 2015).
Dataset Splits | Yes | We took the 5% of top-performing shapes in terms of lift-to-drag ratio to be the out-of-distribution samples. We took 80% of the remaining 95% as our training set and the rest as our test set. Hence, training and testing shapes span lift-to-drag values from 0 to 60, whereas everything beyond that is considered to be OOD and therefore not used for training purposes. For OOD detection in MNIST experiments, using Fashion MNIST, which contains images markedly different from MNIST, is a common benchmark. To enhance our evaluation with a more challenging setup, we conducted additional MNIST experiments, referred to as MNIST-S, using digits 0-4 as in-distribution and 5-9 as OOD.
Hardware Specification | Yes | We also report Inference Time, which represents how much time the model takes to compute uncertainties relative to single-model inference, on a Tesla V100 and without considering parallelization for sampling-based approaches.
Software Dependencies | No | The paper mentions specific optimizers and architectural components like Adam, SGD, ELU activations, Batch Norms, and various network architectures (VGG, ResNet, Transformer, DLA, GMM, Graph Norm), often citing the original papers. However, it does not provide specific version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) used in the implementation.
Experiment Setup | Yes | Synthetic Regression: For our synthetic regression experiments... We train the model for 4000 epochs using the Adam (Kingma & Ba, 2015) optimizer with a 10^-2 learning rate and mean squared error loss. Synthetic Classification: For synthetic classification experiments... As for regression, we apply the Adam optimizer for 300 epochs with a 10^-2 learning rate. MNIST: The model used for MNIST experiments... We also train this model using the Adam optimizer for three epochs with a 10^-2 learning rate. CIFAR: For CIFAR experiments... the network is trained with the SGD optimizer with 0.9 momentum for 20 epochs with a 10^-1 learning rate and 10 more epochs with 10^-2. For all four sampling-based approaches, we use five samples to estimate the uncertainty at inference time. MC-Dropout, BatchEnsemble, and Masksembles are applied to the last two layers of the model, with a 0.2 drop rate for MC-Dropout and a 1.5 scale factor for Masksembles. Airfoils Lift-to-Drag: ... The model is trained for 10 epochs with the Adam optimizer and a 10^-3 learning rate. Estimating Car Drag: ... The final model is trained for 100 epochs with the Adam optimizer and a 10^-3 learning rate. For the pressure prediction task... the model is trained for 1500 epochs with the Adam optimizer and a 10^-3 learning rate.
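Although the paper contains no pseudocode, the two-step inference named in the title can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the model is assumed to accept its own previous prediction as an extra input, and `make_toy_model` uses made-up weights purely to make the mechanics concrete.

```python
def make_toy_model():
    # Hypothetical stand-in for a trained network that takes its own
    # previous prediction as an additional input (weights are invented).
    def f(x, y_feedback):
        return 0.9 * x + 0.1 * y_feedback
    return f

def zigzag_predict(f, x, y0=0.0):
    """Two-step inference sketch: run the model once with a constant
    placeholder feedback, then again feeding back the first prediction;
    the discrepancy between the two outputs serves as the uncertainty."""
    y1 = f(x, y0)   # first pass: constant placeholder as feedback
    y2 = f(x, y1)   # second pass: feed the first prediction back in
    return y2, abs(y2 - y1)

pred, unc = zigzag_predict(make_toy_model(), x=2.0)
```

In-distribution inputs, on which the second pass reproduces the first, yield a small discrepancy; unfamiliar inputs yield a large one, at the cost of only two forward passes instead of one per ensemble member.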
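The airfoil split protocol quoted in the Dataset Splits row (top 5% lift-to-drag as OOD, then 80/20 on the remaining 95%) can be checked with a few lines of arithmetic. The function name and the assumption that "2k" means 2000 shapes are ours.

```python
def airfoil_split(n_total=2000, ood_frac=0.05, train_frac=0.80):
    # Top 5% lift-to-drag shapes held out as OOD, then an 80/20
    # train/test split of the remaining 95% (per the reported protocol).
    n_ood = round(n_total * ood_frac)
    n_rest = n_total - n_ood
    n_train = round(n_rest * train_frac)
    n_test = n_rest - n_train
    return n_ood, n_train, n_test
```

For the ~2k wing profiles mentioned in the paper, this gives 100 OOD, 1520 training, and 380 test shapes.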