Deep Classifiers with Label Noise Modeling and Distance Awareness

Authors: Vincent Fortuin, Mark Collier, Florian Wenzel, James Urquhart Allingham, Jeremiah Zhe Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

TMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted experiments on synthetic data and standard image classification benchmarks. As baselines, we compare against a standard deterministic model (He et al., 2016; Dosovitskiy et al., 2020), the heteroscedastic method (Collier et al., 2021), and SNGP (Liu et al., 2020). We also compare against the Posterior Network model (Charpentier et al., 2020), which also offers distance-aware uncertainties but only applies to problems with few classes. Note that the main motivation for the experiments is to assess whether combining the SNGP and heteroscedastic methods successfully yields the complementary benefits of the two.
Researcher Affiliation Collaboration Vincent Fortuin (EMAIL): Department of Computer Science, ETH Zürich; Department of Engineering, University of Cambridge. Mark Collier (EMAIL): Google Research. Florian Wenzel, James Allingham, Jeremiah Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou: Google Research.
Pseudocode Yes

Algorithm 1: HetSNGP training
Require: dataset D = {(x_i, y_i)}_{i=1}^N
Initialize θ, φ, W, b, Σ̂, β̂
for train_step = 1 to max_step do
    Take minibatch (X_i, y_i) from D
    for s = 1 to S do
        ε_K^s ~ N(0, I_K), ε_R^s ~ N(0, I_R)
        u_{i,c}^s = Φ_i⊤ β̂_c + d(x_i) ε_K^s + V(x_i) ε_R^s
    end for
    L = −(1/S) Σ_{s=1}^S log p(X_i, y_i | u^s) + ‖β̂‖²
    Update {θ, φ, β̂} via SGD on L
    if final_epoch then
        Compute {Σ̂_c^{−1}}_{c=1}^K as per Eq. (5)
    end if
end for

Algorithm 2: HetSNGP prediction
Require: test example x*
Φ* = √(2/m) cos(W h(x*) + b)
for s = 1 to S do
    β_c^s ~ p(β_c | D)
    ε_K^s ~ N(0, I_K), ε_R^s ~ N(0, I_R)
    u_{*,c}^s = Φ*⊤ β_c^s + d(x*) ε_K^s + V(x*) ε_R^s
end for
p(y* = c | x*) = (1/S) Σ_{s=1}^S [ exp(u_{*,c}^s / τ) / Σ_{k=1}^K exp(u_{*,k}^s / τ) ]
Predict y* = arg max_c p(y* = c | x*)
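As a rough illustration, the Monte Carlo prediction step of Algorithm 2 can be sketched in a few lines of numpy. All dimensions, the diagonal Gaussian posterior over β, and the variable names are hypothetical stand-ins for illustration only, not the paper's actual implementation (that lives in the edward2 layer linked in the Open Source Code row):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: K classes, R low-rank factors, m random features.
K, R, m, S = 4, 2, 16, 1000
tau = 0.5  # softmax temperature

phi = rng.normal(size=m)               # random-feature embedding Phi(x*)
beta_mean = rng.normal(size=(m, K))    # posterior mean of beta_c
beta_var = 0.01 * np.ones(m)           # illustrative diagonal posterior variance
d = 0.1 * rng.normal(size=K)           # per-class heteroscedastic scale d(x*)
V = 0.1 * rng.normal(size=(K, R))      # low-rank factor V(x*)

probs = np.zeros(K)
for s in range(S):
    # Sample beta from the (assumed diagonal) Gaussian posterior p(beta | D).
    beta_s = beta_mean + rng.normal(size=(m, K)) * np.sqrt(beta_var)[:, None]
    eps_K = rng.normal(size=K)
    eps_R = rng.normal(size=R)
    u = phi @ beta_s + d * eps_K + V @ eps_R  # sampled logits u^s_{*,c}
    z = np.exp(u / tau - np.max(u / tau))     # numerically stable softmax
    probs += z / z.sum()
probs /= S  # average the S per-sample softmax distributions

pred = int(np.argmax(probs))
```

Averaging the per-sample softmax outputs (rather than the logits) is what makes the heteroscedastic noise affect the predictive distribution rather than cancel out in expectation.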
Open Source Code Yes Our implementation of the HetSNGP is available as a layer in edward2 (https://github.com/google/edward2/blob/main/edward2/tensorflow/layers/hetsngp.py) and the experiments are implemented in uncertainty_baselines (e.g., https://github.com/google/uncertainty-baselines/blob/main/baselines/imagenet/hetsngp.py).
Open Datasets Yes To assess our proposed model's predictive performance and uncertainty estimation capabilities, we conducted experiments on synthetic two-moons data (Pedregosa et al., 2011), a mixture of Gaussians, the CIFAR-100 dataset (Krizhevsky & Hinton, 2009), and the ImageNet dataset (Deng et al., 2009). We compare against a standard deterministic ResNet model as a baseline (He et al., 2016), against the heteroscedastic method (Collier et al., 2020; 2021) and the SNGP (Liu et al., 2020) (which form the basis for our combined model), and against the recently proposed Posterior Network model (Charpentier et al., 2020), which also offers distance-aware uncertainties, similarly to the SNGP. We used the same backbone neural network architecture for all models: a fully connected ResNet for the synthetic data, a Wide ResNet-18 on CIFAR, and a ResNet-50 on ImageNet.
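For reference, the two-moons toy data cited above (Pedregosa et al., 2011, i.e. scikit-learn) can be generated in a few lines. This is a self-contained numpy re-implementation of scikit-learn's `make_moons`, shown only to make the synthetic setup concrete; the function name and defaults here are illustrative:

```python
import numpy as np

def make_two_moons(n=200, noise=0.1, seed=0):
    """Two interleaved noisy half-circles, one per class."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, np.pi, n // 2)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)              # class 0
    lower = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)  # class 1
    X = np.concatenate([upper, lower])
    X += rng.normal(scale=noise, size=X.shape)  # i.i.d. Gaussian jitter
    y = np.concatenate([np.zeros(n // 2, dtype=int),
                        np.ones(n // 2, dtype=int)])
    return X, y

X, y = make_two_moons()
```

The region between and beyond the two moons is what makes this dataset a standard visual test for distance-aware uncertainty: a well-calibrated model should become uncertain away from both arcs.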
Dataset Splits Yes We start by assessing our method on a real-world image dataset; we trained it on CIFAR-100 and used CIFAR-10 as a near-OOD dataset and Places365 (Zhou et al., 2017) as far-OOD. We measure the OOD detection performance in terms of area under the receiver operating characteristic curve (AUROC) and the false-positive rate at 95% confidence (FPR95). We also evaluated the methods' generalization performance on corrupted CIFAR-100 (Hendrycks & Dietterich, 2019). A large-scale dataset with natural label noise and established OOD benchmarks is the ImageNet dataset (Deng et al., 2009; Beyer et al., 2020). The heteroscedastic method has been shown to improve in-distribution performance on ImageNet (Collier et al., 2021). We see in Table 3 that HetSNGP outperforms the SNGP in terms of accuracy and likelihood on the in-distribution ImageNet validation set and performs almost on par with the heteroscedastic model. We introduce a new large-scale OOD benchmark based on ImageNet-21k, which we hope will be of interest to future work in the OOD literature. ImageNet-21k is a larger version of the standard ImageNet dataset used above (Deng et al., 2009). It has over 12.8 million training images and 21,843 classes. Each image can have multiple labels, whereas for standard ImageNet a single label is given per image. In creating our benchmark, we exploit the unique property of ImageNet-21k that its label space is a superset of the 1,000 ImageNet classes (class n04399382 is missing). Having trained on the large ImageNet-21k training set, we then evaluate the model on the 1,000 ImageNet classes (setting the predictive probability of class n04399382 to zero). Despite now being in a setting where the model is trained on an order of magnitude more data and more than 21× as many classes, we can use the standard ImageNet OOD datasets. This assesses the scalability of our method and of future OOD methods.
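The two OOD metrics reported above can both be computed from per-example uncertainty scores. A minimal sketch, assuming the convention that higher scores mean "more OOD" and ignoring score ties (the function names and toy score distributions are illustrative, not from the paper):

```python
import numpy as np

def fpr_at_95_tpr(ood_scores, id_scores):
    """FPR95: fraction of in-distribution inputs flagged as OOD at the
    score threshold where 95% of true OOD inputs are detected."""
    thresh = np.quantile(ood_scores, 0.05)  # 95% of OOD scores lie above this
    return float(np.mean(id_scores >= thresh))

def auroc(ood_scores, id_scores):
    """AUROC via the rank-sum identity: the probability that a randomly
    drawn OOD example scores higher than a randomly drawn ID example."""
    scores = np.concatenate([ood_scores, id_scores])
    n_ood, n_id = len(ood_scores), len(id_scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    return float((ranks[:n_ood].sum() - n_ood * (n_ood + 1) / 2) / (n_ood * n_id))

# Toy scores: OOD uncertainties shifted upward relative to ID.
rng = np.random.default_rng(0)
id_s = rng.normal(0.0, 1.0, 1000)
ood_s = rng.normal(2.0, 1.0, 1000)
```

Note the two metrics pull in opposite directions: AUROC is threshold-free, while FPR95 probes one operating point in the high-recall regime, which is why papers typically report both.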
Hardware Specification Yes We implemented all models in TensorFlow in Python and trained on Tensor Processing Units (TPUs) in the Google Cloud. We train all ImageNet-21k models for 90 epochs with batch size 1024 on 8x8 TPU slices. We profiled the runtimes of the different methods and see in Table 10 on ImageNet-21k that the different methods do not differ strongly in their computational costs. In particular, our HetSNGP performs generally on par with the standard heteroscedastic method. Table 10: Profiling of different methods on ImageNet-21k using Vision Transformers (B/16). We report the milliseconds and GFLOPs (= 10^9 FLOPs) per image, both at training and evaluation time. All measurements are made on the same hardware (TPU v3 with 32 cores).
Software Dependencies No Our non-synthetic experiments are developed within the open-source codebases uncertainty_baselines (Nado et al., 2021) and robustness_metrics (Djolonga et al., 2020) (to assess the OOD performance). Implementation details are deferred to Appendix A.1. Our implementation of the HetSNGP is available as a layer in edward2 (https://github.com/google/edward2/blob/main/edward2/tensorflow/layers/hetsngp.py) and the experiments are implemented in uncertainty_baselines (e.g., https://github.com/google/uncertainty-baselines/blob/main/baselines/imagenet/hetsngp.py). We implemented all models in TensorFlow in Python and trained on Tensor Processing Units (TPUs) in the Google Cloud.
Experiment Setup Yes For most baselines, we used the hyperparameters from the uncertainty_baselines library (Nado et al., 2021). On CIFAR, we trained our HetSNGP with a learning rate of 0.1 for 300 epochs and used R = 6 factors for the heteroscedastic covariance, a softmax temperature of τ = 0.5, and S = 5000 Monte Carlo samples. On ImageNet, we trained with a learning rate of 0.07 for 270 epochs and used R = 15 factors, a softmax temperature of τ = 1.25, and S = 5000 Monte Carlo samples. We implemented all models in TensorFlow in Python and trained on Tensor Processing Units (TPUs) in the Google Cloud. We train all ImageNet-21k models for 90 epochs with batch size 1024 on 8x8 TPU slices. We train using the Adam optimizer with an initial learning rate of 0.001, using a linear learning-rate decay schedule with termination point 0.00001 and a warm-up period of 10,000 steps. We train using the sigmoid cross-entropy loss function and L2 weight decay with multiplier 0.03. The heteroscedastic method uses a temperature of 0.4, 1,000 Monte Carlo samples, and R = 50 for the low-rank approximation. HetSNGP has the same heteroscedastic hyperparameters except the optimal temperature is 1.5. For SNGP and HetSNGP, the GP covariance is approximated using the momentum scheme presented in Liu et al. (2020) with momentum parameter 0.999.
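The warm-up-plus-linear-decay schedule described for the ImageNet-21k runs can be sketched as follows. The function name and the exact piecewise form are illustrative; only the rates (0.001 → 0.00001) and the 10,000-step warm-up come from the text, and the actual schedule in uncertainty_baselines may differ in detail:

```python
def learning_rate(step, total_steps, base_lr=1e-3, end_lr=1e-5, warmup_steps=10_000):
    """Linear warm-up from 0 to base_lr over warmup_steps,
    then linear decay from base_lr down to end_lr at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr + frac * (end_lr - base_lr)
```

For example, 90 epochs at batch size 1024 over ImageNet-21k's ~12.8M training images gives roughly 12.8e6 / 1024 = 12,500 steps per epoch, i.e. total_steps ≈ 1,125,000.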