Calibrating Deep Ensemble through Functional Variational Inference

Authors: Zhijie Deng, Feng Zhou, Jianfei Chen, Guoqiang Wu, Jun Zhu

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We study the behavior of DE-GP on diverse benchmarks. Empirically, DE-GP outperforms DE and its variants on various regression datasets and presents superior uncertainty estimates and out-of-distribution robustness without compromising accuracy in standard image classification tasks. The proposed approach achieves better uncertainty quantification than DE and its variants across diverse scenarios, while consuming only marginally added training cost compared to standard DE. The code is available at https://github.com/thudzj/DE-GP. 5 Experiments We perform extensive evaluation to demonstrate that DE-GP yields better uncertainty estimates than the baselines, while preserving non-degraded predictive performance. Given that the original Deep Ensemble itself is very performant in terms of accuracy and uncertainty quantification (Ovadia et al., 2019), we mainly focus on comparing to it and its popular variants, including DE, rDE, NN-GP, RMS, etc.
Researcher Affiliation Academia Zhijie Deng EMAIL Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University Feng Zhou EMAIL Center for Applied Statistics and School of Statistics, Renmin University of China Jianfei Chen EMAIL Dept. of Comp. Sci. & Tech., Tsinghua University Guoqiang Wu EMAIL School of Software, Shandong University Jun Zhu EMAIL Dept. of Comp. Sci. & Tech., Tsinghua University
Pseudocode Yes 4.3 Training We outline the training procedure in Algorithm 1, and elaborate some details below.

Algorithm 1 The training of DE-GP
1: Input: D: dataset; {g(·, w_i)}_{i=1}^M: a deep ensemble; k_p: prior kernel; ν: distribution for sampling extra measurement points; U: number of MC samples for estimating the expected log-likelihood
2: while not converged do
3:   Sample a mini-batch D_s = (X_s, Y_s) ⊆ D and measurement points X_ν ∼ ν; set X̃_s = {X_s, X_ν}
4:   g_i^{X̃_s} = g(X̃_s, w_i), i = 1, ..., M
5:   m^{X̃_s} = (1/M) Σ_{i=1}^M g_i^{X̃_s}
6:   k^{X̃_s,X̃_s} = (1/M) Σ_{i=1}^M (g_i^{X̃_s} − m^{X̃_s})(g_i^{X̃_s} − m^{X̃_s})^⊤ + λ I_{|X̃_s|C}
7:   k_p^{X̃_s,X̃_s} = k_p(X̃_s, X̃_s)
8:   L1 = (1/U) Σ_{i=1}^U Σ_{(x,y)∈D_s} log p(y | f_i(x)), where f_i ∼ N(m^{X̃_s}, k^{X̃_s,X̃_s})
9:   L2 = D_KL[N(m^{X̃_s}, k^{X̃_s,X̃_s}) ‖ N(0, k_p^{X̃_s,X̃_s})]
10:  w_i = w_i + η ∇_{w_i}(L1 − αL2), i = 1, ..., M
11: end while
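The core of Algorithm 1 — forming the empirical Gaussian over function values from the M ensemble outputs (lines 5–6) and the KL regularizer against the prior kernel (line 9) — can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the helper names, the toy sizes, and the identity prior kernel are assumptions, and scalar outputs (C = 1) are assumed so the covariance is n × n.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_function_gaussian(outputs, lam=1e-4):
    """Empirical mean and jittered covariance of M ensemble outputs
    evaluated at n measurement points (Algorithm 1, lines 5-6).
    outputs: array of shape (M, n)."""
    M, n = outputs.shape
    m = outputs.mean(axis=0)
    centered = outputs - m
    k = centered.T @ centered / M + lam * np.eye(n)  # sum of outer products / M
    return m, k

def gaussian_kl(m, k, kp):
    """KL[N(m, k) || N(0, kp)] between multivariate Gaussians
    (Algorithm 1, line 9)."""
    n = len(m)
    kp_inv = np.linalg.inv(kp)
    return 0.5 * (np.trace(kp_inv @ k) + m @ kp_inv @ m - n
                  + np.linalg.slogdet(kp)[1] - np.linalg.slogdet(k)[1])

# Toy usage: 5 ensemble members evaluated at 3 measurement points.
outputs = rng.normal(size=(5, 3))
m, k = empirical_function_gaussian(outputs)
kp = np.eye(3)  # assumed prior kernel matrix at the same points
kl = gaussian_kl(m, k, kp)
```

In a full training step this KL would be scaled by α and subtracted from the Monte Carlo log-likelihood estimate before backpropagating into each member's weights.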
Open Source Code Yes The code is available at https://github.com/thudzj/DE-GP.
Open Datasets Yes We study the behavior of DE-GP on diverse benchmarks. Empirically, DE-GP outperforms DE and its variants on various regression datasets and presents superior uncertainty estimates and out-of-distribution robustness without compromising accuracy in standard image classification tasks. 5.2 UCI Regression We then assess DE-GP on 5 UCI real-valued regression problems. 5.3 Classification on Fashion-MNIST We use a widened LeNet5 architecture with batch normalizations (BNs) (Ioffe & Szegedy, 2015) for the Fashion-MNIST dataset (Xiao et al., 2017). 5.4 Classification on CIFAR-10 Next, we apply DE-GP to the real-world image classification task CIFAR-10 (Krizhevsky et al., 2009). As suggested by Ovadia et al. (2019); He et al. (2020), we train models on CIFAR-10, and test them on the combination of CIFAR-10 and SVHN test sets. This is a standard benchmark for evaluating the uncertainty of OOD data. We further test the trained methods on CIFAR-10 corruptions (Hendrycks & Dietterich, 2018), a challenging OOD generalization/robustness benchmark for deep models. A.3.4 Results on CIFAR-100 We further perform experiments on the more challenging CIFAR-100 benchmark. A.3.5 Results on Tiny ImageNet We conduct experiments on Tiny ImageNet (mnmoustafa, 2017) and here are some core results (we ensemble 5 members with ResNet-18 architecture and perform training for 30 epochs with an SGD optimizer):
Dataset Splits Yes We perform cross validation with 5 splits. Fig. 2 shows the results. We build a regression problem with 8 data points from y = sin 2x + ϵ, ϵ ∼ N(0, 0.2) as shown in Fig. 1. We split the data as the training set, validation set, and test set of size 45000, 5000, and 10000, respectively.
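The 45000/5000/10000 split quoted above matches holding out 5000 of CIFAR-10's 50000 training images for validation while keeping the standard 10000-image test set untouched. A minimal index-level sketch (the seed and variable names are illustrative, not from the paper):

```python
import numpy as np

# CIFAR-10 ships with 50,000 training and 10,000 test images.
# Hold out 5,000 training images for validation; 45,000 remain for training.
rng = np.random.default_rng(0)  # illustrative seed
perm = rng.permutation(50_000)
train_idx, val_idx = perm[:45_000], perm[45_000:]
```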
Hardware Specification No No specific hardware details are provided in the paper. The paper mentions architectural choices (e.g., ResNet-20), but not the actual hardware used for experiments.
Software Dependencies No The paper mentions using "GenRL library" but does not provide a version number. Other software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) and their versions are not explicitly listed.
Experiment Setup Yes Illustrative regression. For the problem on y = sin 2x + ϵ, ϵ ∼ N(0, 0.1), we randomly sample 8 data points from [-1.5, 1.5]. We add 1.2 to the target value of the rightmost data point to introduce strong data noise. For optimizing the ensemble members, we use an SGD optimizer with 0.9 momentum and 0.001 learning rate. The learning rate follows a cosine decay schedule. The optimization takes 1000 iterations. The extra measurement points are uniformly sampled from [-2, 2]. The regularization constant λ is set as 1e-4 times the average eigenvalue of the central covariance matrices. UCI regression. We pre-process the UCI data by standard normalization. We set the variance for data noise and the weight variance for the prior kernel following (Pearce et al., 2020). The batch size for stochastic training is 256. We use an Adam optimizer to optimize for 1000 epochs. The learning rate is initialized as 0.01 and decays by 0.99 every 5 epochs. Fashion-MNIST classification. The used architecture is Conv(32, 3, 1)-BN-ReLU-MaxPool(2)-Conv(64, 3, 0)-BN-ReLU-MaxPool(2)-Linear(256)-ReLU-Linear(10), where Conv(x, y, z) represents a 2D convolution with x output channels, kernel size y, and padding z. The batch size for training data is 64. We do not use extra measurement points here. We use an SGD optimizer to optimize for 24 epochs. The learning rate is initialized as 0.1 and follows a cosine decay schedule. We use an Adam optimizer with 1e-3 learning rate to optimize the temperature. CIFAR-10 classification. We perform data augmentation including random horizontal flip and random crop. The batch size for training data is 128. We do not use extra measurement points here. We use an SGD optimizer with 0.9 momentum to optimize for 200 epochs. The learning rate is initialized as 0.1 and decays by 0.1 at the 100th and 150th epochs. We use an Adam optimizer with 1e-3 learning rate to optimize the temperature.
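Two recurring hyperparameter rules in the setup above — the cosine learning-rate decay and setting λ to 1e-4 times the average eigenvalue of the covariance matrix — can be written out as a short sketch. The function names are illustrative, not from the paper's code:

```python
import math
import numpy as np

def cosine_lr(step, total_steps, base_lr=0.001):
    """Cosine decay: base_lr at step 0, decaying smoothly to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def jitter_lambda(cov, scale=1e-4):
    """Set lambda to `scale` times the average eigenvalue of `cov`,
    i.e. scale * trace(cov) / n, since the trace equals the eigenvalue sum."""
    return scale * np.trace(cov) / cov.shape[0]

lr_start = cosine_lr(0, 1000)    # equals base_lr at the first iteration
lr_end = cosine_lr(1000, 1000)   # decays to ~0 at the last iteration
lam = jitter_lambda(np.eye(4))   # identity has average eigenvalue 1
```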