Attacking Bayes: On the Adversarial Robustness of Bayesian Neural Networks

Authors: Yunzhen Feng, Tim G. J. Rudner, Nikolaos Tsilivis, Julia Kempe

TMLR 2024

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
  "To study the adversarial robustness of BNNs, we investigate whether it is possible to successfully break state-of-the-art BNN inference methods and prediction pipelines using even relatively unsophisticated attacks for three tasks: (1) label prediction under the posterior predictive mean, (2) adversarial example detection with Bayesian predictive uncertainty, and (3) semantic shift detection. We find that BNNs trained with state-of-the-art approximate inference methods, and even BNNs trained with Hamiltonian Monte Carlo, are highly susceptible to adversarial attacks. ... We conduct thorough evaluations of BNNs trained with well-established and state-of-the-art approximate inference methods (HMC, Neal (2010); PSVI, Blundell et al. (2015); MCD, Gal and Ghahramani (2016); FSVI, Rudner et al. (2022b)) on benchmarking tasks such as MNIST, Fashion MNIST, and CIFAR-10."
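The "relatively unsophisticated attacks" on task (1) amount to gradient attacks against the posterior predictive mean. A minimal FGSM-style sketch on a toy linear Bayesian classifier is below; all names are illustrative, and averaging per-draw cross-entropy gradients over posterior samples is a common expectation-over-models approximation, not necessarily the paper's exact attack:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predictive_mean(x, weight_samples):
    """Posterior predictive mean: average softmax output over posterior draws."""
    return np.mean([softmax(W @ x) for W in weight_samples], axis=0)

def fgsm_predictive_mean(x, y, weight_samples, eps):
    """One FGSM step against the predictive mean for linear logits W @ x,
    averaging per-draw cross-entropy gradients over posterior samples."""
    grad = np.zeros_like(x)
    for W in weight_samples:
        p = softmax(W @ x)
        dlogits = p.copy()
        dlogits[y] -= 1.0                      # grad of CE w.r.t. logits for this draw
        grad += W.T @ dlogits / len(weight_samples)
    # Ascend the loss within an eps-ball, keeping pixels in [0, 1].
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

With a single posterior draw this reduces to ordinary FGSM; with many draws it targets the Bayesian model average rather than any individual network.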
Researcher Affiliation: Academia
  Yunzhen Feng, Tim G. J. Rudner, Nikolaos Tsilivis, and Julia Kempe, all affiliated with New York University.
Pseudocode: No
  The paper describes its methods in text and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks (e.g., sections titled "Pseudocode" or "Algorithm").
Open Source Code: Yes
  "Reproducibility. Code to reproduce our results can be found at https://github.com/timrudner/attacking-bayes"
Open Datasets: Yes
  "We conduct thorough evaluations of BNNs trained with well-established and state-of-the-art approximate inference methods (HMC, Neal (2010); PSVI, Blundell et al. (2015); MCD, Gal and Ghahramani (2016); FSVI, Rudner et al. (2022b)) on benchmarking tasks such as MNIST, Fashion MNIST, and CIFAR-10. ... Our semantic shift datasets for MNIST, Fashion MNIST and CIFAR-10 are Fashion MNIST, MNIST, and SVHN, respectively, each of them giving zero accuracy."
Dataset Splits: Yes
  "We evaluate all AE detectors on test data consisting of 50% clean samples and 50% adversarially perturbed samples, using total uncertainty for the rejection as described in Section 2.2. ... Our semantic shift datasets for MNIST, Fashion MNIST and CIFAR-10 are Fashion MNIST, MNIST, and SVHN, respectively, each of them giving zero accuracy. The test set contains half in-distribution (ID) and half semantically-shifted out-of-distribution (OOD) samples, hence selective accuracy curves start at 50% accuracy. ... To optimize GPU memory usage, we use 10,000 training and 5,000 validation samples from the MNIST dataset."
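The 50/50 ID/OOD composition is why selective accuracy curves start at 50%: the OOD half contributes zero accuracy by construction, and rejecting by uncertainty can only recover ID points. A small sketch with synthetic data (all names and numbers here are illustrative, not the paper's):

```python
import numpy as np

def selective_accuracy(correct, uncertainty, fractions):
    """Accuracy over the retained fraction of most-certain predictions."""
    order = np.argsort(uncertainty)              # most certain first
    correct = np.asarray(correct, dtype=float)[order]
    return [correct[: max(1, int(f * len(correct)))].mean() for f in fractions]

# Synthetic 50/50 test set: 100 in-distribution points (assume all classified
# correctly) and 100 semantically shifted OOD points (zero accuracy by
# construction, as for MNIST vs. Fashion MNIST in the paper's setup).
rng = np.random.default_rng(0)
correct = np.concatenate([np.ones(100), np.zeros(100)])
# Idealized uncertainty: strictly lower on every ID sample than on any OOD one.
uncertainty = np.concatenate([rng.uniform(0.0, 0.4, 100),
                              rng.uniform(0.6, 1.0, 100)])
curve = selective_accuracy(correct, uncertainty, [0.5, 1.0])
# Retaining the most certain half keeps only ID points; retaining everything
# yields the 50% floor the paper describes.
```

An attack on the uncertainty estimate would push the OOD samples' uncertainty below the ID samples', flattening this curve toward 50% at every retention fraction.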
Hardware Specification: No
  "This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise." The paper acknowledges generic high-performance computing resources but does not provide specific hardware details such as GPU/CPU models, memory, or processor types.
Software Dependencies: No
  "We implement hmc using the Hamiltorch package from Cobb and Jalaian (2021)..." The paper names the Hamiltorch package and implies PyTorch (torch.nn.CrossEntropyLoss) and TensorFlow/Keras (tf.nn.sparse_softmax_cross_entropy_with_logits, keras.models.Model), but it does not specify version numbers for any of these software components.
Experiment Setup: Yes
  "All hyperparameter details can be found in Appendix E. ... The hyperparameters for FSVI, PSVI, and MCD are shown in Table 7, Table 8, and Table 9. ... For HMC, we train the model for 20 steps with 0.001 as the step size." Tables 7, 8, and 9 provide specific values for Prior Var, Prior Mean, Epochs, Batch Size, Context Batch Size, Learning Rate, Momentum, Weight Decay, Alpha, Reg Scale, and Dropout Rate across methods and datasets.
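The reported HMC settings (20 steps at step size 0.001) are the parameters of the leapfrog integrator at the core of each HMC proposal. A generic leapfrog sketch under those settings follows; this is an illustration of the integrator, not the paper's Hamiltorch implementation:

```python
import numpy as np

def leapfrog(theta, p, grad_log_post, step_size=0.001, num_steps=20):
    """Leapfrog integration of Hamiltonian dynamics for one HMC proposal.
    Defaults mirror the reported setting of 20 steps at step size 0.001."""
    theta, p = theta.copy(), p.copy()
    p = p + 0.5 * step_size * grad_log_post(theta)       # initial half momentum step
    for _ in range(num_steps - 1):
        theta = theta + step_size * p                    # full position step
        p = p + step_size * grad_log_post(theta)         # full momentum step
    theta = theta + step_size * p
    p = p + 0.5 * step_size * grad_log_post(theta)       # final half momentum step
    return theta, p
```

For a standard Gaussian posterior (`grad_log_post = lambda t: -t`), the Hamiltonian `0.5 * (p**2 + theta**2)` is conserved up to O(step_size**2) error, which is what keeps HMC's Metropolis acceptance rate high at small step sizes like 0.001.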