Robustness between the worst and average case
Authors: Leslie Rice, Anna Bair, Huan Zhang, J. Zico Kolter
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our approach provides substantially better estimates than simple random sampling of the actual intermediate-q robustness of standard, data-augmented, and adversarially-trained classifiers, illustrating a clear tradeoff between classifiers that optimize different metrics. |
| Researcher Affiliation | Collaboration | Leslie Rice, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, EMAIL; Anna Bair, Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, EMAIL; Huan Zhang, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, EMAIL; J. Zico Kolter, Department of Computer Science, Carnegie Mellon University & Bosch Center for Artificial Intelligence, Pittsburgh, PA, EMAIL |
| Pseudocode | Yes | Algorithm 1: Evaluating the intermediate-q robustness of a neural network function h using path sampling estimation with m MCMC samples, with (x, y) ∼ D, for some norm q. |
| Open Source Code | Yes | Code for reproducing experiments can be found at https://github.com/locuslab/intermediate_robustness. |
| Open Datasets | Yes | All of our experiments are either run on the MNIST dataset [LeCun et al., 1998] or the CIFAR-10 dataset [Krizhevsky et al., 2009]. |
| Dataset Splits | No | The paper mentions using MNIST and CIFAR-10 datasets for experiments but does not provide specific details on training, validation, and test splits (e.g., percentages, sample counts, or explicit mention of a validation set for hyperparameter tuning). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python version, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries used in the implementation. |
| Experiment Setup | Yes | On MNIST, ẐMC is computed with m = 2000, ẐPS+HMC with m = 100 and L = 20, and Adv. loss corresponds to PGD with 100 iterations. On CIFAR-10, ẐMC is computed with m = 500, ẐPS+HMC with m = 50 and L = 10, and Adv. loss corresponds to PGD with 50 iterations and 10 restarts. For the MC estimate computed during training, we use m = 50 samples, whereas for the PS+HMC estimate we use m = 25 samples with L = 2 leapfrog steps. |
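The simple-random-sampling baseline (ẐMC) that the paper's method is compared against can be illustrated with a short sketch. The intermediate-q robustness interpolates between average-case (q = 1) and worst-case (q → ∞) loss; a plain Monte Carlo estimate draws m uniform perturbations from the threat-model ball and takes the q-power mean of the losses. The helper name `mc_intermediate_q_robustness`, the ℓ∞ ball, and the `loss_fn` signature are illustrative assumptions here, not the authors' implementation.

```python
import numpy as np

def mc_intermediate_q_robustness(loss_fn, x, y, eps, q, m=2000, seed=None):
    """Plain Monte Carlo estimate of intermediate-q robustness
    (q-power mean of the loss over uniform perturbations in an
    l-infinity ball of radius eps).

    Hypothetical helper for illustration, not the paper's code:
    loss_fn(x_perturbed, y) is assumed to return a scalar loss.
    """
    rng = np.random.default_rng(seed)
    losses = np.empty(m)
    for i in range(m):
        # Sample a perturbation uniformly from the l-inf ball of radius eps.
        delta = rng.uniform(-eps, eps, size=np.shape(x))
        losses[i] = loss_fn(x + delta, y)
    # q-power mean: recovers the average loss at q = 1 and tends
    # toward the maximum loss as q grows.
    return np.mean(losses ** q) ** (1.0 / q)
```

For large q this naive estimator has high variance, since the q-power mean is dominated by rare high-loss perturbations; that variance gap is what motivates the paper's path-sampling estimator.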
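The L reported in the setup row is the number of leapfrog steps per Hamiltonian Monte Carlo transition used inside the PS+HMC estimator. A minimal sketch of one such transition, in generic textbook form (Neal, 2011) rather than the authors' exact sampler over the perturbation space, looks like this; `U` and `grad_U` stand for the negative log-density of the tempered target and its gradient, both assumptions of this sketch:

```python
import numpy as np

def hmc_step(U, grad_U, delta, step_size, L, rng):
    """One HMC transition with L leapfrog steps targeting exp(-U(delta)).

    Generic textbook HMC, not the paper's implementation: U returns a
    scalar potential, grad_U its gradient with the same shape as delta.
    """
    p0 = rng.standard_normal(delta.shape)   # fresh Gaussian momentum
    d, p = delta.copy(), p0.copy()
    p = p - 0.5 * step_size * grad_U(d)     # initial half step for momentum
    for i in range(L):
        d = d + step_size * p               # full step for position
        if i < L - 1:
            p = p - step_size * grad_U(d)   # full step for momentum
    p = p - 0.5 * step_size * grad_U(d)     # final half step for momentum
    # Metropolis correction: accept or reject based on the change in
    # total energy, which keeps the target distribution exact.
    h_old = U(delta) + 0.5 * np.sum(p0 ** 2)
    h_new = U(d) + 0.5 * np.sum(p ** 2)
    if rng.uniform() < np.exp(h_old - h_new):
        return d
    return delta
```

In the paper's setting each sampler update would additionally clip or reject proposals to stay inside the perturbation ball; small L (e.g. the L = 2 used during training) trades mixing speed for fewer gradient evaluations per sample.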