Probably Approximately Global Robustness Certification

Authors: Peter Blohm, Patrick Indri, Thomas Gärtner, Sagar Malhotra

ICML 2025

Reproducibility assessment — each entry lists the variable, the assessed result, and the supporting excerpt (LLM response):
Research Type — Experimental. "In this section, we investigate the practical aspects of our theory. We aim to show that our results translate into practical settings and to answer the following research questions: RQ1: How can different methods for checking local robustness be modeled as oracles? RQ2: How well do our guarantees hold on unseen data in realistic, imperfect conditions? RQ3: How does the runtime of our verification procedure scale with network size, parameter choices, and oracle? RQ4: How well can we capture qualitative differences in the behavior of different NNs? We answer these questions by applying our procedure to the certification of NNs for MNIST (Deng, 2012) and CIFAR10 (Krizhevsky, 2009). ... Results: Table 2 reports a summary of our experimental evaluation."
Researcher Affiliation — Academia. "TU Wien, Austria. Correspondence to: Peter Blohm <EMAIL>."
Pseudocode — Yes. "Algorithm 1: Obtain κ-ρ-mapping"
Open Source Code — Yes. "The full code to train, test, and analyze the experiments is available at our repository."
Open Datasets — Yes. "We answer these questions by applying our procedure to the certification of NNs for MNIST (Deng, 2012) and CIFAR10 (Krizhevsky, 2009)."
Dataset Splits — Yes. "Setup: We train four different network architectures [...] on the two classification problems MNIST and CIFAR10 [...]. For each architecture, we train, in a cross-validation setup, five instances of the network with standard training and five instances with TRADES (Zhang et al., 2019). We then use the respective validation split of the data to produce our guarantee: we imitate iid samples from the true data distribution by sampling with Gaussian noise from the validation data. ... We train five instances of each architecture in the manner of 5-fold cross-validation and then use the respective 20% of the training data to sample data points with Gaussian noise added. The network does not see the validation split during training, and our guarantees are obtained only from the data in the validation split. Finally, the test set, unknown both to the network and to our verification procedure, is used to test the generalization of our guarantee."
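The split-and-sample scheme described in this excerpt can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the noise scale `sigma`, the contiguous fold layout, and the function names are assumptions.

```python
import numpy as np

def five_folds(n):
    """5-fold cross-validation indices: each fold's 20% slice serves as
    the validation split, the remaining 80% as the training split.
    (Contiguous folds are an assumption made for brevity.)"""
    idx = np.arange(n)
    return [(np.concatenate([idx[:k * n // 5], idx[(k + 1) * n // 5:]]),  # train
             idx[k * n // 5:(k + 1) * n // 5])                            # validation
            for k in range(5)]

def sample_validation_points(x_val, n_samples, sigma=0.1, seed=0):
    """Imitate iid draws from the data distribution by adding Gaussian
    noise to held-out validation points (sigma is a hypothetical value;
    the excerpt does not quote the noise scale)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(x_val), size=n_samples)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x_val.shape[1:])
    return x_val[idx] + noise
```

The key property the excerpt insists on is preserved here: the points fed to the verification procedure are derived only from the validation slice, never from the training or test data.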
Hardware Specification — Yes. "All the experiments were run on a single desktop machine equipped with an Intel i9-11900KF @ 3.50GHz CPU and an NVIDIA GeForce RTX 3080 GPU."
Software Dependencies — No. The paper mentions software such as PyTorch, MAIR, auto_LiRPA, and Marabou 2.0, but only Marabou carries a specific version number. For the other key components, version numbers are not provided, which a reproducible description of ancillary software requires.
Experiment Setup — Yes. "For the experiments with PGD, we set ϵ = 10^-4, pmin = 0.01, δ = 0.01 and thus sample s(ϵ, δ/2, 2) = 989534 images. ... In our experiments with auto_LiRPA, we choose ϵ = 2.5 × 10^-3, pmin = 0.05 and δ = 0.01. We consequently sample s(ϵ, δ/2, 2) = 31635 images. ... For TRADES, we chose the parameter β = 6. ... For MNIST, we use a step size of 0.5/256 and up to 200 steps to find an adversarial example. For CIFAR-10, we use a similar setup with a smaller step size of 0.1/256 and up to 500 steps."
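As an illustration of how a PGD attack can act as a local-robustness oracle (RQ1), the step-size/max-steps loop quoted above can be sketched for a toy linear classifier. Everything below is an assumption for illustration: the paper runs PGD against full NNs with the step sizes quoted in the excerpt, whereas here the model is sign(w·x + b) so the gradient of the margin is available in closed form.

```python
import numpy as np

def pgd_oracle(x, y, w, b, radius, step, max_steps):
    """Hypothetical PGD-based oracle for the toy linear classifier
    sign(w.x + b): search the l-inf ball of the given radius around x
    for an adversarial example.  Returns True if no adversarial
    example was found (x judged robust), False otherwise."""
    x_adv = x.copy()
    for _ in range(max_steps):
        # Gradient of the margin y*(w.x + b) w.r.t. x is y*w;
        # take a signed step against the margin to flip the prediction.
        x_adv = x_adv - step * np.sign(y * w)
        # Project back onto the l-inf ball of the given radius around x.
        x_adv = np.clip(x_adv, x - radius, x + radius)
        if np.sign(x_adv @ w + b) != y:
            return False  # adversarial example found: not robust
    return True  # attack failed within the budget: empirically robust
```

Note the asymmetry the paper's setup relies on: a returned adversarial example is a certificate of non-robustness, while a failed attack is only empirical evidence of robustness, which is why PGD serves as an (incomplete) oracle inside the probabilistic guarantee.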