SoundnessBench: A Soundness Benchmark for Neural Network Verifiers

Authors: Xingjian Zhou, Keyi Shen, Andy Xu, Hongji Xu, Cho-Jui Hsieh, Huan Zhang, Zhouxing Shi

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our training effectively produces hidden counterexamples and our SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers. Our code is available at https://github.com/mvp-harry/SoundnessBench and our dataset is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench. In this section, we conduct comprehensive experiments.
Researcher Affiliation | Academia | Xingjian Zhou (University of California, Los Angeles); Keyi Shen (Georgia Institute of Technology); Andy Xu (University of California, Los Angeles); Hongji Xu (Duke University); Cho-Jui Hsieh (University of California, Los Angeles); Huan Zhang (University of Illinois Urbana-Champaign); Zhouxing Shi (University of California, Riverside)
Pseudocode | Yes | A. Illustration for the Training Algorithm... In Algorithm 1, we illustrate our algorithm for training models with hidden counterexamples, as proposed in Section 3.4. (Algorithm 1: Training Models with Hidden Counterexamples)
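The row above points at Algorithm 1 but does not reproduce it. A rough pseudocode sketch, assuming only the ingredients quoted elsewhere in this review (a pre-defined counterexample per unverifiable instance, a margin objective with threshold λ, Adam with a cyclic learning rate, and PGD-based adversarial training), might look like:

```
# Hypothetical sketch, NOT the paper's actual Algorithm 1;
# names and the loss combination are assumptions.
for epoch in 1..5000:
    for each unverifiable instance (x_0, y) with planted (x_CEX, y_CEX):
        x_adv   <- PGD_attack(model, x_0, y)           # search the perturbation ball
        L_robust <- loss(model(x_adv), y)              # attacks should fail elsewhere
        L_plant  <- margin_loss(model(x_CEX), y_CEX, λ = 0.01)  # keep the CEX in place
        update model with Adam on (L_robust + L_plant), using the cyclic learning rate
```

The tension between the two terms is what makes the counterexample "hidden": the model must misclassify exactly at x_CEX while remaining empirically robust everywhere an attacker is likely to search.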
Open Source Code | Yes | Our code is available at https://github.com/mvp-harry/SoundnessBench
Open Datasets | Yes | Our code is available at https://github.com/mvp-harry/SoundnessBench and our dataset is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench. ... Additionally, in Section 4.4, we conduct an ablation study to justify the design choices for our training. We also provide additional results in the appendix about testing NN verifiers with synthetic bugs (Appendix C), repeating our experiments with multiple random seeds (Appendix D), and experimenting on MNIST (Appendix E).
Dataset Splits | Yes | For each setting, we construct a dataset D = D_unv ∪ D_regular consisting of n unverifiable instances and n regular instances, respectively. ... For each model with a particular perturbation radius and input size, we set n = 10 as the number of instances and generate 10 unverifiable instances (x_0^(unv,i), y^(unv,i)) ∈ D_unv (1 ≤ i ≤ n), each with a corresponding pre-defined counterexample (x_CEX^(i), y_CEX^(i)), and 10 regular instances (x_0^(regular,i), y^(regular,i)) ∈ D_regular (1 ≤ i ≤ n).
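The construction quoted above (n = 10 unverifiable instances, each paired with a pre-defined counterexample, plus 10 regular instances) can be mirrored in a small container. The class and field names here are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Instance:
    x0: Tuple[float, ...]            # nominal input x_0
    y: int                           # ground-truth label
    x_cex: Optional[Tuple[float, ...]] = None  # planted counterexample (unverifiable only)
    y_cex: Optional[int] = None                # label the model outputs at x_cex

@dataclass
class SoundnessDataset:
    """D = D_unv ∪ D_regular, with n instances of each kind."""
    unverifiable: List[Instance] = field(default_factory=list)
    regular: List[Instance] = field(default_factory=list)

def build_dataset(unverifiable: List[Instance],
                  regular: List[Instance],
                  n: int = 10) -> SoundnessDataset:
    # Enforce the quoted structure: n of each kind, and every
    # unverifiable instance carries a pre-defined counterexample.
    assert len(unverifiable) == n and len(regular) == n
    assert all(inst.x_cex is not None for inst in unverifiable)
    return SoundnessDataset(list(unverifiable), list(regular))
```

A verifier that claims robustness for an unverifiable instance is unsound by construction, since `x_cex` witnesses the violation.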
Hardware Specification | No | The paper does not specify the hardware used for its experiments (GPU/CPU models, processor speeds, or memory amounts); it only describes the experimental setup at a higher level, covering training parameters and verifier types.
Software Dependencies | No | The paper mentions the Adam optimizer, the PGD attack, the auto_LiRPA package, and specific verifier versions by VNN-COMP branch names (e.g., 'Marabou 2023', 'Marabou 2024'), but it does not give version numbers for core software dependencies such as Python or PyTorch, which are needed for full reproducibility.
Experiment Setup | Yes | For every model in the benchmark, we apply our aforementioned training framework with the following hyperparameters: (1) Number of training epochs is set to 5000; (2) Adam optimizer (Kingma & Ba, 2015) with a cyclic learning rate schedule is used, where the learning rate gradually increases from 0 to a peak value during the first half of the training, and then it gradually decreases to 0 during the second half of the training, with the peak learning rate set to 0.001; (3) The threshold in the margin objective defined in Eq. (10) is set to λ = 0.01; (4) The size of the perturbation sliding window used in Eq. (11) is set to w = 300; (5) We use the PGD attack (Madry et al., 2018) for the adversarial training with up to 150 restarts (number of different random initialization for the PGD attack) and 150 attack steps.
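Two of the quoted ingredients are concrete enough to sketch: the triangular cyclic learning rate (0 → 0.001 → 0 over training) and a multi-restart PGD attack (150 restarts, 150 steps). This is a minimal sketch under assumptions; the step-size heuristic `2.5 * eps / steps` and the exact schedule shape are common conventions, not taken from the paper's code:

```python
import numpy as np

def cyclic_lr(step: int, total_steps: int, peak: float = 1e-3) -> float:
    """Triangular schedule: 0 -> peak over the first half, peak -> 0 over the second."""
    half = total_steps / 2
    if step <= half:
        return peak * step / half
    return peak * (total_steps - step) / half

def pgd_attack(loss_fn, grad_fn, x0: np.ndarray, eps: float,
               steps: int = 150, restarts: int = 150, seed: int = 0) -> np.ndarray:
    """Maximize loss_fn via projected gradient ascent in the L-inf ball around x0.

    Each restart begins from a random point in the ball; the best attack
    across restarts is returned.
    """
    rng = np.random.default_rng(seed)
    alpha = 2.5 * eps / steps            # step-size heuristic (assumption)
    best_x, best_val = x0.copy(), loss_fn(x0)
    for _ in range(restarts):
        x = x0 + rng.uniform(-eps, eps, size=x0.shape)   # random restart
        for _ in range(steps):
            # Signed-gradient ascent step, then projection back onto the ball.
            x = np.clip(x + alpha * np.sign(grad_fn(x)), x0 - eps, x0 + eps)
        val = loss_fn(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```

In the paper's setting the attack serves adversarial training: if PGD cannot reach the planted counterexample from any restart, the counterexample stays hidden from attack-based falsifiers.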