CTBench: A Library and Benchmark for Certified Training
Authors: Yuhao Mao, Stefan Balauca, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that (1) almost all algorithms in CTBENCH surpass the corresponding reported performance in the literature by the magnitude of algorithmic improvements, thus establishing a new state-of-the-art, and (2) the claimed advantage of recent algorithms drops significantly when we enhance the outdated baselines with a fair training schedule, a fair certification method and well-tuned hyperparameters. Based on CTBENCH, we provide new insights into the current state of certified training, including (1) certified models have less fragmented loss surface, (2) certified models share many mistakes, (3) certified models have more sparse activations, (4) reducing regularization cleverly is crucial for certified training especially for large radii and (5) certified training has the potential to improve out-of-distribution generalization. We are confident that CTBENCH will serve as a benchmark and testbed for future research in certified training. |
| Researcher Affiliation | Academia | 1 Department of Computer Science, ETH Zürich, Switzerland 2 INSAIT, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria. Correspondence to: Yuhao Mao <EMAIL>. |
| Pseudocode | No | The paper describes algorithms conceptually in Section 3.2 and provides a complexity analysis in Table 11 but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the complete codebase of CTBENCH, including the implementation of all certified training methods and the model checkpoints for the benchmark. The codebase is available at https://github.com/eth-sri/CTBench. |
| Open Datasets | Yes | We use the MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky et al., 2009) and TINYIMAGENET (Le & Yang, 2015) datasets for our experiments. All are open-source and freely available with unspecified license. |
| Dataset Splits | Yes | We train on the corresponding train set and certify on the validation set, as adopted in the literature (Shi et al., 2021; Müller et al., 2023; Mao et al., 2023; De Palma et al., 2024). |
| Hardware Specification | Yes | We train and certify MNIST ϵ = 0.1, MNIST ϵ = 0.3 and CIFAR-10 ϵ = 8/255 models on a single NVIDIA GeForce RTX 2080 Ti with Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz and 530 GB RAM. We train and certify CIFAR-10 ϵ = 2/255 and TINYIMAGENET ϵ = 1/255 models on a single NVIDIA L4 with Intel(R) Xeon(R) CPU @ 2.20GHz and 377 GB RAM. |
| Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) for optimization, but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Initialization Adversarial training methods are initialized by Kaiming uniform (He et al., 2015), while certified training methods are initialized by IBP initialization (Shi et al., 2021). Training Schedule We mostly follow the training schedule of De Palma et al. (2024), but in some cases use a shorter schedule to reduce cost. Specifically, the warmup phase is 20 epochs for MNIST ϵ = 0.1 and ϵ = 0.3, 80 epochs for CIFAR-10 ϵ = 2/255, 120 epochs for CIFAR-10 ϵ = 8/255 and 80 epochs for TINYIMAGENET ϵ = 1/255. In addition, for CIFAR-10 and TINYIMAGENET, we use standard training for 1 additional epoch at the beginning. We apply the IBP regularization proposed by Shi et al. (2021), with weight 0.5 on MNIST and CIFAR-10, and 0.2 on TINYIMAGENET, during the warmup phase. In total, we train 70 epochs for MNIST ϵ = 0.1 and ϵ = 0.3, 160 epochs for CIFAR-10 ϵ = 2/255, 240 epochs for CIFAR-10 ϵ = 8/255, and 160 epochs for TINYIMAGENET ϵ = 1/255. Optimization We use Adam (Kingma & Ba, 2015) with a learning rate of 0.0005. The learning rate is decayed by a factor of 0.2 at epochs 50 and 60 for MNIST ϵ = 0.1 and ϵ = 0.3, at epochs 120 and 140 for CIFAR-10 ϵ = 2/255, at epochs 200 and 220 for CIFAR-10 ϵ = 8/255, and at epochs 120 and 140 for TINYIMAGENET ϵ = 1/255. We use a batch size of 256 for MNIST, and 128 for CIFAR-10 and TINYIMAGENET. Gradients at each step are clipped to 10 in L2 norm. No weight decay is applied; L1 regularization is applied only to the weights of linear and convolution layers. Further, Wu & Johnson (2021) find that running statistics lag behind the population statistics and propose to use the population statistics for testing. We adopt this strategy in CTBENCH, since it only needs to compute Lnat and is much cheaper than the computation of Lrob. Tuning Scheme Section C.4 provides detailed hyperparameter search spaces and best hyperparameters for each method and setting in Tables 6-10. |
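The optimization details in the Experiment Setup row (step-wise learning-rate decay by a factor of 0.2 at fixed epochs, and gradient clipping to 10 in L2 norm) can be sketched in plain Python. This is a minimal illustrative sketch, not code from the CTBench repository; all names (`lr_at_epoch`, `clip_grad_l2`, the setting keys) are hypothetical labels chosen for this example.

```python
import math

# Base learning rate and decay factor reported in the paper.
BASE_LR = 5e-4
DECAY_FACTOR = 0.2

# Epochs at which the learning rate is decayed, per setting
# (setting keys are illustrative, not from the CTBench codebase).
DECAY_EPOCHS = {
    "mnist_eps0.1": (50, 60),
    "mnist_eps0.3": (50, 60),
    "cifar10_eps2/255": (120, 140),
    "cifar10_eps8/255": (200, 220),
    "tinyimagenet_eps1/255": (120, 140),
}


def lr_at_epoch(setting: str, epoch: int) -> float:
    """Learning rate at a given epoch: multiply by 0.2 at each milestone passed."""
    lr = BASE_LR
    for milestone in DECAY_EPOCHS[setting]:
        if epoch >= milestone:
            lr *= DECAY_FACTOR
    return lr


def clip_grad_l2(grads: list[float], max_norm: float = 10.0) -> list[float]:
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)
```

For example, under this sketch a CIFAR-10 ϵ = 8/255 run trains at 5e-4 until epoch 200, at 1e-4 until epoch 220, and at 2e-5 thereafter, and a gradient of norm 50 is rescaled to norm 10 before the optimizer step.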