CTBench: A Library and Benchmark for Certified Training
Authors: Yuhao Mao, Stefan Balauca, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that (1) almost all algorithms in CTBENCH surpass the corresponding reported performance in the literature by the magnitude of algorithmic improvements, thus establishing a new state-of-the-art, and (2) the claimed advantage of recent algorithms drops significantly when we enhance the outdated baselines with a fair training schedule, a fair certification method and well-tuned hyperparameters. Based on CTBENCH, we provide new insights into the current state of certified training, including (1) certified models have less fragmented loss surface, (2) certified models share many mistakes, (3) certified models have more sparse activations, (4) reducing regularization cleverly is crucial for certified training especially for large radii and (5) certified training has the potential to improve out-of-distribution generalization. We are confident that CTBENCH will serve as a benchmark and testbed for future research in certified training. |
| Researcher Affiliation | Academia | 1 Department of Computer Science, ETH Zürich, Switzerland 2 INSAIT, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria. Correspondence to: Yuhao Mao <EMAIL>. |
| Pseudocode | No | The paper describes algorithms conceptually in Section 3.2 and provides a complexity analysis in Table 11 but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the complete codebase of CTBENCH, including the implementation of all certified training methods and the model checkpoints for the benchmark. The codebase is available at https://github.com/eth-sri/CTBench. |
| Open Datasets | Yes | We use the MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky et al., 2009) and TINYIMAGENET (Le & Yang, 2015) datasets for our experiments. All are open-source and freely available with unspecified license. |
| Dataset Splits | Yes | We train on the corresponding train set and certify on the validation set, as adopted in the literature (Shi et al., 2021; Müller et al., 2023; Mao et al., 2023; De Palma et al., 2024). |
| Hardware Specification | Yes | We train and certify MNIST ϵ = 0.1, MNIST ϵ = 0.3 and CIFAR-10 ϵ = 8/255 models on a single NVIDIA GeForce RTX 2080 Ti with Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz and 530 GB RAM. We train and certify CIFAR-10 ϵ = 2/255 and TINYIMAGENET ϵ = 1/255 models on a single NVIDIA L4 with Intel(R) Xeon(R) CPU @ 2.20GHz and 377 GB RAM. |
| Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) for optimization, but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Initialization Adversarial training methods are initialized by Kaiming uniform (He et al., 2015), while certified training methods are initialized by IBP initialization (Shi et al., 2021). Training Schedule We mostly follow the training schedule of De Palma et al. (2024), but in some cases use a shorter schedule to reduce cost. Specifically, the warmup phase is 20 epochs for MNIST ϵ = 0.1 and ϵ = 0.3, 80 epochs for CIFAR-10 ϵ = 2/255, 120 epochs for CIFAR-10 ϵ = 8/255 and 80 epochs for TINYIMAGENET ϵ = 1/255. In addition, for CIFAR-10 and TINYIMAGENET, we use standard training for 1 additional epoch at the beginning. We apply the IBP regularization proposed by Shi et al. (2021), with weight 0.5 on MNIST and CIFAR-10, and 0.2 on TINYIMAGENET, during the warmup phase. In total, we train 70 epochs for MNIST ϵ = 0.1 and ϵ = 0.3, 160 epochs for CIFAR-10 ϵ = 2/255, 240 epochs for CIFAR-10 ϵ = 8/255, and 160 epochs for TINYIMAGENET ϵ = 1/255. Optimization We use Adam (Kingma & Ba, 2015) with a learning rate of 0.0005. The learning rate is decayed by a factor of 0.2 at epochs 50 and 60 for MNIST ϵ = 0.1 and ϵ = 0.3, at epochs 120 and 140 for CIFAR-10 ϵ = 2/255, at epochs 200 and 220 for CIFAR-10 ϵ = 8/255, and at epochs 120 and 140 for TINYIMAGENET ϵ = 1/255. We use a batch size of 256 for MNIST, and 128 for CIFAR-10 and TINYIMAGENET. Gradients at each step are clipped to 10 in L2 norm. No weight decay is applied; L1 regularization is applied only to the weights of linear and convolution layers. Further, Wu & Johnson (2021) find that running statistics lag behind the population statistics and propose to use the population statistics for testing. We adopt this strategy in CTBENCH, since it only needs to compute Lnat and is much cheaper than the computation of Lrob. Tuning Scheme Section C.4 provides detailed hyperparameter search spaces and best hyperparameters for each method and setting in Tables 6-10. |
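The optimization details in the Experiment Setup row (step-wise learning-rate decay by a factor of 0.2 at fixed epochs, and gradient clipping to 10 in L2 norm) can be sketched in plain Python. This is a minimal illustrative sketch, not code from the CTBench repository; all names (`lr_at_epoch`, `clip_grad_l2`, the setting keys) are hypothetical labels chosen for this example.

```python
import math

# Base learning rate and decay factor reported in the paper.
BASE_LR = 5e-4
DECAY_FACTOR = 0.2

# Epochs at which the learning rate is decayed, per setting
# (setting keys are illustrative, not from the CTBench codebase).
DECAY_EPOCHS = {
    "mnist_eps0.1": (50, 60),
    "mnist_eps0.3": (50, 60),
    "cifar10_eps2/255": (120, 140),
    "cifar10_eps8/255": (200, 220),
    "tinyimagenet_eps1/255": (120, 140),
}


def lr_at_epoch(setting: str, epoch: int) -> float:
    """Learning rate at a given epoch: multiply by 0.2 at each milestone passed."""
    lr = BASE_LR
    for milestone in DECAY_EPOCHS[setting]:
        if epoch >= milestone:
            lr *= DECAY_FACTOR
    return lr


def clip_grad_l2(grads: list[float], max_norm: float = 10.0) -> list[float]:
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)
```

For example, under this sketch a CIFAR-10 ϵ = 8/255 run trains at 5e-4 until epoch 200, at 1e-4 until epoch 220, and at 2e-5 thereafter, and a gradient of norm 50 is rescaled to norm 10 before the optimizer step.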