Average Certified Radius is a Poor Metric for Randomized Smoothing
Authors: Chenhao Sun, Yuhao Mao, Mark Niklas Mueller, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we confirm that existing training strategies, though improving ACR, consistently reduce the model's robustness on hard samples. To strengthen our findings, we propose strategies, including explicitly discarding hard samples, reweighting the dataset with approximate certified radius, and extreme optimization for easy samples, to replicate the progress in RS training and even achieve the state-of-the-art ACR on CIFAR-10, without training for robustness on the full data distribution. Overall, our results suggest that ACR has introduced a strong undesired bias to the field, and its application should be discontinued in RS. Finally, we suggest using the empirical distribution of p_A, the accuracy of the base model on noisy data, as an alternative metric for RS. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, ETH Zurich, Switzerland 2LogicStar.ai. Correspondence to: <EMAIL, EMAIL, EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adaptive Attack: function ADAPTIVEADV(f, x, c, δ, T, ϵ); δ′ ← δ; for t = 1 to T do: if f(x + δ′) ≠ c then break end if; δ′ ← one-step PGD attack on δ′ with step size ϵ; δ′ ← ‖δ‖₂ · δ′ / ‖δ′‖₂; end for; return δ′; end function |
| Open Source Code | Yes | Code and models are available at https://github.com/eth-sri/acr-weakness. |
| Open Datasets | Yes | With these simple modifications to Gaussian training, we achieve the new SOTA in ACR on CIFAR-10 and a competitive ACR on IMAGENET without targeting robustness for the general case (Sections 5 and 6). |
| Dataset Splits | Yes | We follow the standard training protocols in previous works. Specifically, we use ResNet-110 (He et al., 2016) on CIFAR-10 and ResNet-50 on IMAGENET, respectively. |
| Hardware Specification | No | The paper mentions training models and procedures but does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions algorithms like SGD, PGD, and references a certification function from a prior work (Cohen et al., 2019), but does not specify software names with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | For CIFAR-10, we train ResNet-110 for 150 epochs with an initial learning rate of 0.1, decreased by a factor of 10 every 50 epochs. The initial learning rate for training ResNet-50 on IMAGENET is 0.1, decreased by a factor of 10 every 30 epochs within 90 epochs. We investigate three noise levels, σ = 0.25, 0.5, 1.0. We use SGD with momentum 0.9. The following hyperparameters are tuned for each setting: m = the number of noise samples for each input; Et = the epoch at which to discard hard data samples; pt = the threshold on p_A for discarding hard data samples; T = the maximum number of steps of the PGD attack; ϵ = the step size of the PGD attack. All other hyperparameters are fixed to their default values. Specifically, for CIFAR-10 we use 100 noise samples to calculate p_A when discarding hard samples. When updating dataset weights, we use 16 noise samples to calculate p_A, with pmin = 0.75. The hyperparameters tuned on CIFAR-10 are shown in Table 6. For IMAGENET, we skip the Data Reweighting with Certified Radius step for faster training, and use a Gaussian-pretrained ResNet-50 to evaluate p_A with 50 noise samples. The hyperparameters tuned on IMAGENET are shown in Table 7. |
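The Adaptive Attack pseudocode in the table can be sketched in plain NumPy. This is a minimal sketch, not the paper's implementation: `grad_loss` and the L2-normalized gradient ascent step are assumptions standing in for the paper's "one-step PGD attack", and the final line projects the iterate back onto the L2 sphere of radius ‖δ‖₂, as in the pseudocode.

```python
import numpy as np

def adaptive_adv(f, grad_loss, x, c, delta, T, eps):
    """Sketch of Algorithm 1 (Adaptive Attack).

    f:         classifier returning a predicted label
    grad_loss: gradient of the loss w.r.t. the input (assumed helper)
    delta:     initial perturbation; its L2 norm fixes the sphere
               every iterate is projected back onto
    """
    radius = np.linalg.norm(delta)
    d = delta.copy()
    for _ in range(T):
        if f(x + d) != c:          # already misclassified: adversarial found
            break
        g = grad_loss(x + d)       # one PGD step: ascend the loss
        g_norm = np.linalg.norm(g)
        if g_norm > 0:
            d = d + eps * g / g_norm
        n = np.linalg.norm(d)
        if n > 0:                  # project back onto the L2 sphere ||δ||₂
            d = radius * d / n
    return d
```

The projection step is what makes the attack "adaptive" in this context: the perturbation's L2 norm is held fixed at ‖δ‖₂, so the search stays on the sphere where the certificate is evaluated.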