Average Certified Radius is a Poor Metric for Randomized Smoothing
Authors: Chenhao Sun, Yuhao Mao, Mark Niklas Mueller, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we confirm that existing training strategies, though improving ACR, consistently reduce the model's robustness on hard samples. To strengthen our findings, we propose strategies, including explicitly discarding hard samples, reweighting the dataset with approximate certified radius, and extreme optimization for easy samples, to replicate the progress in RS training and even achieve the state-of-the-art ACR on CIFAR-10, without training for robustness on the full data distribution. Overall, our results suggest that ACR has introduced a strong undesired bias to the field, and its application should be discontinued in RS. Finally, we suggest using the empirical distribution of p_A, the accuracy of the base model on noisy data, as an alternative metric for RS. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, ETH Zurich, Switzerland 2LogicStar.ai. Correspondence to: <EMAIL, EMAIL, EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adaptive Attack: function ADAPTIVEADV(f, x, c, δ, T, ϵ); δ′ ← δ; for t = 1 to T do: if f(x + δ′) ≠ c then break end if; δ′ ← one-step PGD attack on δ′ with step size ϵ; δ′ ← ‖δ‖₂ · δ′ / ‖δ′‖₂; end for; return δ′; end function |
| Open Source Code | Yes | Code and models are available at https://github.com/eth-sri/acr-weakness. |
| Open Datasets | Yes | With these simple modifications to Gaussian training, we achieve the new SOTA in ACR on CIFAR-10 and a competitive ACR on IMAGENET without targeting robustness for the general case (Sections 5 and 6). |
| Dataset Splits | Yes | We follow the standard training protocols in previous works. Specifically, we use ResNet-110 (He et al., 2016) on CIFAR-10 and ResNet-50 on IMAGENET, respectively. |
| Hardware Specification | No | The paper mentions training models and procedures but does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions algorithms like SGD, PGD, and references a certification function from a prior work (Cohen et al., 2019), but does not specify software names with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | For CIFAR-10, we train ResNet-110 for 150 epochs with an initial learning rate of 0.1, decreased by a factor of 10 every 50 epochs. The initial learning rate for training ResNet-50 on IMAGENET is 0.1, decreased by a factor of 10 every 30 epochs within 90 epochs. We investigate three noise levels, σ = 0.25, 0.5, 1.0. We use SGD with momentum 0.9. The following hyperparameters are tuned for each setting: m = the number of noise samples for each input; Et = the epoch at which to discard hard data samples; pt = the threshold on p_A for discarding hard data samples; T = the maximum number of steps of the PGD attack; ϵ = the step size of the PGD attack. All other hyperparameters are fixed to their default values. Specifically, for CIFAR-10 we use 100 noise samples to calculate p_A when discarding hard samples. When updating dataset weights, we use 16 noise samples to calculate p_A, with pmin = 0.75. The hyperparameters tuned on CIFAR-10 are shown in Table 6. For IMAGENET, we skip the Data Reweighting with Certified Radius step for faster training, and use a Gaussian-pretrained ResNet-50 to evaluate p_A with 50 noise samples. The hyperparameters tuned on IMAGENET are shown in Table 7. |
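The Adaptive Attack pseudocode in the table can be sketched in plain NumPy. This is a minimal sketch, not the paper's implementation: `grad_loss` and the L2-normalized gradient ascent step are assumptions standing in for the paper's "one-step PGD attack", and the final line projects the iterate back onto the L2 sphere of radius ‖δ‖₂, as in the pseudocode.

```python
import numpy as np

def adaptive_adv(f, grad_loss, x, c, delta, T, eps):
    """Sketch of Algorithm 1 (Adaptive Attack).

    f:         classifier returning a predicted label
    grad_loss: gradient of the loss w.r.t. the input (assumed helper)
    delta:     initial perturbation; its L2 norm fixes the sphere
               every iterate is projected back onto
    """
    radius = np.linalg.norm(delta)
    d = delta.copy()
    for _ in range(T):
        if f(x + d) != c:          # already misclassified: adversarial found
            break
        g = grad_loss(x + d)       # one PGD step: ascend the loss
        g_norm = np.linalg.norm(g)
        if g_norm > 0:
            d = d + eps * g / g_norm
        n = np.linalg.norm(d)
        if n > 0:                  # project back onto the L2 sphere ||δ||₂
            d = radius * d / n
    return d
```

The projection step is what makes the attack "adaptive" in this context: the perturbation's L2 norm is held fixed at ‖δ‖₂, so the search stays on the sphere where the certificate is evaluated.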