Revisiting adversarial training for the worst-performing class
Authors: Thomas Pethick, Grigorios Chrysos, Volkan Cevher
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate an improvement to 32% in the worst class accuracy on CIFAR10, and we observe consistent behavior across CIFAR100 and STL10. Our study highlights the importance of moving beyond average accuracy, which is particularly important in safety-critical applications. ... We carry out extensive experiments comparing CFOL against three strong baselines across three datasets, where we consistently observe that CFOL improves the weakest classes. ... 5 Experiments |
| Researcher Affiliation | Academia | Thomas Pethick EMAIL École Polytechnique Fédérale de Lausanne (EPFL) Grigorios G Chrysos EMAIL École Polytechnique Fédérale de Lausanne (EPFL) Volkan Cevher EMAIL École Polytechnique Fédérale de Lausanne (EPFL) |
| Pseudocode | Yes | Algorithm 1: Class focused online learning (CFOL) |
| Open Source Code | No | The paper provides pseudocode (Listing 1) and mentions a third-party library 'robustness' with a GitHub URL (Engstrom et al., 2019), but it does not explicitly state that the authors' own implementation code for CFOL is open-source or provide a direct link to their repository. |
| Open Datasets | Yes | We test on three datasets with different dimensionality, number of examples per class and number of classes. Specifically, we consider CIFAR10, CIFAR100 and STL10 (Krizhevsky et al., 2009; Coates et al., 2011) (see Appendix C.2 for further details). ... Tiny ImageNet (Russakovsky et al., 2015) ... Imagenette (https://github.com/fastai/imagenette) |
| Dataset Splits | No | The paper mentions using a "validation set" for early stopping and describes the total number of training examples for some datasets (e.g., "CIFAR10 includes 50,000 training examples"), but it does not specify the exact percentages or absolute counts for training, validation, and test splits needed to reproduce the experiments. For example, it doesn't state how the validation set was created from the training data or the size of the test set. |
| Hardware Specification | No | We use one GPU on an internal cluster. (Appendix C) |
| Software Dependencies | No | The paper mentions "pytorch pseudo code" in Listing 1 but does not specify any version numbers for PyTorch or other software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | Hyper-parameters: Unless otherwise noted, we use the standard adversarial training setup of a ResNet-18 network (He et al., 2016) with a learning rate τ = 0.1, momentum of 0.9, weight decay of 5·10⁻⁴, and batch size of 128, with a piece-wise constant learning rate decay of 0.1 at epochs 100 and 150 for a total of 200 epochs, following Madry et al. (2017). For the attack we similarly adopt the common attack radius of 8/255 using 7 steps of projected gradient descent (PGD) with a step size of 2/255 (Madry et al., 2017). For evaluation we use the stronger 20-step attack throughout, except for Table 7 C where we show robustness against AutoAttack (Croce & Hein, 2020). |
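The attack configuration quoted in the last row (radius 8/255, 7 PGD steps, step size 2/255) follows the standard ℓ∞ PGD recipe of Madry et al. A minimal sketch of that recipe is below; it is not the authors' implementation, and the `grad_fn` here is a toy analytic gradient standing in for backpropagation through the actual network.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=8/255, step=2/255, steps=7, rng=None):
    """L-infinity PGD: ascend the loss via sign-gradient steps, projecting
    each iterate back into the eps-ball around the clean input x and
    clipping to the valid pixel range [0, 1]."""
    rng = rng or np.random.default_rng(0)
    # Random start inside the eps-ball, as is standard for PGD.
    x_adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep valid pixels
    return x_adv

# Toy stand-in for the network loss: L(x) = w . x, so grad(x) = w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.5, 0.5, 0.5])
x_adv = pgd_linf(x, grad_fn=lambda z: w)
# The perturbation never exceeds the attack radius.
assert np.max(np.abs(x_adv - x)) <= 8/255 + 1e-9
```

With 7 steps of size 2/255 the cumulative movement (14/255) exceeds the radius 8/255, so the projection step is what keeps the perturbation inside the ball; the paper's 20-step evaluation attack only changes `steps`.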