Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Catastrophic overfitting can be induced with discriminative non-robust features
Authors: Guillermo Ortiz-Jimenez, Pau de Jorge, Amartya Sanyal, Adel Bibi, Puneet K. Dokania, Pascal Frossard, Grégory Rogez, Philip Torr
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced at much smaller ϵ values than it was observed before just by injecting images with seemingly innocuous features. Through extensive experiments we analyze this novel phenomenon and discover that the presence of these easy features induces a learning shortcut that leads to CO. Our findings provide new insights into the mechanisms of CO and improve our understanding of the dynamics of AT. |
| Researcher Affiliation | Collaboration | Guillermo Ortiz-Jimenez (École Polytechnique Fédérale de Lausanne); Pau de Jorge (University of Oxford; Naver Labs Europe); Amartya Sanyal (ETH Zürich; Max Planck Institute for Intelligent Systems, Tübingen); Adel Bibi (University of Oxford); Puneet K. Dokania (University of Oxford; Five AI Ltd.); Pascal Frossard (École Polytechnique Fédérale de Lausanne); Grégory Rogez (Naver Labs Europe) |
| Pseudocode | No | The paper describes methods and concepts through mathematical formulations and textual descriptions but does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We train a PreActResNet18 (He et al., 2016) on different intervened versions of CIFAR-10 (Krizhevsky & Hinton, 2009) using FGSM-AT for different robustness budgets ϵ and scales β. Similarly to Section 3 we modify the SVHN, CIFAR-100 and higher resolution ImageNet-100 and TinyImageNet datasets to inject highly discriminative features v(y). |
| Dataset Splits | No | The paper refers to using specific datasets (e.g., CIFAR-10, SVHN) and mentions training and testing, but it does not explicitly provide details about the splits used (e.g., percentages for training/validation/test sets) or cite a specific standard split setup for reproducibility. |
| Hardware Specification | Yes | All our experiments were performed using a cluster equipped with GPUs of various architectures. The estimated compute budget required to produce all results in this work is around 2,000 GPU hours (in terms of NVIDIA V100 GPUs). |
| Software Dependencies | No | The paper mentions methods like PGD-AT and FGSM, and architectures like PreActResNet18, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | Adversarial training for all methods and datasets follows the fast training schedules with a cyclic learning rate introduced in Wong et al. (2020). We train for 30 epochs on CIFAR (Krizhevsky & Hinton, 2009) and 15 epochs for SVHN (Netzer et al., 2011), following Andriushchenko & Flammarion (2020). When we perform PGD-AT we use 10 steps and a step size α = 2/255; FGSM uses a step size of α = ϵ. Regularization parameters for GradAlign (Andriushchenko & Flammarion, 2020) and N-FGSM (de Jorge et al., 2022) vary and are stated when relevant in the paper. The architecture employed is a PreActResNet18 (He et al., 2016). |
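The setup row above refers to single-step FGSM attacks with step size α = ϵ. As a hedged illustration only (the paper releases no code, so this is not the authors' implementation), the sketch below shows the standard FGSM perturbation δ = ϵ · sign(∇ₓ L) on a toy logistic-regression model in NumPy; the function name and toy model are assumptions for illustration.

```python
import numpy as np

def fgsm_perturbation(x, w, b, y, eps):
    """Single-step FGSM perturbation for a binary logistic model.

    Illustrative sketch: computes delta = eps * sign(grad_x loss),
    where loss is binary cross-entropy of sigmoid(w.x + b) vs. label y.
    """
    z = np.dot(w, x) + b                 # logit
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid probability
    grad_x = (p - y) * w                 # d(BCE)/dx for logistic regression
    return eps * np.sign(grad_x)         # one FGSM step with alpha = eps

# Example with the common CIFAR-10 budget eps = 8/255
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
w = rng.standard_normal(10)
delta = fgsm_perturbation(x, w, 0.0, 1, eps=8 / 255)
```

Because FGSM takes the sign of the gradient, every coordinate of δ sits exactly at ±ϵ, which is the single-step behavior (α = ϵ) the reviewed paper studies in connection with catastrophic overfitting.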