Rethinking Robustness in Machine Learning: A Posterior Agreement Approach
Authors: João B. S. Carvalho, Víctor Jiménez Rodríguez, Alessandro Torcinovich, Antonio Emanuele Cinà, Carlos Cotrini, Lea Schönherr, Joachim M. Buhmann
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assess the soundness of our measure in controlled environments and through an empirical robustness analysis in two different covariate shift scenarios: adversarial learning and domain generalization. We illustrate the suitability of PA by evaluating several models under shifts of different nature and magnitude, and different proportions of affected observations. The results show that PA offers a reliable analysis of the vulnerabilities in learning algorithms across different shift conditions and provides higher discriminability than accuracy-based measures, while requiring no supervision. |
| Researcher Affiliation | Academia | João B. S. Carvalho, Department of Computer Science, ETH Zurich; Víctor Jiménez Rodríguez, Department of Computer Science, ETH Zurich; Alessandro Torcinovich, Faculty of Engineering, Free University of Bozen-Bolzano and Department of Computer Science, ETH Zurich; Antonio E. Cinà, Department of Computer Science, University of Genoa; Carlos Cotrini, Department of Computer Science, ETH Zurich; Lea Schönherr, CISPA Helmholtz Center for Information Security; Joachim M. Buhmann, Department of Computer Science, ETH Zurich |
| Pseudocode | No | The paper describes mathematical derivations and theoretical properties but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | For further technical information, the reader is referred to our code implementation. PA: https://github.com/viictorjimenezzz/pa-metric; Experiments: https://github.com/viictorjimenezzz/pa-covariate-shift |
| Open Datasets | Yes | For the adversarial robustness scenarios, we carry out our experiments with the CIFAR-10 (Krizhevsky et al., 2009) and the ImageNet (Deng et al., 2009) datasets, widely adopted in the machine learning security literature as benchmarks for robustness evaluation. ... In this scenario, we conduct our experiments through a modified version of the DiagViB-6 dataset (Eulig et al., 2021) that comprises distorted and upsampled coloured images of size 128×128 from the MNIST dataset (LeCun, 1998). |
| Dataset Splits | Yes | The CIFAR-10 dataset contains 60 000 colour images of 32×32 pixels equally distributed in 10 classes. The analyzed models are trained on the training set (50 000 images), and the PA evaluation is performed on the test set (10 000 images). ... Our final dataset comprises two sets of 40 000 images for training, two sets of 20 000 images for validation, and six sets of 10 000 images for testing. |
| Hardware Specification | No | We attested a relatively fast optimization of the β parameter, usually on the order of tens of minutes on a single GPU even for large datasets (i.e., ImageNet). This only mentions "single GPU" without any specific model or other hardware specifications. |
| Software Dependencies | No | The paper mentions using the "Adam (Kingma & Ba, 2015) optimization procedure", the "RobustBench library (Croce et al., 2021)", and the "AutoAttack library (Croce & Hein, 2020b)". However, no specific version numbers are provided for these software components or libraries. |
| Experiment Setup | Yes | PGD and FMN attacks are run for 1000 steps, while the number of steps in AutoAttack is determined automatically. The β parameter is searched with an Adam (Kingma & Ba, 2015) optimization procedure, run for 500 epochs. ... Adam is run for 1000 epochs to search for the β parameter. ... we used ERM and IRM algorithms to train a ResNet18 model for 50 epochs on Dtrain, using Adam with a learning rate of 10⁻². |
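The setup rows above mention searching a β parameter that governs a Gibbs posterior before the PA score is read off. The snippet below is a minimal NumPy sketch of one plausible form of such a posterior-agreement kernel, evaluated on two sets of classifier logits under covariate shift; the exact objective and the Adam-based β optimization live in the authors' `pa-metric` repository, so the kernel formula, the grid search in place of Adam, and all function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gibbs_posterior(logits, beta):
    # Gibbs posterior with inverse temperature beta: p(c|x) ∝ exp(beta * logit_c)
    z = beta * logits
    z -= z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def posterior_agreement(logits_a, logits_b, betas):
    # Hypothetical PA kernel: expected log of the (class-count-scaled) overlap
    # between the two Gibbs posteriors, maximized over beta.
    # A plain grid search stands in for the paper's Adam-based beta search.
    n_classes = logits_a.shape[1]
    best = -np.inf
    for beta in betas:
        pa = gibbs_posterior(logits_a, beta)
        pb = gibbs_posterior(logits_b, beta)
        overlap = (pa * pb).sum(axis=1)          # per-sample posterior agreement
        score = np.log(n_classes * overlap).mean()
        best = max(best, score)
    return best
```

Under this sketch, logits that agree across the two shift conditions yield a higher PA score than logits whose predictions flip, which matches the intended use of PA as an unsupervised robustness signal.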