Certification for Differentially Private Prediction in Gradient-Based Training

Authors: Matthew Robert Wicker, Philip Sosnin, Igor Shilov, Adrianna Janik, Mark Niklas Mueller, Yves-Alexandre De Montjoye, Adrian Weller, Calvin Tsay

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results across real-world datasets in medical image classification and natural language processing demonstrate that our sensitivity bounds can be orders of magnitude tighter than global sensitivity. Our approach provides a strong basis for the development of novel privacy-preserving technologies. 6. Experiments: In this section, we present experimental validation of our proposed private prediction mechanisms. Comprehensive details on datasets, models, training configurations, and additional results can be found in Appendix E. We evaluate our approach across three binary classification tasks: Blobs: training a logistic regression on a blobs dataset generated from isotropic Gaussian distributions. Medical Imaging: fine-tuning the final dense layers of a convolutional neural network to distinguish an unseen diseased class in retinal OCT images. Sentiment Classification: training a neural network to perform sentiment analysis using GPT-2 embeddings of the IMDB movie reviews dataset.
Researcher Affiliation Collaboration 1Department of Computing, Imperial College London, London, UK 2The Alan Turing Institute, London, UK 3Accenture Labs, Dublin, Ireland 4Department of Computer Science, ETH Zurich, Zurich, Switzerland 5Logic Star.ai, Zurich, Switzerland 6Department of Engineering, University of Cambridge, Cambridge, UK.
Pseudocode Yes Algorithm 1 ABSTRACT GRADIENT TRAINING FOR COMPUTING VALID PARAMETER-SPACE BOUNDS
Open Source Code Yes Code to reproduce our experiments is available at [redacted for anonymity].
Open Datasets Yes Medical Imaging: fine-tuning the final dense layers of a convolutional neural network to distinguish an unseen diseased class in retinal OCT images. Sentiment Classification: training a neural network to perform sentiment analysis using GPT-2 embeddings of the IMDB movie reviews dataset. Classification of medical images from the retinal OCT (OCTMNIST) dataset of MedMNIST (Yang et al., 2021). IMDb movie review dataset (Maas et al., 2011). American Express default prediction task: this tabular dataset (see www.kaggle.com/competitions/amex-default-prediction/; accessed 05/2024) comprises 5.4 million total entries of real customer data, and models must predict whether a customer will default on their credit card debt.
Dataset Splits No The paper discusses various datasets (Blobs, OCT-MNIST, IMDB, American Express) and mentions concepts like training data, test set queries, and held-out data points, but it does not provide specific train/validation/test split percentages, sample counts for each split, or citations to standard predefined splits for all datasets that would allow full reproduction of the data partitioning. For instance, for IMDB, it mentions "40,000 samples" and labeling "100 data points held out from the training dataset" but not the overall train/test/val breakdown.
Hardware Specification Yes All experiments are run on a server with 2x AMD EPYC 9334 CPUs and 2x NVIDIA L40 GPUs.
Software Dependencies No The paper mentions using the "Opacus library (Yousefpour et al., 2021)" and PyTorch but does not provide specific version numbers for these software components. Full reproducibility requires specific version numbers for key software dependencies.
Experiment Setup Yes The model is trained for four epochs with hyperparameters set to b = 3000, α = 1.0, η = 0.6, and γ = 0.06. The hyperparameters used for fine-tuning with AGT are E = 4, α = 0.06, η = 0.5; the batch size is chosen to be the maximum possible for each ensemble size T. In our experiments in the main text we choose to train with hyperparameters E = 3, α = 0.2, η = 0.5, γ = 0.04, using the maximum possible batch size available to each ensemble member.
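The hyperparameter settings quoted above can be collected into configuration dictionaries, which makes the three reported setups easy to compare at a glance. This is a minimal illustrative sketch: the key names (epochs, batch_size, alpha, eta, gamma) are our own labels for the reported symbols (E, b, α, η, γ), not the authors' actual code or API.

```python
# Hypothetical configs mirroring the hyperparameters reported in the paper.
# Key names are illustrative; symbols in comments map to the paper's notation.

BLOBS_CONFIG = {
    "epochs": 4,         # "trained for four epochs"
    "batch_size": 3000,  # b = 3000
    "alpha": 1.0,        # α = 1.0
    "eta": 0.6,          # η = 0.6
    "gamma": 0.06,       # γ = 0.06
}

AGT_FINETUNE_CONFIG = {
    "epochs": 4,    # E = 4
    "alpha": 0.06,  # α = 0.06
    "eta": 0.5,     # η = 0.5
    # batch size: maximum possible for each ensemble size T
}

MAIN_TEXT_CONFIG = {
    "epochs": 3,   # E = 3
    "alpha": 0.2,  # α = 0.2
    "eta": 0.5,    # η = 0.5
    "gamma": 0.04, # γ = 0.04
    # batch size: maximum available to each ensemble member
}

def describe(cfg: dict) -> str:
    """Render a config as a compact, sorted key=value string for logging."""
    return ", ".join(f"{k}={v}" for k, v in sorted(cfg.items()))

print(describe(BLOBS_CONFIG))
```

Recording each setup this way (and logging the rendered string alongside results) is one simple habit that would address the reproducibility gaps noted in the rows above.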