Prediction-Powered Adaptive Shrinkage Estimation

Authors: Sida Li, Nikolaos Ignatiadis

ICML 2025

Reproducibility assessment. Each variable below is listed with its result, followed by the supporting LLM response.
Research Type: Experimental
"Experiments on both synthetic and real-world datasets show that PAS adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications."
Researcher Affiliation: Academia
"1 Data Science Institute, The University of Chicago; 2 Department of Statistics, The University of Chicago. Correspondence to: Sida Li <EMAIL>."
Pseudocode: Yes
"A pseudo-code implementation is also presented in Algorithm 1."
Open Source Code: Yes
"The code for reproducing the experiments is available at https://github.com/listar2000/predictionpowered-adaptive-shrinkage."
Open Datasets: Yes
"Fisch et al. (2024) have shown improvements in estimating the fraction of spiral galaxies using predictions on images from the Galaxy Zoo 2 dataset (Willett et al., 2013)." "Amazon Review Ratings (SNAP, 2014). The Amazon Fine Food Reviews dataset, provided by the Stanford Network Analysis Project (SNAP; SNAP (2014)) on Kaggle."
Dataset Splits: Yes
"For both datasets, we randomly split the data points of each problem (a food product or galaxy subgroup) into a labeled and unlabeled partition with a 20/80 ratio."
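The 20/80 labeled/unlabeled partition quoted above can be sketched as follows. This is an illustrative reconstruction, not code from the paper's repository; the function name and fixed seed are assumptions.

```python
import random

def labeled_unlabeled_split(items, labeled_frac=0.2, seed=0):
    """Randomly partition one problem's data points into a labeled
    and an unlabeled subset (20/80 by default, as described above)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_labeled = int(round(labeled_frac * len(shuffled)))
    return shuffled[:n_labeled], shuffled[n_labeled:]

# Example: 100 data points for one problem -> 20 labeled, 80 unlabeled.
labeled, unlabeled = labeled_unlabeled_split(range(100))
```

In the paper's setting this split would be applied per problem (each food product or galaxy subgroup), so every problem contributes both labeled and unlabeled points.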
Hardware Specification: Yes
"All the experiments were conducted on a compute cluster with an Intel Xeon Silver 4514Y (16-core) CPU, an Nvidia A100 (80 GB) GPU, and 64 GB of memory."
Software Dependencies: No
The paper mentions software such as Hugging Face's transformers library (Wolf, 2019), the bert-base-multilingual-uncased-sentiment model (Town, 2023), the ResNet50 architecture (He et al., 2016), and the Adam optimizer (Kingma & Ba, 2015). However, specific version numbers for these libraries and frameworks are not provided, which is required for a reproducible description of ancillary software.
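Since the gap flagged here is missing version numbers, a minimal sketch of how exact versions of ancillary packages could be recorded for a reproducibility report (the helper name is illustrative, and the packages queried are examples):

```python
from importlib import metadata

def report_versions(packages):
    """Look up the installed version of each named package so a
    report can cite exact releases; mark absent packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# e.g. report_versions(["transformers", "torch"]) in the paper's environment
```

Shipping such a listing (or a pinned requirements file) alongside the code would resolve the "No" result for this variable.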
Experiment Setup: Yes
"We use a batch size of 256 and Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-3. After 20 epochs, the model achieves 87% training accuracy and 83% test accuracy."