Algorithms with Calibrated Machine Learning Predictions

Authors: Judy Hanwen Shen, Ellen Vitercik, Anders Wikum

ICML 2025

Reproducibility assessment: each variable below lists the assessed result, followed by the supporting LLM response quoted from the paper.
Research Type: Experimental
"Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions." ... "We validate our theoretical findings with strong empirical results on real-world data, highlighting the practical benefits of our approach." ... "We now evaluate our algorithms on two real-world datasets, demonstrating the utility of using calibrated predictions."
Researcher Affiliation: Academia
"1 Department of Computer Science, Stanford University, Stanford, CA, USA; 2 Department of Management Science & Engineering, Stanford University, Stanford, CA, USA. Correspondence to: Anders Wikum <EMAIL>."
Pseudocode: Yes
"Algorithm 1 (A_k)" ... "Algorithm 2 (Sun et al., 2024): optimal ski rental with conformal predictions" ... "Algorithm 3: β-threshold rule"
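The paper's exact pseudocode for Algorithm 3 is not reproduced in this report, but the generic β-threshold rule from the ski-rental literature is easy to sketch: rent until the cumulative rental cost reaches a β fraction of the buy price, then buy. A minimal sketch under that assumption (the threshold form and names here are ours, not necessarily the paper's):

```python
import math

def beta_threshold_cost(n_days: int, buy_cost: int, beta: float) -> int:
    """Cost of the beta-threshold rule: rent through day ceil(beta * buy_cost) - 1,
    then buy at the start of day ceil(beta * buy_cost) if skiing continues."""
    threshold = math.ceil(beta * buy_cost)  # day on which we would buy
    if n_days < threshold:
        return n_days  # season ended first: rented every day, never bought
    return (threshold - 1) + buy_cost  # rented threshold-1 days, then bought
```

With β = 1 this is the classic break-even rule, which is 2-competitive: in the worst case (the season ends the day we buy) we pay (b − 1) + b < 2b, while the offline optimum pays min(n, b) = b.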
Open Source Code: Yes
"Code and data available here: https://github.com/heyyjudes/algs-cali-pred"
Open Datasets: Yes
"To model the rent-or-buy scenario in the ski rental problem, we use publicly available Citi Bike usage data." ... "Monthly usage data is publicly available at https://citibikenyc.com/system-data." ... "We use a real-world dataset for sepsis prediction to validate our theory results for scheduling with calibrated predictions. Sepsis Survival Minimal Clinical Records. This dataset contains three characteristics: age, sex, and number of sepsis episodes." ... "https://archive.ics.uci.edu/dataset/827/sepsis+survival+minimal+clinical+records"
Dataset Splits: No
The paper mentions using a "validation set" for calibration in Appendix C: "A key intervention we make for calibration is to calibrate according to balanced classes in the validation set when the label distribution is highly skewed." However, it does not specify how the training, validation, and test splits were constructed (e.g., exact percentages or sample counts), either in the main text or in the appendix.
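The balanced-class intervention quoted above is most naturally implemented by downsampling the majority class in the validation set before fitting the calibrator. A minimal sketch of that idea (the procedure is our guess, since the paper omits the split details):

```python
import numpy as np

def balance_classes(scores, labels, seed=0):
    """Downsample the majority class so the calibration set has equal class counts."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n = min(len(pos), len(neg))  # size of the smaller class
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    return scores[keep], labels[keep]
```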
Hardware Specification: No
The paper does not provide hardware details such as GPU/CPU models, memory amounts, or other machine specifications used to run its experiments.
Software Dependencies: No
The paper names software components such as XGBoost, logistic regression, multi-layer perceptrons, Linear Regression, Bayesian Ridge Regression, SGD Regressor, and Elastic Net, along with calibration methods such as histogram calibration, binned calibration, and Platt scaling. However, it does not specify version numbers for any of these dependencies, which replication would require.
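Version numbers aside, the calibration methods named above are standard. For instance, histogram (binned) calibration maps each raw model score to the empirical positive rate of its score bin. A minimal NumPy sketch (the bin count and fallback rule are our own choices, not taken from the paper):

```python
import numpy as np

def fit_histogram_calibrator(scores, labels, n_bins=10):
    """Learn a binned calibration map: score bin -> empirical positive rate."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    rates = np.empty(n_bins)
    for b in range(n_bins):
        mask = bin_idx == b
        # Fall back to the bin midpoint when a bin receives no samples.
        rates[b] = labels[mask].mean() if mask.any() else (edges[b] + edges[b + 1]) / 2
    return edges, rates

def apply_histogram_calibrator(scores, edges, rates):
    """Replace each score with the positive rate of its bin."""
    bin_idx = np.clip(np.digitize(scores, edges[1:-1]), 0, len(rates) - 1)
    return rates[bin_idx]
```

Platt scaling instead fits a logistic map sigmoid(a·s + b) to the scores, but the binned version suffices to illustrate the calibration step.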
Experiment Setup: No
The paper provides some model-architecture details, such as "a small MLP with two hidden layers of size 8 and 2," and lists the features used for training. However, it does not give concrete training hyperparameters (learning rates, batch sizes, number of epochs, or optimizer settings) for any of the machine learning models used in the experiments.
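The one architectural detail the paper does give (hidden layers of size 8 and 2) pins down the forward pass up to the choice of activations. A minimal sketch under assumed ReLU hidden units and a sigmoid output (neither is stated in the paper):

```python
import numpy as np

def init_mlp(n_features, seed=0):
    """Random weights for an MLP with hidden layers of size 8 and 2 (per the paper)."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, 8, 2, 1]
    return [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, params):
    """Assumed activations: ReLU hidden layers, sigmoid output probability."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))
```

Without the missing learning rate, batch size, epoch count, and optimizer, the training loop itself cannot be reconstructed, which is exactly the gap this row flags.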