Multiaccuracy and Multicalibration via Proxy Groups

Authors: Beepul Bharti, Mary Versa Clemens-Sewall, Paul Yi, Jeremias Sulam

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group data is incomplete or unavailable." Experimental results are detailed in Section 6.
Researcher Affiliation | Academia | (1) Department of Biomedical Engineering, Johns Hopkins University, Baltimore, USA; (2) Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, USA; (3) Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, USA; (4) St. Jude Children's Research Hospital, Arlington, USA; (5) Department of Computer Science, Johns Hopkins University, Baltimore, USA.
Pseudocode | Yes | Algorithm 1: Multiaccuracy Regression; Algorithm 2: Multicalibration Boosting.
Open Source Code | Yes | The code necessary to reproduce the experiments is available at https://github.com/Sulam-Group/proxy_ma-mc.
Open Datasets | Yes | "We illustrate various aspects of our theoretical results on two tabular datasets, ACSIncome and ACSPublicCoverage (Ding et al., 2021), as well as on the CheXpert medical imaging dataset (Irvin et al., 2019)."
Dataset Splits | Yes | For the ACS datasets, a fixed 10% of the samples is held out as the evaluation set. The remaining 90% is split into training and validation sets, with 60% used for training the model f and the proxies Ĝ and 30% for adjusting f. All reported results are averages over five train/validation splits, evaluated on the fixed evaluation set. For CheXpert, the training, calibration, and evaluation splits provided by Glocker et al. (2023) are used.
Hardware Specification | No | The paper describes using a DenseNet-121 model pretrained on ImageNet for feature extraction and end-to-end training, but it does not specify any hardware details (e.g., GPU models, CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using logistic regression, decision trees, random forests, and DenseNet-121 models, but it does not provide specific version numbers for any software libraries or frameworks used (e.g., PyTorch 1.9, TensorFlow 2.x, scikit-learn 1.x).
Experiment Setup | No | The paper describes the types of models used (logistic regression, decision tree, random forest, DenseNet-121) and the datasets, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other detailed training configurations for these models.
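The paper's Algorithms 1 and 2 are not reproduced in this report. As a rough, hypothetical illustration of the general multiaccuracy post-processing idea they build on (iteratively auditing a predictor's residuals over a class of functions standing in for proxy groups, and correcting in the direction of any detected bias), the following sketch uses a simple least-squares auditor. Function names, the auditor class, and all parameter values are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def ls_auditor(X, r):
    """Least-squares auditor (illustrative stand-in for auditing over proxy
    groups): fit a linear function of the features to the residuals r and
    return it as a callable."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, r, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

def multiaccuracy_boost(f0, X, y, audit=ls_auditor, eta=0.5, rounds=20, tol=1e-4):
    """Iteratively correct scores f0 until the auditor can no longer find a
    direction (a proxy group) with substantial residual bias."""
    f = np.clip(np.asarray(f0, dtype=float), 1e-6, 1 - 1e-6)
    for _ in range(rounds):
        r = y - f                    # residuals of the current predictor
        h = audit(X, r)              # auditor's best approximation of the residual
        corr = h(X)
        if np.mean(corr * r) < tol:  # no detectable residual bias left: stop
            break
        f = np.clip(f + eta * corr, 1e-6, 1 - 1e-6)  # correct toward the bias
    return f
```

A richer auditor class (e.g. decision trees over features plus proxy-group predictions) would detect correspondingly richer families of subgroup biases; the loop structure is unchanged.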
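The ACS split protocol described in the Dataset Splits row (a fixed 10% evaluation set, with the remainder divided 60%/30% between training and adjustment) can be sketched as follows; the function name, seed handling, and exact rounding are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def acs_style_split(n, seed=0):
    """Split n sample indices per the reported ACS protocol: 10% evaluation,
    60% training (model f and proxies), 30% validation (adjusting f)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_eval = int(0.1 * n)            # fixed 10% evaluation set
    n_train = int(0.6 * n)           # 60% for training f and the proxies
    eval_idx = idx[:n_eval]
    train_idx = idx[n_eval:n_eval + n_train]
    val_idx = idx[n_eval + n_train:] # remaining ~30% for adjusting f
    return train_idx, val_idx, eval_idx
```

Repeating the call with five different seeds while holding the evaluation indices fixed would mirror the paper's averaging over five train/validation splits.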