Reconciling Model Multiplicity for Downstream Decision Making
Authors: Ally Yalei Du, Dung Daniel Ngo, Zhiwei Steven Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide a set of experiments to evaluate our methods empirically. Compared to existing work, our proposed algorithm creates a pair of predictive models with improved downstream decision-making losses and agrees on their best-response actions almost everywhere. ... In Section 4, we empirically evaluate the performance of the proposed algorithm on real-world datasets and show our improvement over the benchmark prior work in resolving disagreement in downstream decision-making tasks. |
| Researcher Affiliation | Academia | Ally Yalei Du* Carnegie Mellon University EMAIL Dung Daniel Ngo* University of Minnesota EMAIL Zhiwei Steven Wu Carnegie Mellon University EMAIL |
| Pseudocode | Yes | Algorithm 1: Decision Calibration ... Algorithm 2: Reconcile Decision Calibration (ReDCal) ... Algorithm 3: Reconcile (Roth et al., 2023) ... Algorithm 4: Reconcile Decision Calibration for Multiple Predictors (ReDCal-Multi) ... Algorithm 5: Decision Calibration for Infinite Action Set ... Algorithm 6: Reconcile Decision Calibration for Infinite Action Set (ReDCal-Inf) |
| Open Source Code | No | The paper does not provide explicit statements about releasing their code, nor does it include any links to code repositories. It mentions using 'PyTorch' and pre-trained models from it, but this refers to external tools, not the authors' specific implementation. |
| Open Datasets | Yes | We use the ImageNet dataset (Deng et al., 2009) ... We use the HAM10000 dataset (Tschandl et al., 2018) on pigmented skin lesions |
| Dataset Splits | Yes | Among the 50000 validation samples, we use 40000 samples for calibration and 10000 samples for testing. ... We split the dataset into train/validation/test sets, with 20% of the data used for validation and 20% used for testing. |
| Hardware Specification | Yes | Our ImageNet experiments are run on a Macbook Pro with 32GB of RAM. The experiment on the HAM10000 dataset includes neural network models trained using PyTorch on NVIDIA GA100 GPU (80 GB of RAM) with 2 compute workers loading the data. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and specific pre-trained models like 'inception-v3 (Szegedy et al., 2015)', 'resnet50 (He et al., 2015)', and 'densenet121 (Huang et al., 2018)'. However, it does not specify version numbers for PyTorch or any other software libraries used, which is required for reproducibility. |
| Experiment Setup | Yes | The hyperparameters are chosen as follows: loss margin α = 0.001, disagreement region mass η = 0.01, decision-calibration tolerance β = 0.00001, and the number of actions K = 10. ... The hyperparameters for Algorithm 2 are chosen as follows: loss margin α = 0.1, target disagreement region mass η = 0.01, and decision-calibration tolerance β = 0.000001. |
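The reported splits and hyperparameters can be collected into a small configuration sketch for anyone attempting reproduction. This is a minimal illustration assuming plain Python dicts; the variable names (`imagenet_hparams`, `ham10000_hparams`) are hypothetical and do not come from the authors' code.

```python
# Hyperparameters quoted in the paper for the two experiments.
# Key names are illustrative shorthand for the Greek symbols in the text.
imagenet_hparams = {
    "alpha": 0.001,  # loss margin
    "eta": 0.01,     # disagreement region mass
    "beta": 1e-5,    # decision-calibration tolerance
    "K": 10,         # number of actions
}
ham10000_hparams = {
    "alpha": 0.1,    # loss margin
    "eta": 0.01,     # target disagreement region mass
    "beta": 1e-6,    # decision-calibration tolerance
}

# ImageNet split described in the paper: of the 50,000 validation
# samples, 40,000 are used for calibration and the rest for testing.
n_val = 50_000
n_calibration = 40_000
n_test = n_val - n_calibration
print(n_test)  # 10000
```

Note that the decision-calibration tolerance differs by an order of magnitude between the two experiments (1e-5 vs. 1e-6), which a reproduction attempt would need to track per dataset.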