Reconciling Model Multiplicity for Downstream Decision Making
Authors: Ally Yalei Du, Dung Daniel Ngo, Zhiwei Steven Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide a set of experiments to evaluate our methods empirically. Compared to existing work, our proposed algorithm creates a pair of predictive models with improved downstream decision-making losses and agrees on their best-response actions almost everywhere. ... In Section 4, we empirically evaluate the performance of the proposed algorithm on real-world datasets and show our improvement over the benchmark prior work in resolving disagreement in downstream decision-making tasks. |
| Researcher Affiliation | Academia | Ally Yalei Du* Carnegie Mellon University EMAIL Dung Daniel Ngo* University of Minnesota EMAIL Zhiwei Steven Wu Carnegie Mellon University EMAIL |
| Pseudocode | Yes | Algorithm 1: Decision Calibration ... Algorithm 2: Reconcile Decision Calibration (ReDCal) ... Algorithm 3: Reconcile (Roth et al., 2023) ... Algorithm 4: Reconcile Decision Calibration for Multiple Predictors (ReDCal-Multi) ... Algorithm 5: Decision Calibration for Infinite Action Set ... Algorithm 6: Reconcile Decision Calibration for Infinite Action Set (ReDCal-Inf) |
| Open Source Code | No | The paper does not provide explicit statements about releasing their code, nor does it include any links to code repositories. It mentions using 'PyTorch' and pre-trained models from it, but this refers to external tools, not the authors' specific implementation. |
| Open Datasets | Yes | We use the ImageNet dataset (Deng et al., 2009) ... We use the HAM10000 dataset (Tschandl et al., 2018) on pigmented skin lesions |
| Dataset Splits | Yes | Among the 50000 validation samples, we use 40000 samples for calibration and 10000 samples for testing. ... We split the dataset into train/validation/test sets, with 20% of the data used for validation and 20% used for testing. |
| Hardware Specification | Yes | Our ImageNet experiments are run on a Macbook Pro with 32GB of RAM. The experiment on the HAM10000 dataset includes neural network models trained using PyTorch on NVIDIA GA100 GPU (80 GB of RAM) with 2 compute workers loading the data. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and specific pre-trained models like 'inception-v3 (Szegedy et al., 2015)', 'resnet50 (He et al., 2015)', and 'densenet121 (Huang et al., 2018)'. However, it does not specify version numbers for PyTorch or any other software libraries used, which is required for reproducibility. |
| Experiment Setup | Yes | The hyperparameters are chosen as follows: loss margin α = 0.001, disagreement region mass η = 0.01, decision-calibration tolerance β = 0.00001, and the number of actions K = 10. ... The hyperparameters for Algorithm 2 are chosen as follows: loss margin α = 0.1, target disagreement region mass η = 0.01, and decision-calibration tolerance β = 0.000001. |
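The reported splits and hyperparameters can be collected into a small configuration sketch for anyone attempting reproduction. This is a minimal illustration assuming plain Python dicts; the variable names (`imagenet_hparams`, `ham10000_hparams`) are hypothetical and do not come from the authors' code.

```python
# Hyperparameters quoted in the paper for the two experiments.
# Key names are illustrative shorthand for the Greek symbols in the text.
imagenet_hparams = {
    "alpha": 0.001,  # loss margin
    "eta": 0.01,     # disagreement region mass
    "beta": 1e-5,    # decision-calibration tolerance
    "K": 10,         # number of actions
}
ham10000_hparams = {
    "alpha": 0.1,    # loss margin
    "eta": 0.01,     # target disagreement region mass
    "beta": 1e-6,    # decision-calibration tolerance
}

# ImageNet split described in the paper: of the 50,000 validation
# samples, 40,000 are used for calibration and the rest for testing.
n_val = 50_000
n_calibration = 40_000
n_test = n_val - n_calibration
print(n_test)  # 10000
```

Note that the decision-calibration tolerance differs by an order of magnitude between the two experiments (1e-5 vs. 1e-6), which a reproduction attempt would need to track per dataset.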