Confounding-Robust Deferral Policy Learning

Authors: Ruijiang Gao, Mingzhang Yin

AAAI 2025

Reproducibility assessment (variable — result — LLM response):

- Research Type — Experimental. "The empirical and theoretical analyses demonstrate the efficacy of our approach in mitigating unobserved confounding and improving the overall performance of human-AI collaborations. ... We report empirical findings to examine the advantages of Human-AI complementary and being robust to unobserved confounding. Our first experiment demonstrates the benefit of human-AI collaboration within a controlled environment. Our subsequent experiments consider two real-world examples in financial lending and healthcare industry. ... 5.1 Synthetic Experiment ... 5.2 Real-World Examples ... 5.3 Real Human Responses ... 5.4 Ablation Studies"
- Researcher Affiliation — Academia. "1Naveen Jindal School of Management, University of Texas at Dallas, Richardson, TX 75082; 2Warrington College of Business, University of Florida, Gainesville, FL 32611; EMAIL, EMAIL"
- Pseudocode — Yes. "Algorithm 1: Confounding-Robust Deferral Collaboration (Conf HAI/Conf HAIPerson)"
- Open Source Code — Yes. "Code and appendix are available at https://github.com/ruijiang81/Confound_L2D."
- Open Datasets — Yes. "We use the Home Equity Line of Credit (HELOC) dataset which contains anonymized information about credit applications by real homeowners. ... We use the data from the International Stroke Trial (Group 1997) ... We use the scientific annotation dataset FOCUS (Rzhetsky, Shatkay, and Wilbur 2009)"
- Dataset Splits — No. The paper mentions "We train a logistic regression on 10% of the data to simulate nominal policies" for the HELOC dataset, but does not provide specific train/test/validation splits (e.g., percentages or counts) for the experiments conducted. No explicit splitting methodology or predefined splits are mentioned for any of the datasets used.
- Hardware Specification — No. The paper does not explicitly describe any specific hardware (e.g., GPU models, CPU types, memory details, or cloud instance specifications) used for running the experiments.
- Software Dependencies — No. The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation or experimentation.
- Experiment Setup — Yes. "We use the logistic policies for the policy and router model classes. The baseline policy is set as the never-treat policy πc(0|x) = 1 (Kallus and Zhou 2018). ... We set log(Γ) = 2.5, C(x) = 0 and vary the log-confounding parameter in {0.01, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4}. ... We simulate three human workers with log(Γ) = 1, 2.5, 4, respectively ... We assume there are three human decision makers with log(Γ) = [0.1, 0.1, 1] ... We train a logistic regression on 10% of the data to simulate nominal policies ... For each experiment, we try three log(Γ) specifications: [0.1, 0.1, 0.1], [0.1, 0.1, 1] and [1, 1, 1] ... We use the synthetic data setup and vary the human cost from 0 to 0.3."
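The experiment-setup excerpt quoted above can be sketched in code. This is a hypothetical illustration only, not the authors' released implementation: the data dimensions, sampling, and parameterization of the logistic policy are assumptions, while the never-treat baseline πc(0|x) = 1 and the log(Γ) grid come directly from the quoted text.

```python
# Hypothetical sketch of the quoted synthetic setup; the paper's actual
# data-generating process and policy classes may differ.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5                      # assumed sample size and feature dimension
X = rng.normal(size=(n, d))        # assumed covariates


def never_treat_policy(x):
    """Baseline policy pi_c(0|x) = 1: always selects action 0."""
    return 0


def logistic_policy(x, theta):
    """Logistic policy class, used for both the policy and the router."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return int(rng.binomial(1, p))


# Grid of log-confounding parameters swept in the synthetic experiment.
log_gamma_grid = [0.01, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]

theta = rng.normal(size=d)         # assumed policy parameters
actions = np.array([logistic_policy(x, theta) for x in X])
baseline = np.array([never_treat_policy(x) for x in X])
```

Under this sketch, `baseline` is identically zero (the never-treat policy), and robustness would be evaluated by re-running policy learning at each value in `log_gamma_grid`.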