Learning Personalized Decision Support Policies

Authors: Umang Bhatt, Valerie Chen, Katherine M. Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, Ameet Talwalkar

AAAI 2025

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response:
Research Type: Experimental
  "Our computational experiments explore the utility of personalization across multiple expertise profiles. ... To validate Modiste on real users (N = 80), we conduct human subject experiments, where we explore forms of support that include expert consensus, outputs from an LLM, or predictions from a classification model. ... we demonstrate how Modiste can be used to learn personalized decision support policies online on both vision and language tasks."
Researcher Affiliation: Academia
  New York University; The Alan Turing Institute; Carnegie Mellon University; University of Cambridge
Pseudocode: Yes
  Algorithm 1: Learning a decision support policy
  1: Input: human decision-maker h
  2: Initialization: data buffer D_0 = {}; human error estimates {b̂_{A_i,0}(x; h) = 0.5 : x ∈ X, A_i ∈ A}; initial policy π_1
  3: for t = 1, 2, ..., T do
  4:   data point (x_t, y_t) ∈ X × Y is drawn i.i.d. from P
  5:   support a_t ∈ A is selected using policy π_t
  6:   human makes the prediction ỹ_t based on x_t and a_t
  7:   human incurs the loss ℓ(y_t, ỹ_t)
  8:   update the buffer D_t ← D_{t-1} ∪ {(x_t, a_t, ℓ(y_t, ỹ_t))}
  9:   update the decision support policy:
         b̂_{A_i,t}(x; h) ← U_r(b̂_{A_i,t-1}(x; h), D_t), ∀ A_i ∈ A   (Step 1)
         π_{t+1}(x) ← U_π({b̂_{A_i,t}}_i)   (Step 2)
  10: end for
  11: Output: policy π^alg_λ ← π_{T+1}
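As a concrete illustration of the loop in Algorithm 1, here is a minimal runnable sketch. It is not the authors' implementation: the support forms in `SUPPORTS`, the function names (`learn_policy`, `human_predict`), treating each input `x` directly as a lookup key, a running-mean update standing in for U_r, and greedy min-estimated-error selection standing in for U_π are all simplifying assumptions.

```python
from collections import defaultdict

# Assumed forms of decision support (the paper's A); names are hypothetical.
SUPPORTS = ["no_support", "expert_consensus", "llm_output"]

def learn_policy(stream, human_predict, loss, T=100):
    """Online loop sketching Algorithm 1.

    stream        -- iterator yielding (x_t, y_t) pairs drawn from P
    human_predict -- callable (x, a) -> human prediction given support a
    loss          -- callable (y, y_tilde) -> incurred loss
    """
    # Initialization: every error estimate starts at 0.5, as in line 2.
    b_hat = {a: defaultdict(lambda: 0.5) for a in SUPPORTS}
    counts = {a: defaultdict(int) for a in SUPPORTS}
    buffer = []  # D_t

    for t in range(T):
        x, y = next(stream)                           # line 4
        # Stand-in for pi_t: pick the support with lowest estimated error.
        a = min(SUPPORTS, key=lambda s: b_hat[s][x])  # line 5
        y_tilde = human_predict(x, a)                 # line 6
        l = loss(y, y_tilde)                          # line 7
        buffer.append((x, a, l))                      # line 8
        # Stand-in for U_r (Step 1): incremental running mean of observed
        # losses; note the 0.5 prior is replaced on the first observation.
        counts[a][x] += 1
        b_hat[a][x] += (l - b_hat[a][x]) / counts[a][x]

    # Output policy pi_{T+1}: greedy over the final error estimates.
    return lambda x: min(SUPPORTS, key=lambda s: b_hat[s][x])
```

For example, if a simulated human only answers correctly when shown LLM output, the learned policy converges to selecting that support for the observed inputs.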
Open Source Code: Yes
  "We open-source Modiste as a tool to encourage the adoption of personalized decision support policies."
Open Datasets: Yes
  1. CIFAR-10 (Krizhevsky 2009), a 10-class image classification dataset.
  2. MMLU (Hendrycks et al. 2020), a multi-task text-based benchmark that tests for knowledge and problem-solving ability across 57 topics in both the humanities and STEM.
Dataset Splits: No
  The paper describes how the CIFAR-3A and MMLU-2A tasks were constructed, including the number of items shown to participants (100 images for CIFAR-3A, 60 questions for MMLU-2A) and how classes were corrupted, but it does not specify any training, validation, or test splits for machine learning models.
Hardware Specification: No
  The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies: No
  The paper mentions using "InstructGPT-3.5 (text-davinci-003)" for LLM support, but it does not list the software libraries, frameworks, or operating systems (with version numbers) required to replicate the experiments.
Experiment Setup: Yes
  "Via pilot studies, we found that 100 CIFAR images or 60 MMLU questions were a reasonable number of decisions to make within 20-40 minutes (a typical time limit for an online study), which we use throughout our experiments." ... Algorithm 1, line 2: "Initialization: data buffer D_0 = {}; human error estimates {b̂_{A_i,0}(x; h) = 0.5 : x ∈ X, A_i ∈ A}; initial policy π_1"