Bayesian Inference for Correlated Human Experts and Classifiers
Authors: Markelle Kelly, Alex James Boyd, Sam Showalter, Mark Steyvers, Padhraic Smyth
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions relative to baselines in the cost of querying human experts while maintaining high prediction accuracy. We evaluate our approach on four real-world classification tasks with corresponding classifier and human predictions, described in detail in Section 7.1. Compared to two baseline methods (described in Section 7.2), our approach consistently achieves 0% error using fewer queries on average than the baselines, as shown in Section 7.3. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, University of California, Irvine, USA 2GE HealthCare, USA 3Stripe, USA 4Department of Cognitive Sciences, University of California, Irvine, USA. Correspondence to: Markelle Kelly <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Estimate p(y*) given observed predictions y_O, z_M and the set S of samples from the posterior for μ, Σ, and τ. Algorithm 2: Compute the expected entropy of y* after observing y_j. Algorithm 3: Estimate p(y*) given observed predictions y_O, z_M and the set S of samples from the posterior for μ, Σ, and τ. |
| Open Source Code | Yes | 1All code and datasets used in this paper are available at https://github.com/markellekelly/consensus. |
| Open Datasets | Yes | 1All code and datasets used in this paper are available at https://github.com/markellekelly/consensus. Our experiments use two datasets of medical images, annotated by identifiable experts: Chest X-Ray (Nabulsi et al., 2021) and Chaoyang (Zhu et al., 2021). In addition, we include results for two datasets with simulated experts: CIFAR-10H (Peterson et al., 2019) and ImageNet-16H (Steyvers et al., 2022). |
| Dataset Splits | No | We use sets of 250 examples to compute the error rate and the average querying cost for all three methods. We run each experiment with 12 different sets of examples (for datasets with fewer than 3,000 instances, we create different sets of 250 via shuffling, since each of the methods evaluated is sensitive to the order in which data points are seen). While this describes how data is used for evaluation in an online learning setting, it does not provide traditional training/validation/test splits with clear proportions or absolute counts for model training, nor a fixed random seed for shuffling. |
| Hardware Specification | Yes | Experiments were run on an NVIDIA GeForce 2080 Ti GPU over the course of several days. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that were used for implementation or experimentation. |
| Experiment Setup | Yes | The following hyperparameter values were used for all experiments: Parameters were updated after every example for the first 20 examples, then every 10 examples until 100 total examples were reached, at which point the update rate was further reduced to every 50 examples. For all inference tasks (for example, to estimate the parameters at each time step t), we used three independent Markov chains, each comprising 1,500 warm-up iterations followed by 2,000 post-warm-up (posterior) samples. |
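The pseudocode row describes estimating p(y*) from posterior samples of (μ, Σ, τ). A minimal sketch of the Monte Carlo averaging step is below; the function name, the shape of the input, and the log-space normalization are assumptions for illustration, not the paper's implementation (which is in the linked repository).

```python
import numpy as np

def estimate_posterior_predictive(loglik_per_sample):
    """Monte Carlo estimate of p(y* | observed predictions).

    loglik_per_sample: array of shape (S, K), where S is the number of
    posterior samples of (mu, Sigma, tau) and entry [s, k] is the
    (unnormalized) log-probability of class k for y* under sample s,
    conditioned on the observed predictions y_O, z_M. The estimate
    normalizes each sample's class probabilities, then averages over
    the S posterior samples.
    """
    # Stable softmax per posterior sample (subtract row max before exp)
    probs = np.exp(loglik_per_sample - loglik_per_sample.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Average the per-sample class distributions and renormalize
    p = probs.mean(axis=0)
    return p / p.sum()
```

The same averaging-over-posterior-samples pattern underlies Algorithm 2's expected-entropy computation, with the entropy of the averaged distribution evaluated for each candidate observation y_j.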
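The dataset-splits row notes that evaluation uses 12 different sets of 250 examples, created by shuffling for datasets with fewer than 3,000 instances, with no fixed seed reported. A sketch of that construction, with a seed added for reproducibility; the function name and seed are assumptions, not details from the paper:

```python
import random

def make_eval_sets(n_instances, set_size=250, n_sets=12, seed=0):
    """Build n_sets evaluation sets of set_size instance indices.

    Mirrors the paper's procedure for small datasets: each set is the
    first set_size indices of an independent shuffle, so sets may
    overlap but differ in membership and ordering (the methods are
    order-sensitive). The seed is added here for reproducibility; the
    paper does not report one.
    """
    rng = random.Random(seed)
    indices = list(range(n_instances))
    sets = []
    for _ in range(n_sets):
        rng.shuffle(indices)
        sets.append(indices[:set_size])
    return sets
```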
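The experiment-setup row's parameter-update schedule (every example for the first 20, then every 10 examples until 100, then every 50) can be expressed as a small predicate. This is a sketch of the schedule as stated, assuming 1-indexed example counts; the function name is invented for illustration:

```python
def should_update(t):
    """Return True if parameters should be re-fit after example t (1-indexed).

    Schedule from the paper: update after every example for the first 20
    examples, then every 10 examples until 100 total examples, then
    every 50 examples thereafter.
    """
    if t <= 20:
        return True
    if t <= 100:
        return t % 10 == 0
    return t % 50 == 0
```

Each update re-runs inference with three independent Markov chains of 1,500 warm-up and 2,000 posterior samples, so thinning the update rate as t grows keeps the online procedure tractable.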