Bayesian Inference for Correlated Human Experts and Classifiers
Authors: Markelle Kelly, Alex James Boyd, Sam Showalter, Mark Steyvers, Padhraic Smyth
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions relative to baselines in the cost of querying human experts while maintaining high prediction accuracy. We evaluate our approach on four real-world classification tasks with corresponding classifier and human predictions, described in detail in Section 7.1. Compared to two baseline methods (described in Section 7.2), our approach consistently achieves 0% error using fewer queries on average than the baselines, as shown in Section 7.3. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, University of California, Irvine, USA 2GE HealthCare, USA 3Stripe, USA 4Department of Cognitive Sciences, University of California, Irvine, USA. Correspondence to: Markelle Kelly <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Estimate p(y*) given observed predictions y_O, z_M and the set S of samples from the posterior for μ, Σ, and τ. Algorithm 2: Compute the expected entropy of y* after observing y_j. Algorithm 3: Estimate p(y*) given observed predictions y_O, z_M and the set S of samples from the posterior for μ, Σ, and τ. |
| Open Source Code | Yes | 1All code and datasets used in this paper are available at https://github.com/markellekelly/consensus. |
| Open Datasets | Yes | 1All code and datasets used in this paper are available at https://github.com/markellekelly/consensus. Our experiments use two datasets of medical images, annotated by identifiable experts: Chest X-Ray (Nabulsi et al., 2021) and Chaoyang (Zhu et al., 2021). In addition, we include results for two datasets with simulated experts: CIFAR-10H (Peterson et al., 2019) and ImageNet-16H (Steyvers et al., 2022). |
| Dataset Splits | No | We use sets of 250 examples to compute the error rate and the average querying cost for all three methods. We run each experiment with 12 different sets of examples (for datasets with fewer than 3,000 instances, we create different sets of 250 via shuffling, since each of the methods evaluated is sensitive to the order in which data points are seen). While this describes how data is used for evaluation in an online learning setting, it does not provide traditional training/validation/test splits with clear proportions or absolute counts for model training, nor a fixed random seed for shuffling. |
| Hardware Specification | Yes | Experiments were run on an NVIDIA GeForce 2080 Ti GPU over the course of several days. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that were used for implementation or experimentation. |
| Experiment Setup | Yes | The following hyperparameter values were used for all experiments: Parameters were updated after every example for the first 20 examples, then every 10 examples until 100 total examples were reached, at which point the update rate was further reduced to every 50 examples. For all inference tasks (for example, to estimate the parameters at each time step t), we used three independent Markov chains, each comprising 1,500 warm-up iterations followed by 2,000 post-warm-up (posterior) samples. |
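The pseudocode row describes estimating p(y*) from posterior samples of (μ, Σ, τ). A minimal sketch of the Monte Carlo averaging step is below; the function name, the shape of the input, and the log-space normalization are assumptions for illustration, not the paper's implementation (which is in the linked repository).

```python
import numpy as np

def estimate_posterior_predictive(loglik_per_sample):
    """Monte Carlo estimate of p(y* | observed predictions).

    loglik_per_sample: array of shape (S, K), where S is the number of
    posterior samples of (mu, Sigma, tau) and entry [s, k] is the
    (unnormalized) log-probability of class k for y* under sample s,
    conditioned on the observed predictions y_O, z_M. The estimate
    normalizes each sample's class probabilities, then averages over
    the S posterior samples.
    """
    # Stable softmax per posterior sample (subtract row max before exp)
    probs = np.exp(loglik_per_sample - loglik_per_sample.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Average the per-sample class distributions and renormalize
    p = probs.mean(axis=0)
    return p / p.sum()
```

The same averaging-over-posterior-samples pattern underlies Algorithm 2's expected-entropy computation, with the entropy of the averaged distribution evaluated for each candidate observation y_j.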
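The dataset-splits row notes that evaluation uses 12 different sets of 250 examples, created by shuffling for datasets with fewer than 3,000 instances, with no fixed seed reported. A sketch of that construction, with a seed added for reproducibility; the function name and seed are assumptions, not details from the paper:

```python
import random

def make_eval_sets(n_instances, set_size=250, n_sets=12, seed=0):
    """Build n_sets evaluation sets of set_size instance indices.

    Mirrors the paper's procedure for small datasets: each set is the
    first set_size indices of an independent shuffle, so sets may
    overlap but differ in membership and ordering (the methods are
    order-sensitive). The seed is added here for reproducibility; the
    paper does not report one.
    """
    rng = random.Random(seed)
    indices = list(range(n_instances))
    sets = []
    for _ in range(n_sets):
        rng.shuffle(indices)
        sets.append(indices[:set_size])
    return sets
```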
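The experiment-setup row's parameter-update schedule (every example for the first 20, then every 10 examples until 100, then every 50) can be expressed as a small predicate. This is a sketch of the schedule as stated, assuming 1-indexed example counts; the function name is invented for illustration:

```python
def should_update(t):
    """Return True if parameters should be re-fit after example t (1-indexed).

    Schedule from the paper: update after every example for the first 20
    examples, then every 10 examples until 100 total examples, then
    every 50 examples thereafter.
    """
    if t <= 20:
        return True
    if t <= 100:
        return t % 10 == 0
    return t % 50 == 0
```

Each update re-runs inference with three independent Markov chains of 1,500 warm-up and 2,000 posterior samples, so thinning the update rate as t grows keeps the online procedure tractable.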