Calibrating Expressions of Certainty

Authors: Peiqi Wang, Barbara Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William Wells III, Tina Kapur, Polina Golland

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a novel approach to calibrating linguistic expressions of certainty... Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration. We demonstrate our approach by analyzing the calibration of radiologists writing clinical reports, accounting for variables such as the pathology and radiologist’s identity. Moreover, we show how we can guide radiologists to become better calibrated in their use of certainty phrases. In addition, we showcase the calibration of language models and demonstrate the effectiveness of our calibration method to post-hoc improve model calibration.
Researcher Affiliation | Collaboration | 1 CSAIL, MIT; 2 Harvard Medical School; 3 MIT-IBM Watson AI Lab
Pseudocode | No | The paper describes methods and equations but does not contain a dedicated section, figure, or block explicitly labeled as "Pseudocode" or "Algorithm" with structured steps.
Open Source Code | Yes | Correspondence to EMAIL; code available on GitHub.
Open Datasets | Yes | (1) SciQ (Welbl et al., 2017) contains crowd-sourced science exam questions, and (2) TruthfulQA (Lin et al., 2022c) contains questions designed to test language models' tendency to mimic human misconceptions. We derive confidence distributions {u1, ..., uK} from survey data on radiologists' interpretation of diagnostic certainty phrases commonly used in dictating radiology reports (Shinagare et al., 2023). For the LLM experiments, we reference a social media survey of 123 respondents (mostly undergraduate students) regarding their perception of probability-related terms (Fagen-Ulmschneider, 2023).
Dataset Splits | Yes | To mitigate distribution shifts in the use of certainty phrases, we use stratified sampling and split each dataset equally into calibration and test sets.
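The 50/50 stratified split quoted above can be sketched with a minimal standard-library sampler. This is an illustrative reconstruction, not the paper's code; the function and variable names (`stratified_even_split`, `items`, `labels`) are hypothetical:

```python
import random
from collections import defaultdict

def stratified_even_split(items, labels, seed=0):
    """Split items 50/50 into calibration and test sets while
    preserving the per-label (e.g. per-certainty-phrase) proportions.
    Illustrative sketch; names are not from the paper."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    calibration, test = [], []
    for group in by_label.values():
        rng.shuffle(group)          # randomize within each stratum
        half = len(group) // 2      # even split per stratum
        calibration.extend(group[:half])
        test.extend(group[half:])
    return calibration, test
```

Splitting within each stratum (rather than globally) is what keeps the distribution of certainty phrases matched across the calibration and test sets.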
Hardware Specification | Yes | We found that the Llama-3-8B base model (Dubey et al., 2024) provides the best performance that fits under our computing resource of one 24GB-memory A5000 GPU.
Software Dependencies | No | The paper mentions using "POT’s solver implementation (Flamary et al., 2021)" and "adaptive quadrature algorithms through scipy.integrate" but does not specify version numbers for these or other software libraries/frameworks.
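To illustrate what these two dependencies are used for, here is a minimal sketch. The sample arrays `u` and `v` are synthetic, not the paper's survey data, and the closed-form sorted-sample computation stands in for POT's general optimal-transport solver (which in 1-D reduces to this form):

```python
import numpy as np
from scipy import integrate

# Hypothetical confidence samples for two certainty phrases
# (illustrative data, not from the paper's surveys).
u = np.array([0.55, 0.60, 0.65, 0.70])
v = np.array([0.75, 0.80, 0.85, 0.90])

# For 1-D empirical distributions of equal size, the Wasserstein-1
# distance an optimal-transport solver (e.g. POT's) returns reduces to
# the mean absolute difference of the sorted samples.
w1 = np.mean(np.abs(np.sort(u) - np.sort(v)))

# Adaptive quadrature via scipy.integrate, here integrating a simple
# density over [0, 1] as a stand-in for the paper's integrands.
area, err = integrate.quad(lambda x: 3.0 * x**2, 0.0, 1.0)
```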
Experiment Setup | Yes | Specifically, we partition [0, 1] into 100 equal-width bins and compute the estimators in Equation (6) for each bin... We use bootstrap resampling with 100 samples to calculate the mean and 95% confidence interval for these estimators. We set ϵ = 1e-3 to minimize mass splitting and simplify interpretation of the calibration map. By default, we set τ2 = 1e-3 arbitrarily due to its minimal impact on performance given such a small ϵ. Based on this analysis, we choose K = 12 for subsequent experiments.
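The binning-plus-bootstrap procedure quoted above can be sketched as follows. The confidence/correctness data here is synthetic, and `binned_accuracy` is a hypothetical stand-in for the paper's Equation (6) estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative confidence/correctness pairs (synthetic, not the paper's data).
n = 2000
conf = rng.uniform(0.0, 1.0, size=n)
correct = (rng.uniform(0.0, 1.0, size=n) < conf).astype(float)

# Partition [0, 1] into 100 equal-width bins, as in the setup above.
n_bins = 100
bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)

def binned_accuracy(correct, bin_idx, n_bins):
    """Per-bin mean correctness; NaN marks empty bins.
    Hypothetical stand-in for the per-bin estimators."""
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            acc[b] = correct[mask].mean()
    return acc

# Bootstrap: 100 resamples, then the mean and a 95% percentile
# interval of the per-bin estimator.
boot = np.array([
    binned_accuracy(correct[idx], bin_idx[idx], n_bins)
    for idx in (rng.integers(0, n, size=n) for _ in range(100))
])
mean = np.nanmean(boot, axis=0)
lo, hi = np.nanpercentile(boot, [2.5, 97.5], axis=0)
```

Resampling (confidence, correctness) pairs jointly before re-binning is what gives the per-bin confidence intervals; percentile intervals avoid any normality assumption on the estimator.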