Calibrating Expressions of Certainty

Authors: Peiqi Wang, Barbara Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William Wells III, Tina Kapur, Polina Golland

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a novel approach to calibrating linguistic expressions of certainty... Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration. We demonstrate our approach by analyzing the calibration of radiologists writing clinical reports, accounting for variables such as the pathology and radiologist’s identity. Moreover, we show how we can guide radiologists to become better calibrated in their use of certainty phrases. In addition, we showcase the calibration of language models and demonstrate the effectiveness of our calibration method to post-hoc improve model calibration.
Researcher Affiliation | Collaboration | 1 CSAIL, MIT; 2 Harvard Medical School; 3 MIT-IBM Watson AI Lab
Pseudocode | No | The paper describes methods and equations but does not contain a dedicated section, figure, or block explicitly labeled as "Pseudocode" or "Algorithm" with structured steps.
Open Source Code | Yes | Correspondence to EMAIL; code available on GitHub.
Open Datasets | Yes | (1) SciQ (Welbl et al., 2017) contains crowd-sourced science exam questions, and (2) TruthfulQA (Lin et al., 2022c) contains questions designed to test language models' tendency to mimic human misconceptions. We derive confidence distributions {u1, ..., uK} from survey data on radiologists' interpretation of diagnostic certainty phrases commonly used in dictating radiology reports (Shinagare et al., 2023). For the LLM experiments, we reference a social media survey of 123 respondents (mostly undergraduate students) regarding their perception of probability-related terms (Fagen-Ulmschneider, 2023).
Dataset Splits | Yes | To mitigate distribution shifts in the use of certainty phrases, we use stratified sampling and split each dataset equally into calibration and test sets.
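The 50/50 stratified split quoted above can be sketched with a minimal standard-library sampler. This is an illustrative reconstruction, not the paper's code; the function and variable names (`stratified_even_split`, `items`, `labels`) are hypothetical:

```python
import random
from collections import defaultdict

def stratified_even_split(items, labels, seed=0):
    """Split items 50/50 into calibration and test sets while
    preserving the per-label (e.g. per-certainty-phrase) proportions.
    Illustrative sketch; names are not from the paper."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    calibration, test = [], []
    for group in by_label.values():
        rng.shuffle(group)          # randomize within each stratum
        half = len(group) // 2      # even split per stratum
        calibration.extend(group[:half])
        test.extend(group[half:])
    return calibration, test
```

Splitting within each stratum (rather than globally) is what keeps the distribution of certainty phrases matched across the calibration and test sets.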
Hardware Specification | Yes | We found that the Llama-3-8B base model (Dubey et al., 2024) provides the best performance that fits under our computing resource of one 24GB-memory A5000 GPU.
Software Dependencies | No | The paper mentions using "POT’s solver implementation (Flamary et al., 2021)" and "adaptive quadrature algorithms through scipy.integrate" but does not specify version numbers for these or other software libraries/frameworks.
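To illustrate what these two dependencies are used for, here is a minimal sketch. The sample arrays `u` and `v` are synthetic, not the paper's survey data, and the closed-form sorted-sample computation stands in for POT's general optimal-transport solver (which in 1-D reduces to this form):

```python
import numpy as np
from scipy import integrate

# Hypothetical confidence samples for two certainty phrases
# (illustrative data, not from the paper's surveys).
u = np.array([0.55, 0.60, 0.65, 0.70])
v = np.array([0.75, 0.80, 0.85, 0.90])

# For 1-D empirical distributions of equal size, the Wasserstein-1
# distance an optimal-transport solver (e.g. POT's) returns reduces to
# the mean absolute difference of the sorted samples.
w1 = np.mean(np.abs(np.sort(u) - np.sort(v)))

# Adaptive quadrature via scipy.integrate, here integrating a simple
# density over [0, 1] as a stand-in for the paper's integrands.
area, err = integrate.quad(lambda x: 3.0 * x**2, 0.0, 1.0)
```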
Experiment Setup | Yes | Specifically, we partition [0, 1] into 100 equal-width bins and compute the estimators in Equation (6) for each bin... We use bootstrap resampling with 100 samples to calculate the mean and 95% confidence interval for these estimators. We set ϵ = 1e-3 to minimize mass splitting and simplify interpretation of the calibration map. By default, we set τ2 = 1e-3 arbitrarily due to its minimal impact on performance given such a small ϵ. Based on this analysis, we choose K = 12 for subsequent experiments.
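The binning-plus-bootstrap procedure quoted above can be sketched as follows. The confidence/correctness data here is synthetic, and `binned_accuracy` is a hypothetical stand-in for the paper's Equation (6) estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative confidence/correctness pairs (synthetic, not the paper's data).
n = 2000
conf = rng.uniform(0.0, 1.0, size=n)
correct = (rng.uniform(0.0, 1.0, size=n) < conf).astype(float)

# Partition [0, 1] into 100 equal-width bins, as in the setup above.
n_bins = 100
bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)

def binned_accuracy(correct, bin_idx, n_bins):
    """Per-bin mean correctness; NaN marks empty bins.
    Hypothetical stand-in for the per-bin estimators."""
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            acc[b] = correct[mask].mean()
    return acc

# Bootstrap: 100 resamples, then the mean and a 95% percentile
# interval of the per-bin estimator.
boot = np.array([
    binned_accuracy(correct[idx], bin_idx[idx], n_bins)
    for idx in (rng.integers(0, n, size=n) for _ in range(100))
])
mean = np.nanmean(boot, axis=0)
lo, hi = np.nanpercentile(boot, [2.5, 97.5], axis=0)
```

Resampling (confidence, correctness) pairs jointly before re-binning is what gives the per-bin confidence intervals; percentile intervals avoid any normality assumption on the estimator.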