Calibrating Expressions of Certainty
Authors: Peiqi Wang, Barbara Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William Wells III, Tina Kapur, Polina Golland
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a novel approach to calibrating linguistic expressions of certainty... Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration. We demonstrate our approach by analyzing the calibration of radiologists writing clinical reports, accounting for variables such as the pathology and radiologist’s identity. Moreover, we show how we can guide radiologists to become better calibrated in their use of certainty phrases. In addition, we showcase the calibration of language models and demonstrate the effectiveness of our calibration method to post-hoc improve model calibration. |
| Researcher Affiliation | Collaboration | 1 CSAIL, MIT 2 Harvard Medical School 3 MIT-IBM Watson AI Lab |
| Pseudocode | No | The paper describes methods and equations but does not contain a dedicated section, figure, or block explicitly labeled as "Pseudocode" or "Algorithm" with structured steps. |
| Open Source Code | Yes | Correspondence to EMAIL; Code available on GitHub |
| Open Datasets | Yes | SciQ (Welbl et al., 2017) contains crowd-sourced science exam questions, and (2) TruthfulQA (Lin et al., 2022c) contains questions designed to test language models' tendency to mimic human misconceptions. We derive confidence distributions {u1, ..., uK} from survey data on radiologists' interpretation of diagnostic certainty phrases commonly used in dictating radiology reports (Shinagare et al., 2023). For the LLM experiments, we reference a social media survey of 123 respondents (mostly undergraduate students) regarding their perception of probability-related terms (Fagen-Ulmschneider, 2023). |
| Dataset Splits | Yes | To mitigate distribution shifts in the use of certainty phrases, we use stratified sampling and split each dataset evenly into calibration and test sets. |
| Hardware Specification | Yes | We found that the Llama-3-8B base model (Dubey & et al., 2024) provides the best performance that fits within our computing resource of a single 24GB-memory A5000 GPU. |
| Software Dependencies | No | The paper mentions using "POT’s solver implementation (Flamary et al., 2021)" and "adaptive quadrature algorithms through scipy.integrate" but does not specify version numbers for these or other software libraries/frameworks. |
| Experiment Setup | Yes | Specifically, we partition [0, 1] into 100 equal-width bins and compute the estimators in Equation (6) for each bin... We use bootstrap resampling with 100 samples to calculate the mean and 95% confidence interval for these estimators. We set ϵ = 1e-3 to minimize mass splitting and simplify interpretation of the calibration map. By default, we set τ² = 1e-3 arbitrarily due to its minimal impact on performance given such a small ϵ. Based on this analysis, we choose K = 12 for subsequent experiments. |
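The evaluation recipe quoted in the table — a stratified calibration/test split, 100 equal-width bins over [0, 1], and bootstrap resampling (100 samples) for per-bin means and 95% confidence intervals — can be sketched as below. This is a minimal illustration of those generic steps, not the authors' released code; all function names and the stratification variable are hypothetical.

```python
import numpy as np

def stratified_split(strata, seed=0):
    """Split sample indices evenly into calibration/test sets, stratified
    by certainty phrase (`strata` maps each sample to a phrase index)."""
    rng = np.random.default_rng(seed)
    cal_idx, test_idx = [], []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        rng.shuffle(idx)
        half = len(idx) // 2
        cal_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    return np.array(cal_idx), np.array(test_idx)

def binned_accuracy(confidences, labels, n_bins=100):
    """Empirical accuracy in each of n_bins equal-width bins on [0, 1].
    Bins with no samples are left as NaN."""
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            acc[b] = labels[mask].mean()
    return acc

def bootstrap_ci(confidences, labels, n_boot=100, n_bins=100, seed=0):
    """Bootstrap-resample the data n_boot times and return the per-bin
    mean and a 95% percentile confidence interval of the estimator."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = np.stack([
        binned_accuracy(confidences[idx], labels[idx], n_bins)
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ])
    mean = np.nanmean(stats, axis=0)
    lo, hi = np.nanpercentile(stats, [2.5, 97.5], axis=0)
    return mean, lo, hi
```

With 100 bins and a finite dataset, many bins are empty in a given resample, hence the NaN-aware aggregation; the percentile (rather than normal-approximation) interval is a common default choice for small bootstrap sample counts such as 100.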