Calibrating Large Language Models with Sample Consistency
Authors: Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate eleven open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches in terms of calibration error. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency can potentially enhance model performance. |
| Researcher Affiliation | Collaboration | ¹University of Pennsylvania, ²ETH Zurich, ³Allen Institute for AI |
| Pseudocode | No | The paper describes the consistency measures (Agreement-based, Entropy-based, FSD-based) mathematically, but does not present them or any other method in a structured pseudocode or algorithm block format. Figure 2 shows examples of model outputs, including 'Python Interpreter' code snippets, but these are not pseudocode for the methodology. |
| Open Source Code | Yes | Code: https://github.com/veronica320/Calibrating-LLMs-with-Consistency |
| Open Datasets | Yes | We experiment with 9 datasets from 4 reasoning tasks following previous work (Wei et al. 2022; Lyu et al. 2023): Math Word Problems (MWPs): ASDiv (Miao, Liang, and Su 2020), GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), and SVAMP (Patel, Bhattamishra, and Goyal 2021). Multi-hop QA: StrategyQA (Geva et al. 2021), and two BIG-Bench datasets (Srivastava et al. 2022), Date Understanding and Sports Understanding. Planning: SayCan (Brohan et al. 2023). Relational inference: CLUTRR (Sinha et al. 2019). |
| Dataset Splits | No | The paper mentions using a 'development set' for threshold tuning and then evaluating on a 'test set', as well as an 'evaluation set' D = {(xj, yj)}. However, it does not provide specific details on how these sets were split from the original datasets, such as percentages, sample counts, or explicit splitting methodologies needed for reproduction. |
| Hardware Specification | No | The paper lists the Large Language Models used (LLaMA, Mistral, OLMo, Codex, GPT-3.5-turbo, GPT-4) and notes some context length restrictions for OLMo models, implying computational resources. However, it does not provide specific details about the hardware used for training or inference, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., programming languages, libraries, frameworks, or solvers) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We sample n = 40 candidate outputs with a temperature of T = 0.4 for each input, following Lyu et al. (2023), in Section 5, and analyze other values of n in Section 6. We use the same prompts as Lyu et al. (2023), with the same number of shots for each strategy (6 to 10, depending on the dataset); the only exception is the OLMo models, where we used 4-shot prompts due to their 2K-token context length restriction. |
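The Pseudocode row notes that the paper defines its consistency measures only mathematically. As a rough illustration of what agreement- and entropy-based consistency over sampled answers look like, the following is a minimal sketch, assuming sampled final answers have been extracted into comparable strings; the function names and normalization choices are our own, not the authors' implementation.

```python
import math
from collections import Counter


def agreement_consistency(answers):
    """Return (majority answer, fraction of samples agreeing with it)."""
    counts = Counter(answers)
    majority, majority_count = counts.most_common(1)[0]
    return majority, majority_count / len(answers)


def entropy_consistency(answers):
    """1 minus the normalized Shannon entropy of the answer distribution:
    1.0 when all samples agree, 0.0 when mass is spread evenly."""
    counts = Counter(answers)
    if len(counts) == 1:
        return 1.0
    n = len(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


# Toy usage with n = 40 samples, as in the Experiment Setup row.
samples = ["42"] * 30 + ["41"] * 6 + ["40"] * 4
answer, confidence = agreement_consistency(samples)  # "42", 0.75
```

The agreement score doubles as a confidence estimate for the majority-vote prediction, which is how consistency-based calibration can be compared against post-hoc methods on calibration error.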