Calibrating Large Language Models with Sample Consistency
Authors: Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate eleven open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches in terms of calibration error. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency can potentially enhance model performance. |
| Researcher Affiliation | Collaboration | ¹University of Pennsylvania, ²ETH Zurich, ³Allen Institute for AI |
| Pseudocode | No | The paper describes the consistency measures (Agreement-based, Entropy-based, FSD-based) mathematically, but does not present them or any other method in a structured pseudocode or algorithm block format. Figure 2 shows examples of model outputs, including 'Python Interpreter' code snippets, but these are not pseudocode for the methodology. |
| Open Source Code | Yes | Code: https://github.com/veronica320/Calibrating-LLMs-with-Consistency |
| Open Datasets | Yes | We experiment with 9 datasets from 4 reasoning tasks following previous work (Wei et al. 2022; Lyu et al. 2023): Math Word Problems (MWPs): ASDiv (Miao, Liang, and Su 2020), GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), and SVAMP (Patel, Bhattamishra, and Goyal 2021). Multi-hop QA: StrategyQA (Geva et al. 2021), and two BIG-Bench datasets (Srivastava et al. 2022), Date Understanding and Sports Understanding. Planning: SayCan (Brohan et al. 2023). Relational inference: CLUTRR (Sinha et al. 2019). |
| Dataset Splits | No | The paper mentions using a 'development set' for threshold tuning and then evaluating on a 'test set', as well as an 'evaluation set' D = {(xj, yj)}. However, it does not provide specific details on how these sets were split from the original datasets, such as percentages, sample counts, or explicit splitting methodologies needed for reproduction. |
| Hardware Specification | No | The paper lists the Large Language Models used (LLaMA, Mistral, OLMo, Codex, GPT-3.5-turbo, GPT-4) and notes some context length restrictions for OLMo models, implying computational resources. However, it does not provide specific details about the hardware used for training or inference, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., programming languages, libraries, frameworks, or solvers) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We sample n = 40 candidate outputs with a temperature of T = 0.4 for each input, following Lyu et al. (2023), in Section 5, and analyze other values of n in Section 6. We use the same prompts as Lyu et al. (2023), with the same number of shots for each strategy (6 to 10, depending on the dataset); the only exception is the OLMo models, where we used 4-shot prompts due to their 2K-token context length restriction. |
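The Pseudocode row notes that the paper defines its consistency measures only mathematically. As a rough illustration of what agreement- and entropy-based consistency over sampled answers look like, the following is a minimal sketch, assuming sampled final answers have been extracted into comparable strings; the function names and normalization choices are our own, not the authors' implementation.

```python
import math
from collections import Counter


def agreement_consistency(answers):
    """Return (majority answer, fraction of samples agreeing with it)."""
    counts = Counter(answers)
    majority, majority_count = counts.most_common(1)[0]
    return majority, majority_count / len(answers)


def entropy_consistency(answers):
    """1 minus the normalized Shannon entropy of the answer distribution:
    1.0 when all samples agree, 0.0 when mass is spread evenly."""
    counts = Counter(answers)
    if len(counts) == 1:
        return 1.0
    n = len(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


# Toy usage with n = 40 samples, as in the Experiment Setup row.
samples = ["42"] * 30 + ["41"] * 6 + ["40"] * 4
answer, confidence = agreement_consistency(samples)  # "42", 0.75
```

The agreement score doubles as a confidence estimate for the majority-vote prediction, which is how consistency-based calibration can be compared against post-hoc methods on calibration error.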