Graph-based Confidence Calibration for Large Language Models
Authors: Yukun Li, Sijia Wang, Lifu Huang, Liping Liu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that this method has strong calibration performance on various benchmark datasets and generalizes well to out-of-domain cases. We evaluate the performance of the proposed method with an extensive empirical study that includes four datasets from different question domains. |
| Researcher Affiliation | Academia | Yukun Li EMAIL Department of Computer Science Tufts University Sijia Wang EMAIL Department of Computer Science Virginia Tech Lifu Huang EMAIL Department of Computer Science Virginia Tech, University of California, Davis Li-Ping Liu EMAIL Department of Computer Science Tufts University |
| Pseudocode | No | The paper describes the proposed method in text and with a figure (Figure 1), but it does not include a dedicated section for pseudocode or an algorithm block. |
| Open Source Code | No | The paper states, "We performed all the baseline experiments utilizing the open-source codebase and used the default hyperparameters." This refers to the code for the baseline methods, not the authors' own implementation of the proposed method. There is no explicit statement or link providing access to the source code for the methodology described in this paper. |
| Open Datasets | Yes | We conduct experiments on four datasets: (1) CoQA (Reddy et al., 2019), an open-book conversational question answering task; (2) TriviaQA (Joshi et al., 2017), a commonsense QA task; (3) TruthfulQA (Lin et al., 2022a), a comparably more challenging dataset for factual QA tasks; and (4) HotpotQA, a question answering dataset that requires models to find and combine information from multiple passages to answer complex questions. |
| Dataset Splits | No | We repeat the experiments 10 times, each with a different train/validation split, and test the performance on the test set. This statement indicates that splits are used but does not provide specific percentages, absolute sample counts, or a detailed methodology for these splits, which are crucial for reproducibility. |
| Hardware Specification | Yes | The experiments are conducted on NVIDIA A100 GPUs with 80GB of memory. |
| Software Dependencies | No | The paper mentions several tools and models like Sentence-BERT, K-means clustering, Adam optimizer, Llama3-8B, and Vicuna-7b-v1.5. However, it does not provide specific version numbers for any general ancillary software, libraries, or programming languages (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | To ensure our model can capture complex and abstract features at each layer, our model comprises three GNN layers, with embedding dimensions of 256, 512, and 1024 for the first, second, and third layers, respectively. The initial learning rate was set to 10^-4. If the validation loss did not show improvement over ten consecutive epochs, the learning rate was reduced by a factor of 0.9. The optimization was performed using the Adam optimizer, configured with hyperparameters β1 = 0.9 and β2 = 0.98. The batch size was 16. In Appendix A, further details are provided: The embedding dimension was 256, 512, and 1024 for each layer. For the training process, we used the binary cross-entropy loss with a decaying learning rate that reduced the learning rate by a factor of 0.9 if the validation loss did not improve for 10 epochs (with an initial learning rate of 10^-4 and a minimum learning rate of 10^-7). The optimizer was Adam with β1 = 0.9 and β2 = 0.98. The batch size was 16 or 32. For the rephrased prompts, we set k = 3 and n = 30, so for each rephrased question, we sampled ten answers. While calculating the ECE, we divide the confidence into B = 10 bins. |
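The setup above reports Expected Calibration Error (ECE) with B = 10 confidence bins. A minimal sketch of that metric under standard equal-width binning is shown below; the function name and the half-open binning convention are assumptions for illustration, not details taken from the paper.

```python
def expected_calibration_error(confidences, correctness, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted confidence in [0, 1] for each answer.
    correctness: 1 if the corresponding answer was correct, else 0.
    Returns the weighted average |confidence - accuracy| gap per bin.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each sample to exactly one half-open bin (lo, hi];
        # the first bin additionally includes confidence == 0.
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == lo)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correctness[i] for i in idx) / len(idx)
        # Weight each bin's gap by its share of the samples.
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece
```

For example, four answers with confidences [0.95, 0.95, 0.55, 0.55] and correctness [1, 0, 1, 0] fall into two bins with gaps 0.45 and 0.05, giving an ECE of 0.25.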