Large Language Model Confidence Estimation via Black-Box Access
Authors: Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-UL2, Llama-13b, Mistral-7b, and GPT-4 on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over 10% (on AUROC) in some cases. |
| Researcher Affiliation | Industry | Tejaswini Pedapati EMAIL IBM Research, Yorktown Heights, NY Amit Dhurandhar EMAIL IBM Research, Yorktown Heights, NY Soumya Ghosh EMAIL Merck Research Labs, Cambridge, MA Soham Dan EMAIL Microsoft, New York, NY Prasanna Sattigeri EMAIL IBM Research, Cambridge, MA |
| Pseudocode | Yes | Algorithm 1 Stochastic Decoding (SD): Algorithm to collect decoded features under various strategies. Algorithm 2 Paraphrasing (PP): Algorithm for paraphrasing via back-translation. Algorithm 3 Sentence Permutation (SP): Algorithm for random sentence permutation via NER sampling. Algorithm 4 Entity Frequency Amplification (EFA): Algorithm for amplifying entity frequency via repeated insertion. Algorithm 5 Stopword Removal (SR): Algorithm for removing stopwords from a document. Algorithm 6 Split Response Consistency (SRC): Algorithm for checking consistency over random splits of a generated response. Algorithm 7 Pseudocode for generating features and labels |
| Open Source Code | No | The paper mentions using open-source code for the baselines (Lin et al., 2024) and links to their GitHub, but it does not state that the authors released code for the method described in this paper. |
| Open Datasets | Yes | For question answering we elicited responses from these models on four datasets, namely, CoQA (Reddy et al., 2019), SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017) and Natural Questions (NQ) (Kwiatkowski et al., 2019). For summarization, we used the CNN/Daily Mail (See et al., 2017; Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets. |
| Dataset Splits | Yes | For our experiments, we use the validation splits for all the datasets as done previously (Lin et al., 2024). CoQA has 7983 datapoints, TriviaQA has 9960 datapoints, SQuAD has 10,600 datapoints and NQ has 7830 datapoints. For summarization, we used the CNN/Daily Mail (See et al., 2017; Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets, taking a subset of the validation splits of both datasets comprising 4000 datapoints. Following previous work (Lin et al., 2024), which used 1000 datapoints for hyperparameter tuning, we used 1000 datapoints to train our logistic regression classifier and the rest for evaluation. |
| Hardware Specification | Yes | We used internally hosted models to generate the responses. Thus, we used V100 GPUs only for the feature extraction step once the responses were generated. The logistic regression model was trained on an Intel Core CPU. |
| Software Dependencies | No | The paper mentions software like 'deberta-large-nli model', 'Helsinki-NLP (MT-Model)', 'Huggingface', 'NLTK', and 'spacy' but does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | We use zero-shot prompting for the datasets with context. For TriviaQA, Flan-UL2, Mistral-7B-Instruct-v0.2 and GPT-4 also worked well with zero-shot prompting, while Llama-2-13b-chat was performant with a two-shot prompt. For NQ, we used a five-shot prompt. Details of the prompts are provided in Appendix A. For each of the prompt perturbations specified above, we use five generations per perturbation for more robust evaluation. In particular, we use the ROUGE score to compute the similarity between the output and the ground truth; if the score exceeds a threshold of 0.3, the response is labeled 1, otherwise it is deemed incorrect and labeled 0, as in previous works (Lin et al., 2024). |
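The correctness-labeling rule in the Experiment Setup row can be sketched as follows. The paper does not specify which ROUGE variant is used, so this minimal sketch assumes a simple unigram ROUGE-1 F1; the `rouge1_f1` and `label_response` helpers are hypothetical names, not the authors' code:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: overlap of token multisets between the two texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def label_response(prediction: str, reference: str, threshold: float = 0.3) -> int:
    """Label 1 (correct) if the ROUGE score exceeds the 0.3 threshold, else 0."""
    return 1 if rouge1_f1(prediction, reference) > threshold else 0
```

These binary labels, paired with the perturbation-derived features, are what the logistic regression classifier described above is trained on.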