Large Language Model Confidence Estimation via Black-Box Access
Authors: Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-UL2, Llama-13b, Mistral-7b, and GPT-4 on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over 10% (on AUROC) in some cases. |
| Researcher Affiliation | Industry | Tejaswini Pedapati EMAIL IBM Research, Yorktown Heights, NY Amit Dhurandhar EMAIL IBM Research, Yorktown Heights, NY Soumya Ghosh EMAIL Merck Research Labs, Cambridge, MA Soham Dan EMAIL Microsoft, New York, NY Prasanna Sattigeri EMAIL IBM Research, Cambridge, MA |
| Pseudocode | Yes | Algorithm 1 Stochastic Decoding (SD): Algorithm to collect decoded features under various strategies. Algorithm 2 Paraphrasing (PP): Algorithm for paraphrasing via back-translation. Algorithm 3 Sentence Permutation (SP): Algorithm for random sentence permutation via NER sampling. Algorithm 4 Entity Frequency Amplification (EFA): Algorithm for amplifying entity frequency via repeated insertion. Algorithm 5 Stopword Removal (SR): Algorithm for removing stopwords from a document. Algorithm 6 Split Response Consistency (SRC): Algorithm for checking consistency over random splits of a generated response. Algorithm 7 Pseudocode for generating features and labels |
| Open Source Code | No | The paper mentions using open-source code for the baselines (Lin et al., 2024) and links to their GitHub, but it does not state that the authors released code for the method described in this paper. |
| Open Datasets | Yes | For question answering we elicited responses from these models on four datasets, namely, CoQA (Reddy et al., 2019), SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017) and Natural Questions (NQ) (Kwiatkowski et al., 2019). For summarization, we used the CNN/Daily Mail (See et al., 2017; Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets. |
| Dataset Splits | Yes | For our experiments, we use the validation splits for all the datasets as done previously (Lin et al., 2024). CoQA has 7983 datapoints, TriviaQA has 9960 datapoints, SQuAD has 10,600 datapoints and NQ has 7830 datapoints. For summarization, we used the CNN/Daily Mail (See et al., 2017; Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets, taking a subset of the validation splits of both datasets comprising 4000 datapoints. Following previous work (Lin et al., 2024), which used 1000 datapoints for hyperparameter tuning, we used 1000 datapoints to train our logistic regression classifier and the rest for evaluation. |
| Hardware Specification | Yes | We used internally hosted models to generate the responses. Thus, we used V100 GPUs only for the feature extraction step once the responses were generated. The logistic regression model was trained on an Intel Core CPU. |
| Software Dependencies | No | The paper mentions software like 'deberta-large-nli model', 'Helsinki-NLP (MT-Model)', 'Huggingface', 'NLTK', and 'spacy' but does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | We use zero-shot prompting for the datasets with context. For TriviaQA, Flan-UL2, Mistral-7B-Instruct-v0.2 and GPT-4 also worked well with zero-shot prompting, while Llama-2-13b-chat was performant with a two-shot prompt. For NQ, we used a five-shot prompt. Details of the prompts are provided in Appendix A. For each of the prompt perturbations specified above, we use five generations per perturbation for more robust evaluation. In particular, we use the ROUGE score to compute the similarity between the output and the ground truth; if the score exceeds a threshold of 0.3, the response is labeled 1, otherwise it is deemed incorrect and labeled 0, as in previous works (Lin et al., 2024). |
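The correctness-labeling rule in the Experiment Setup row can be sketched as follows. The paper does not specify which ROUGE variant is used, so this minimal sketch assumes a simple unigram ROUGE-1 F1; the `rouge1_f1` and `label_response` helpers are hypothetical names, not the authors' code:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: overlap of token multisets between the two texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def label_response(prediction: str, reference: str, threshold: float = 0.3) -> int:
    """Label 1 (correct) if the ROUGE score exceeds the 0.3 threshold, else 0."""
    return 1 if rouge1_f1(prediction, reference) > threshold else 0
```

These binary labels, paired with the perturbation-derived features, are what the logistic regression classifier described above is trained on.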