Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Authors: Zhen Lin, Shubhendu Trivedi, Jimeng Sun

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs.
Researcher Affiliation | Academia | EMAIL; Shubhendu Trivedi EMAIL; Jimeng Sun 1,2 EMAIL; 1 University of Illinois at Urbana-Champaign; 2 Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes the methods through structured steps but does not present them within a clearly labeled 'Pseudocode' or 'Algorithm' block. For example, Section 4 details steps for quantifying uncertainty, but these are prose descriptions, not formally structured pseudocode.
Open Source Code | Yes | The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
Open Datasets | Yes | Following Kuhn et al. (2023), we use the open-book conversational question answering dataset, CoQA (coqa) (Reddy et al., 2019), and the closed-book QA dataset, TriviaQA (trivia) (Joshi et al., 2017). In addition, we also use the more challenging closed-book QA dataset, Natural Questions (nq) (Kwiatkowski et al., 2019).
Dataset Splits | Yes | We use the development split of coqa with 7,983 questions, the validation split of nq with 3,610 questions, and the validation split of the rc.nocontext subset of trivia with 9,960 (de-duplicated) questions. We repeat all experiments 10 times, each time with a random subset of 1,000 questions as the calibration set for hyper-parameters of U and C measures, and test the performance on the remaining data.
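The repeated calibration/test protocol quoted above can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from the paper's repository; the function name, seeding scheme, and index-based representation are assumptions.

```python
import random

def make_splits(n_questions, n_repeats=10, n_calibration=1000, seed=0):
    """Sketch of the evaluation protocol described in the report: repeat
    the experiment several times, each time drawing a random calibration
    subset (for tuning U/C hyper-parameters) and testing on the rest.
    (Name and seeding are illustrative, not from the paper.)"""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_questions))
        rng.shuffle(idx)
        # First n_calibration shuffled indices -> calibration, rest -> test.
        splits.append((idx[:n_calibration], idx[n_calibration:]))
    return splits

# e.g. for the coqa development split with 7,983 questions:
splits = make_splits(7983)
```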
Hardware Specification | Yes | We perform all experiments on a machine with 2x AMD EPYC Milan 7513 CPU, 512 GB RAM, 8x A6000 GPUs.
Software Dependencies | Yes | All gpt-3.5-turbo models used in this paper are the 0301 version.
Experiment Setup | Yes | For all U and C measures involving a NLI,entail and a NLI,contra, we need to choose a temperature for the NLI model. The temperature is chosen from 0.1, 0.25, 0.5, 1, 3, 5, and 7. For UEcc and CEcc, we also need to choose a cutoff for eigenvalues. For simplicity we use the same threshold for each experiment/dataset, and the threshold is chosen from 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. ... In Tables 7 to 9, the temperature is 1 for all models except for LLaMA2 (which uses 0.6) and top_p is 1 for all models except for LLaMA2 (which uses 0.9).
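The hyper-parameter selection described above amounts to a small grid search over the quoted candidate values on the calibration set. The sketch below assumes a caller-supplied scoring callback (`score_on_calibration`, a hypothetical stand-in for evaluating an uncertainty measure on the 1,000 calibration questions); neither the function name nor the selection criterion is specified by the paper.

```python
import itertools

# Candidate grids quoted in the experiment setup.
NLI_TEMPERATURES = [0.1, 0.25, 0.5, 1, 3, 5, 7]
EIG_THRESHOLDS = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def select_hyperparams(score_on_calibration):
    """Pick the (NLI temperature, eigenvalue cutoff) pair that maximizes a
    calibration-set score. `score_on_calibration(temp, thresh)` is assumed
    to return a higher-is-better quality metric for the U/C measure."""
    best, best_score = None, float("-inf")
    for temp, thresh in itertools.product(NLI_TEMPERATURES, EIG_THRESHOLDS):
        s = score_on_calibration(temp, thresh)
        if s > best_score:
            best, best_score = (temp, thresh), s
    return best
```

For UEcc/CEcc both grid axes matter; measures without an eigenvalue cutoff would simply ignore the second argument.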