Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
Authors: Zhen Lin, Shubhendu Trivedi, Jimeng Sun
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. |
| Researcher Affiliation | Academia | Zhen Lin¹ EMAIL; Shubhendu Trivedi EMAIL; Jimeng Sun¹,² EMAIL; ¹University of Illinois at Urbana-Champaign; ²Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes the methods through structured steps but does not present them within a clearly labeled 'Pseudocode' or 'Algorithm' block. For example, Section 4 details steps for quantifying uncertainty, but these are prose descriptions, not formally structured pseudocode. |
| Open Source Code | Yes | The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG. |
| Open Datasets | Yes | Following Kuhn et al. (2023), we use the open-book conversational question answering dataset, CoQA (coqa) (Reddy et al., 2019), and the closed-book QA dataset, TriviaQA (trivia) (Joshi et al., 2017). In addition, we also use the more challenging closed-book QA dataset, Natural Questions (nq) (Kwiatkowski et al., 2019). |
| Dataset Splits | Yes | We use the development split of coqa with 7,983 questions, the validation split of nq with 3,610 questions, and the validation split of the rc.nocontext subset of trivia with 9,960 (de-duplicated) questions. We repeat all experiments 10 times, each time with a random subset of 1,000 questions as the calibration set for hyper-parameters of U and C measures, and test the performance on the remaining data. |
| Hardware Specification | Yes | We perform all experiments on a machine with 2x AMD EPYC Milan 7513 CPU, 512 GB RAM, 8x A6000 GPUs. |
| Software Dependencies | Yes | All gpt-3.5-turbo models used in this paper are the 0301 version. |
| Experiment Setup | Yes | For all U and C measures involving a_NLI,entail and a_NLI,contra, we need to choose a temperature for the NLI model. The temperature is chosen from 0.1, 0.25, 0.5, 1, 3, 5, and 7. For U_Ecc and C_Ecc, we also need to choose a cutoff for eigenvalues. For simplicity we use the same threshold for each experiment/dataset, and the threshold is chosen from 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. ... In Tables 7 to 9, the temperature is 1 for all models except for LLaMA2 (which uses 0.6) and top_p is 1 for all models except for LLaMA2 (which uses 0.9). |
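The calibration protocol described in the Dataset Splits and Experiment Setup rows (a random 1,000-question calibration subset drawn 10 times, with NLI temperature and eigenvalue cutoff picked by grid search) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `split_calibration`, `select_hyperparams`, and the `calibration_score` callback are hypothetical names, and the scoring metric (e.g. AUROC of the uncertainty measure on the calibration set) is supplied by the caller.

```python
import random

# Hyper-parameter grids quoted from the paper's Experiment Setup row.
NLI_TEMPERATURES = [0.1, 0.25, 0.5, 1, 3, 5, 7]    # NLI model temperature
EIGEN_THRESHOLDS = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # eigenvalue cutoff for U_Ecc / C_Ecc


def split_calibration(questions, n_cal=1000, seed=0):
    """Randomly hold out n_cal questions for calibration; the rest form the test set.

    The paper repeats this 10 times with fresh random subsets (vary `seed`).
    """
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return shuffled[:n_cal], shuffled[n_cal:]


def select_hyperparams(calibration_score):
    """Grid-search the (temperature, threshold) pair maximizing a calibration metric.

    `calibration_score(temperature, threshold)` is a caller-supplied function
    (an assumption here) that evaluates the U/C measure on the calibration set.
    """
    return max(
        ((t, c) for t in NLI_TEMPERATURES for c in EIGEN_THRESHOLDS),
        key=lambda tc: calibration_score(*tc),
    )
```

One run per dataset would then look like: split once, pick hyper-parameters on the calibration half, report the measure's performance on the held-out remainder, and average over the 10 repetitions.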