Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Authors: Zhen Lin, Shubhendu Trivedi, Jimeng Sun

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs.
Researcher Affiliation | Academia | EMAIL; Shubhendu Trivedi EMAIL; Jimeng Sun 1,2 EMAIL; 1 University of Illinois at Urbana-Champaign; 2 Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes the methods through structured steps but does not present them within a clearly labeled 'Pseudocode' or 'Algorithm' block. For example, Section 4 details steps for quantifying uncertainty, but these are prose descriptions, not formally structured pseudocode.
Open Source Code | Yes | The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
Open Datasets | Yes | Following Kuhn et al. (2023), we use the open-book conversational question answering dataset, CoQA (coqa) (Reddy et al., 2019), and the closed-book QA dataset, TriviaQA (trivia) (Joshi et al., 2017). In addition, we also use the more challenging closed-book QA dataset, Natural Questions (nq) (Kwiatkowski et al., 2019).
Dataset Splits | Yes | We use the development split of coqa with 7,983 questions, the validation split of nq with 3,610 questions, and the validation split of the rc.nocontext subset of trivia with 9,960 (de-duplicated) questions. We repeat all experiments 10 times, each time with a random subset of 1,000 questions as the calibration set for hyper-parameters of U and C measures, and test the performance on the remaining data.
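The repeated calibration/test protocol quoted above can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from the paper's repository; the function name, seeding scheme, and index-based representation are assumptions.

```python
import random

def make_splits(n_questions, n_repeats=10, n_calibration=1000, seed=0):
    """Sketch of the evaluation protocol described in the report: repeat
    the experiment several times, each time drawing a random calibration
    subset (for tuning U/C hyper-parameters) and testing on the rest.
    (Name and seeding are illustrative, not from the paper.)"""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_questions))
        rng.shuffle(idx)
        # First n_calibration shuffled indices -> calibration, rest -> test.
        splits.append((idx[:n_calibration], idx[n_calibration:]))
    return splits

# e.g. for the coqa development split with 7,983 questions:
splits = make_splits(7983)
```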
Hardware Specification | Yes | We perform all experiments on a machine with 2x AMD EPYC Milan 7513 CPU, 512 GB RAM, 8x A6000 GPUs.
Software Dependencies | Yes | All gpt-3.5-turbo models used in this paper are the 0301 version.
Experiment Setup | Yes | For all U and C measures involving a NLI,entail and a NLI,contra, we need to choose a temperature for the NLI model. The temperature is chosen from 0.1, 0.25, 0.5, 1, 3, 5, and 7. For UEcc and CEcc, we also need to choose a cutoff for eigenvalues. For simplicity we use the same threshold for each experiment/dataset, and the threshold is chosen from 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. ... In Tables 7 to 9, the temperature is 1 for all models except for LLaMA2 (which uses 0.6) and top_p is 1 for all models except for LLaMA2 (which uses 0.9).
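The hyper-parameter selection described above amounts to a small grid search over the quoted candidate values on the calibration set. The sketch below assumes a caller-supplied scoring callback (`score_on_calibration`, a hypothetical stand-in for evaluating an uncertainty measure on the 1,000 calibration questions); neither the function name nor the selection criterion is specified by the paper.

```python
import itertools

# Candidate grids quoted in the experiment setup.
NLI_TEMPERATURES = [0.1, 0.25, 0.5, 1, 3, 5, 7]
EIG_THRESHOLDS = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def select_hyperparams(score_on_calibration):
    """Pick the (NLI temperature, eigenvalue cutoff) pair that maximizes a
    calibration-set score. `score_on_calibration(temp, thresh)` is assumed
    to return a higher-is-better quality metric for the U/C measure."""
    best, best_score = None, float("-inf")
    for temp, thresh in itertools.product(NLI_TEMPERATURES, EIG_THRESHOLDS):
        s = score_on_calibration(temp, thresh)
        if s > best_score:
            best, best_score = (temp, thresh), s
    return best
```

For UEcc/CEcc both grid axes matter; measures without an eigenvalue cutoff would simply ignore the second argument.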