FLUE: Streamlined Uncertainty Estimation for Large Language Models

Authors: Shiqi Gao, Tianxiang Gong, Zijie Lin, Runhua Xu, Haoyi Zhou, Jianxin Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the efficacy of FLUE through comprehensive comparisons with existing white-box and black-box methods, using various LLMs on open-ended QA tasks, assessing both token-level and sequence-level uncertainty. ... We utilize a lightweight state transition model (Luo and Wang 2024) to assess sequence uncertainty based on token-level uncertainties computed during the inference process. ... As shown in Table 1, FLUE achieved the highest AUROC on TriviaQA, NQ-Open, and SQuAD v2.0, and competitive performance on the remaining two datasets across eight LLMs. ... As shown in Table 2, FLUE achieves higher AUROC and lower ECE overall by emulating black-box model states with a proxy model.
Researcher Affiliation | Academia | 1 SKLCCSE, School of Computer Science and Engineering, Beihang University; 2 School of Software, Beihang University; 3 National University of Singapore; 4 Zhongguancun Laboratory, Beijing. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes through propositions and equations, such as Propositions 1, 2, and 3, and formulas (1) to (17). However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor figures with structured, code-like steps.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), SVAMP (Patel, Bhattamishra, and Goyal 2021), NQ-Open (Lee, Chang, and Toutanova 2019), TriviaQA (Joshi et al. 2017), BioASQ (Krithara et al. 2023).
Dataset Splits | No | The dataset selection in this study aligns with (Farquhar et al. 2024), including SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), SVAMP (Patel, Bhattamishra, and Goyal 2021), NQ-Open (Lee, Chang, and Toutanova 2019), TriviaQA (Joshi et al. 2017), and BioASQ (Krithara et al. 2023). ... In this setting, we tested the inference time of the Llama 2 7B model on the SQuAD dataset (after random sampling). The paper uses well-known public datasets and briefly refers to 'random sampling' for the SQuAD dataset, but it does not provide specific details on training/validation/test splits (e.g., percentages, exact counts, or citations to predefined splits) for the experiments conducted.
Hardware Specification | Yes | For hardware parameters, please refer to Appendix E.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their specific versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | For the uncertainty estimation baselines that require multiple generations, we fixed the number of generations at 10. The sampling count for FLUE layers is also set to 10. ... We also explored the impact of selecting various numbers of layers and sampling quantities on uncertainty estimation performance for FLUE. Figure 4 shows that different models exhibit diverse sensitivities to these parameters.
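The rows above repeatedly score methods by AUROC: an uncertainty estimator is good if higher uncertainty predicts incorrect answers. A minimal, self-contained sketch of that metric (not the authors' code; the toy scores and labels below are invented for illustration):

```python
def auroc(uncertainty, is_wrong):
    """AUROC of uncertainty as a predictor of incorrect answers:
    the probability that a wrong answer receives a higher
    uncertainty score than a correct one (ties count half)."""
    wrong = [u for u, w in zip(uncertainty, is_wrong) if w]
    right = [u for u, w in zip(uncertainty, is_wrong) if not w]
    wins = sum((uw > ur) + 0.5 * (uw == ur)
               for uw in wrong for ur in right)
    return wins / (len(wrong) * len(right))

# Toy example: a perfectly ranked estimator gives AUROC = 1.0.
scores = [0.9, 0.8, 0.2, 0.1]        # uncertainty per answer
labels = [True, True, False, False]  # True = answer was incorrect
print(auroc(scores, labels))  # → 1.0
```

A random estimator scores around 0.5 on this metric, which is why the paper reports AUROC rather than raw accuracy: it is invariant to the scale of the uncertainty scores and depends only on their ranking.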