FLUE: Streamlined Uncertainty Estimation for Large Language Models

Authors: Shiqi Gao, Tianxiang Gong, Zijie Lin, Runhua Xu, Haoyi Zhou, Jianxin Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the efficacy of FLUE through comprehensive comparisons with existing white-box and black-box methods, using various LLMs on open-ended QA tasks, assessing both token-level and sequence-level uncertainty. ... We utilize a lightweight state transition model (Luo and Wang 2024) to assess sequence uncertainty based on token-level uncertainties computed during the inference process. ... As shown in Table 1, FLUE achieved the highest AUROC on TriviaQA, NQ-Open, and SQuAD v2.0, and competitive performance on the remaining two datasets across eight LLMs. ... As shown in Table 2, FLUE achieves higher AUROC and lower ECE overall by emulating black-box model states with a proxy model.
Researcher Affiliation | Academia | 1 SKLCCSE, School of Computer Science and Engineering, Beihang University; 2 School of Software, Beihang University; 3 National University of Singapore; 4 Zhongguancun Laboratory, Beijing. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes through propositions and equations, such as Propositions 1, 2, and 3, and formulas (1) to (17). However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor figures with structured, code-like steps.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), SVAMP (Patel, Bhattamishra, and Goyal 2021), NQ-Open (Lee, Chang, and Toutanova 2019), TriviaQA (Joshi et al. 2017), BioASQ (Krithara et al. 2023).
Dataset Splits | No | The dataset selection in this study aligns with (Farquhar et al. 2024), including SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), SVAMP (Patel, Bhattamishra, and Goyal 2021), NQ-Open (Lee, Chang, and Toutanova 2019), TriviaQA (Joshi et al. 2017), and BioASQ (Krithara et al. 2023). ... In this setting, we tested the inference time of the Llama 2 7B model on the SQuAD dataset (after random sampling). The paper uses well-known public datasets and briefly refers to 'random sampling' for the SQuAD dataset, but it does not provide specific details on training/validation/test splits (e.g., percentages, exact counts, or citations to predefined splits) for the experiments conducted.
Hardware Specification | Yes | For hardware parameters, please refer to Appendix E.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their specific versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | For the uncertainty estimation baselines that require multiple generations, we fixed the number of generations at 10. The sampling count for FLUE layers is also set to 10. ... We also explored the impact of selecting various numbers of layers and sampling quantities on uncertainty estimation performance for FLUE. Figure 4 shows that different models exhibit diverse sensitivities to these parameters.
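The rows above repeatedly score methods by AUROC: an uncertainty estimator is good if higher uncertainty predicts incorrect answers. A minimal, self-contained sketch of that metric (not the authors' code; the toy scores and labels below are invented for illustration):

```python
def auroc(uncertainty, is_wrong):
    """AUROC of uncertainty as a predictor of incorrect answers:
    the probability that a wrong answer receives a higher
    uncertainty score than a correct one (ties count half)."""
    wrong = [u for u, w in zip(uncertainty, is_wrong) if w]
    right = [u for u, w in zip(uncertainty, is_wrong) if not w]
    wins = sum((uw > ur) + 0.5 * (uw == ur)
               for uw in wrong for ur in right)
    return wins / (len(wrong) * len(right))

# Toy example: a perfectly ranked estimator gives AUROC = 1.0.
scores = [0.9, 0.8, 0.2, 0.1]        # uncertainty per answer
labels = [True, True, False, False]  # True = answer was incorrect
print(auroc(scores, labels))  # → 1.0
```

A random estimator scores around 0.5 on this metric, which is why the paper reports AUROC rather than raw accuracy: it is invariant to the scale of the uncertainty scores and depends only on their ranking.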