Improving Uncertainty Estimation through Semantically Diverse Language Generation
Authors: Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs. ... Section 6 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Lukas Aichberger^1, Kajetan Schweighofer^1, Mykyta Ielanskyi^1, Sepp Hochreiter^1,2 — ^1 ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; ^2 NXAI GmbH, Linz, Austria |
| Pseudocode | Yes | Algorithm 1 SDLG ... Algorithm 2 Token Score Ranking |
| Open Source Code | Yes | The code and data are available at https://github.com/ml-jku/SDLG. |
| Open Datasets | Yes | TruthfulQA (Lin et al., 2022a), CoQA (Reddy et al., 2019), TriviaQA (Joshi et al., 2017) |
| Dataset Splits | Yes | To be concrete, we use the over 800 closed-book questions in TruthfulQA (Lin et al., 2022a) corresponding to whole-sentence answers, the almost 8,000 open-book questions in the development split of CoQA (Reddy et al., 2019) corresponding to medium- to shorter-length answers, and about 8,000 closed-book questions in the training split of TriviaQA (Joshi et al., 2017) corresponding to short, precise answers. |
| Hardware Specification | No | The paper mentions evaluating models with sizes ranging from 2.7 to 30 billion parameters and reporting compute in teraFLOPs, but does not provide specific details on the GPU or CPU models used for these computations. |
| Software Dependencies | No | The paper mentions using 'NLI model DeBERTa' but does not provide specific version numbers for DeBERTa or any other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For baseline methods, we performed an extensive hyperparameter search for each dataset with the SEMS temperature {0.25, 0.5, 1.0, 1.5, 2.0} and the SEDBS penalty term {0.2, 0.5, 1.0}. Also, each method uses 10 generations to assign an uncertainty estimator... We empirically found that the performance of our method is quite robust with respect to the weighting of the token scores. Therefore, throughout the experiments, we derive the final token score ranking by straightforwardly averaging the three individual token scores. |
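The experiment-setup quote states that SDLG derives its final token score ranking by averaging three individual token scores without tuned weights. A minimal sketch of that aggregation step, assuming three per-token score lists as input (the function and variable names here are illustrative, not the paper's implementation):

```python
def rank_tokens(scores_a, scores_b, scores_c):
    """Combine three per-token score lists by an unweighted average
    and return token indices ranked from highest to lowest score.

    Mirrors the paper's reported choice of straightforwardly averaging
    the three individual token scores (equal weighting), which the
    authors found to be robust across experiments.
    """
    combined = [(a + b + c) / 3.0
                for a, b, c in zip(scores_a, scores_b, scores_c)]
    # Sort token indices by descending combined score.
    return sorted(range(len(combined)), key=lambda i: -combined[i])


# Toy usage: three score vectors over four candidate tokens.
ranking = rank_tokens(
    [0.1, 0.9, 0.3, 0.5],
    [0.2, 0.8, 0.4, 0.6],
    [0.0, 1.0, 0.2, 0.4],
)
print(ranking)  # → [1, 3, 2, 0]
```

Equal weighting keeps the ranking free of an extra hyperparameter, consistent with the robustness claim quoted above; Algorithm 2 in the paper ("Token Score Ranking") covers the full procedure.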