Improving Uncertainty Estimation through Semantically Diverse Language Generation
Authors: Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs. ... Section 6 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Lukas Aichberger^1, Kajetan Schweighofer^1, Mykyta Ielanskyi^1, Sepp Hochreiter^1,2 — ^1 ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; ^2 NXAI GmbH, Linz, Austria |
| Pseudocode | Yes | Algorithm 1 SDLG ... Algorithm 2 Token Score Ranking |
| Open Source Code | Yes | The code and data are available at https://github.com/ml-jku/SDLG. |
| Open Datasets | Yes | TruthfulQA (Lin et al., 2022a), CoQA (Reddy et al., 2019), TriviaQA (Joshi et al., 2017) |
| Dataset Splits | Yes | To be concrete, we use the over 800 closed-book questions in TruthfulQA (Lin et al., 2022a) corresponding to whole-sentence answers, the almost 8,000 open-book questions in the development split of CoQA (Reddy et al., 2019) corresponding to medium- to shorter-length answers, and about 8,000 closed-book questions in the training split of TriviaQA (Joshi et al., 2017) corresponding to short, precise answers. |
| Hardware Specification | No | The paper mentions evaluating models with sizes ranging from 2.7 to 30 billion parameters and reporting compute in teraFLOPs, but does not provide specific details on the GPU or CPU models used for these computations. |
| Software Dependencies | No | The paper mentions using 'NLI model DeBERTa' but does not provide specific version numbers for DeBERTa or any other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For baseline methods, we performed an extensive hyperparameter search for each dataset with the SEMS temperature {0.25, 0.5, 1.0, 1.5, 2.0} and the SEDBS penalty term {0.2, 0.5, 1.0}. Also, each method uses 10 generations to assign an uncertainty estimator... We empirically found that the performance of our method is quite robust with respect to the weighting of the token scores. Therefore, throughout the experiments, we derive the final token score ranking by straightforwardly averaging the three individual token scores. |
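The experiment-setup quote states that SDLG derives its final token score ranking by averaging three individual token scores without tuned weights. A minimal sketch of that aggregation step, assuming three per-token score lists as input (the function and variable names here are illustrative, not the paper's implementation):

```python
def rank_tokens(scores_a, scores_b, scores_c):
    """Combine three per-token score lists by an unweighted average
    and return token indices ranked from highest to lowest score.

    Mirrors the paper's reported choice of straightforwardly averaging
    the three individual token scores (equal weighting), which the
    authors found to be robust across experiments.
    """
    combined = [(a + b + c) / 3.0
                for a, b, c in zip(scores_a, scores_b, scores_c)]
    # Sort token indices by descending combined score.
    return sorted(range(len(combined)), key=lambda i: -combined[i])


# Toy usage: three score vectors over four candidate tokens.
ranking = rank_tokens(
    [0.1, 0.9, 0.3, 0.5],
    [0.2, 0.8, 0.4, 0.6],
    [0.0, 1.0, 0.2, 0.4],
)
print(ranking)  # → [1, 3, 2, 0]
```

Equal weighting keeps the ranking free of an extra hyperparameter, consistent with the robustness claim quoted above; Algorithm 2 in the paper ("Token Score Ranking") covers the full procedure.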