Uncertainty Quantification in Retrieval Augmented Question Answering
Authors: Laura Perez-Beltrachini, Mirella Lapata
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on short-form information-seeking QA tasks (Rodriguez & Boyd-Graber, 2021) (see Figure 1 for an example). Results on six datasets show that our uncertainty estimator is comparable to or outperforms existing sampling-based methods while being more test-time efficient. |
| Researcher Affiliation | Academia | Laura Perez-Beltrachini EMAIL University of Edinburgh Mirella Lapata EMAIL University of Edinburgh |
| Pseudocode | No | The paper describes the methodology using natural language and mathematical formulations (e.g., Equations 1-13) but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Code and data are available at https://github.com/lauhaide/ragu |
| Open Datasets | Yes | We evaluate our approach to predicting answer uncertainty on short-form question answering tasks. Specifically, on the following six datasets: Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), Web Questions (Berant et al., 2013), SQuAD (Rajpurkar et al., 2016), and PopQA (Mallen et al., 2023). We also evaluate on RefuNQ (Liu et al., 2024a). |
| Dataset Splits | Yes | In Appendix F.1, we describe each dataset, provide example questions, and make available details about the splits used in our experiments, which follow Lee et al. (2019). We follow previous work (Lee et al., 2019) and use only the question and gold answers, i.e., the open versions of NQ, TQA, and SQuAD. We use the unfiltered TQA dataset. We follow the train/dev/test splits as used in previous work (Lee et al., 2019) and randomly split PopQA. RefuNQ only provides a test set, so our experiments on this dataset are zero-shot from a Passage Utility predictor trained on SQuAD. We follow Farquhar et al. (2024) and use 400 test examples randomly sampled from the original larger test datasets for evaluation of uncertainty quantification. Table 10 shows dataset statistics, i.e., the number of instances per Train/Development (Dev)/Test set. |
| Hardware Specification | Yes | For all models, inference was run on a single A100-80GB GPU. ... Training and inference was run on a single A100-40GB GPU; training ranges from 2 to 12 hours depending on the dataset. |
| Software Dependencies | No | The paper mentions various models and tools used (e.g., 'BERT-based encoder', 'ALBERT-xlarge model', 'Qwen2-72B-Instruct', 'Contriever-MSMARCO', 'vLLM for inference'), citing their respective papers. However, it does not specify concrete version numbers for general software libraries, programming languages, or underlying frameworks like Python, PyTorch, TensorFlow, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For inference, we set the maximum number of generated tokens to 50 for both the greedy (most likely answer) as well as temperature-scaled (sampled candidates) decoding. ... We train each predictor for 3 epochs, with a batch size of 32 examples, learning rate equal to 2e-5, and weight decay 0.001 (with the exception of Llama-3.1-8B and WebQ, where we used 0.01). For each predictor we performed search on values for λ, i.e., the contribution of the LBCE loss (Equation 5), and different criteria for model selection, i.e., the best at pairwise ranking or at both pairwise ranking and accuracy prediction (combined). Table 13 shows the configuration for each predictor. |
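The training hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is not the authors' code: the dataclass, field names, and helper function are our own illustrative assumptions; only the numeric values (3 epochs, batch size 32, learning rate 2e-5, weight decay 0.001 vs. 0.01 for Llama-3.1-8B on WebQ, 50-token decoding cap) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PredictorTrainConfig:
    """Hyperparameters reported for training each Passage Utility predictor.

    Field names are our own; values are taken from the paper's setup description.
    """
    epochs: int = 3
    batch_size: int = 32
    learning_rate: float = 2e-5
    weight_decay: float = 0.001   # 0.01 for Llama-3.1-8B on WebQ (see helper below)
    max_new_tokens: int = 50      # decoding cap for both greedy and sampled answers


def weight_decay_for(model: str, dataset: str) -> float:
    """Return the weight decay reported for a (model, dataset) pair.

    The paper uses 0.001 everywhere except Llama-3.1-8B on WebQ, where it uses 0.01.
    """
    if model == "Llama-3.1-8B" and dataset == "WebQ":
        return 0.01
    return 0.001
```

Such a configuration object would typically be passed to whatever training loop or trainer wrapper is used; the λ weight on the LBCE loss (Equation 5) is deliberately omitted here because it is searched per predictor (Table 13) rather than fixed.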