Uncertainty Quantification in Retrieval Augmented Question Answering
Authors: Laura Perez-Beltrachini, Mirella Lapata
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on short-form information-seeking QA tasks (Rodriguez & Boyd-Graber, 2021) (see Figure 1 for an example). Results on six datasets show that our uncertainty estimator is comparable to or outperforms existing sampling-based methods while being more test-time efficient. |
| Researcher Affiliation | Academia | Laura Perez-Beltrachini EMAIL University of Edinburgh Mirella Lapata EMAIL University of Edinburgh |
| Pseudocode | No | The paper describes the methodology using natural language and mathematical formulations (e.g., Equations 1-13) but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Code and data are available at https://github.com/lauhaide/ragu |
| Open Datasets | Yes | We evaluate our approach to predicting answer uncertainty on short-form question answering tasks. Specifically, on the following six datasets: Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), Web Questions (Berant et al., 2013), SQuAD (Rajpurkar et al., 2016), and PopQA (Mallen et al., 2023). We also evaluate on RefuNQ (Liu et al., 2024a). |
| Dataset Splits | Yes | In Appendix F.1, we describe each dataset, provide example questions, and make available details about the splits used in our experiments, which follow Lee et al. (2019). We follow previous work (Lee et al., 2019) and use only the question and gold answers, i.e., the open versions of NQ, TQA, and SQuAD. We use the unfiltered TQA dataset. We follow the train/dev/test splits as used in previous work (Lee et al., 2019) and randomly split PopQA. RefuNQ only provides a test set, so our experiments on this dataset are zero-shot from a Passage Utility predictor trained on SQuAD. We follow Farquhar et al. (2024) and use 400 test examples randomly sampled from the original larger test datasets for evaluation of uncertainty quantification. Table 10 shows dataset statistics, i.e., the number of instances per Train/Development (Dev)/Test set. |
| Hardware Specification | Yes | For all models, inference was run on a single A100-80GB GPU. ... Training and inference was run on a single A100-40GB GPU; training ranges from 2 to 12 hours depending on the dataset. |
| Software Dependencies | No | The paper mentions various models and tools used (e.g., 'BERT-based encoder', 'ALBERT-xlarge model', 'Qwen2-72B-Instruct', 'Contriever-MSMARCO', 'vLLM for inference'), citing their respective papers. However, it does not specify concrete version numbers for general software libraries, programming languages, or underlying frameworks like Python, PyTorch, TensorFlow, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For inference, we set the maximum number of generated tokens to 50 for both the greedy (most likely answer) as well as temperature-scaled (sampled candidates) decoding. ... We train each predictor for 3 epochs, with a batch size of 32 examples, learning rate equal to 2e-5, and weight decay 0.001 (with the exception of Llama-3.1-8B and WebQ, where we used 0.01). For each predictor we performed search on values for λ, i.e., the contribution of the LBCE loss (Equation 5), and different criteria for model selection, i.e., the best at pairwise ranking or at both pairwise ranking and accuracy prediction (combined). Table 13 shows the configuration for each predictor. |
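The training hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is not the authors' code: the dataclass, field names, and helper function are our own illustrative assumptions; only the numeric values (3 epochs, batch size 32, learning rate 2e-5, weight decay 0.001 vs. 0.01 for Llama-3.1-8B on WebQ, 50-token decoding cap) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PredictorTrainConfig:
    """Hyperparameters reported for training each Passage Utility predictor.

    Field names are our own; values are taken from the paper's setup description.
    """
    epochs: int = 3
    batch_size: int = 32
    learning_rate: float = 2e-5
    weight_decay: float = 0.001   # 0.01 for Llama-3.1-8B on WebQ (see helper below)
    max_new_tokens: int = 50      # decoding cap for both greedy and sampled answers


def weight_decay_for(model: str, dataset: str) -> float:
    """Return the weight decay reported for a (model, dataset) pair.

    The paper uses 0.001 everywhere except Llama-3.1-8B on WebQ, where it uses 0.01.
    """
    if model == "Llama-3.1-8B" and dataset == "WebQ":
        return 0.01
    return 0.001
```

Such a configuration object would typically be passed to whatever training loop or trainer wrapper is used; the λ weight on the LBCE loss (Equation 5) is deliberately omitted here because it is searched per predictor (Table 13) rather than fixed.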