SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction
Authors: Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that SePer not only aligns closely with human preferences but also offers a more precise and efficient evaluation of retrieval utility across diverse RAG scenarios. We conduct theoretical analysis and extensive experiments to demonstrate that SePer provides a more accurate, fine-grained, and efficient evaluation of retrieval utility. It is also generalizable across a broad range of RAG scenarios. |
| Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China 2The Hong Kong University of Science and Technology, Hong Kong SAR, China 3Carnegie Mellon University, Pittsburgh, USA |
| Pseudocode | Yes | Algorithm 1 SePer & ΔSePer Require: Model M, reference answer a, entailment model E, threshold τ, number of samples N. |
| Open Source Code | Yes | Furthermore, we augment the evaluation of various RAG systems with our SePer metric for the reference of future research, which is maintained in https://github.com/sepermetric/seper. |
| Open Datasets | Yes | We use EVOUNA Wang et al. (2024), a Question Answering (QA) benchmark, to evaluate the reliability of QA evaluators. Based on NATURAL QUESTIONS (NQ) Kwiatkowski et al. (2019) and TRIVIAQA Joshi et al. (2017), EVOUNA augmented the datasets with LLM-generated responses and asked humans to annotate whether a response is semantically equivalent to the golden answer. ... In the simple open QA setting, we use three representative datasets: NQ, MS MARCO Bajaj et al. (2016), and SQUAD Rajpurkar et al. (2016). ... In the reasoning-involved QA setting, we use four typical Multihop-QA datasets: 2WIKIMULTIHOPQA Ho et al. (2020), HOTPOTQA Yang et al. (2018), IIRC Ferguson et al. (2020), and MUSIQUE Trivedi et al. (2022b). |
| Dataset Splits | Yes | Wherever possible, we perform inference on the test set; if the test set is unavailable, we use the dev set instead. We re-sample the datasets, and for datasets with more than 1000 instances, we randomly select 1000 examples for inference. |
| Hardware Specification | Yes | Our experiments were conducted on high-performance servers, each equipped with either an Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz or an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz, 1TB of RAM, and 4/6 NVIDIA A800 GPUs with 80GB memory. |
| Software Dependencies | Yes | The software environment included Python 3.11, PyTorch 2.4, and NCCL 2.21.5 for reproducibility. |
| Experiment Setup | Yes | We conduct experiments using the Llama 2 model series Touvron et al. (2023) from the Meta Llama family, specifically Llama-2-7b-chat-hf, Llama-2-13b-chat-hf, and Llama-2-70b-chat-hf. Considering that the task involves instruction-following generation, we choose the chat versions of these models. To generate diverse and complete answers for SePer computation, we set the temperature parameter of each model to 1.0, enabled do_sample, and set the maximum tokens for generation to 512. For the retrieval corpus, we use the DPR version of the Wikipedia December 2018 dataset, following the configuration we utilize in the RAG framework FlashRAG Jin et al. (2024). We experiment with the set of top-k values for retrieval being {1, 5, 10}, and follow each method's official implementation for the hyper-parameters of different prompt compression methods. For reranker usage, we set the reranker model to BAAI/bge-reranker-large. We set the initial top-k value for retrieval to 20 and then apply the set {1, 5, 10} for the reranker to choose items, leveraging the reranker's ability to both rank and filter out irrelevant content. We enable half precision when calculating SePer. |
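The sampling protocol in the setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the `GEN_KWARGS` names follow the Hugging Face `generate()` convention, and the `resample` helper is a hypothetical name for the "randomly select 1000 examples for inference" step.

```python
import random

# Sampling hyper-parameters reported in the setup (keyword names assume
# the Hugging Face generate() API; the actual call in the repo may differ).
GEN_KWARGS = dict(do_sample=True, temperature=1.0, max_new_tokens=512)

def resample(dataset, cap=1000, seed=0):
    """Randomly downsample datasets with more than `cap` instances,
    as described in the dataset-splits row; smaller sets pass through."""
    if len(dataset) <= cap:
        return list(dataset)
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return rng.sample(list(dataset), cap)
```

A fixed seed is used here only so the illustrative subset is reproducible; the paper does not report its seeding scheme.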
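To make the metric concrete, here is a simplified sketch of a SePer-style score under one reading of the pseudocode row: sample N answers, use an entailment model to judge which entail the reference answer, and take the normalized probability mass of the entailing samples; retrieval utility is then the gain in this score with versus without retrieved context. The function names `seper` and `delta_seper` are hypothetical, and the exact definition (clustering, thresholding with τ) lives in the paper and repository.

```python
def seper(sample_probs, entails_reference):
    """Normalized probability mass of sampled answers that an
    entailment model judged to entail the reference answer."""
    total = sum(sample_probs)
    hit = sum(p for p, ok in zip(sample_probs, entails_reference) if ok)
    return hit / total

def delta_seper(score_with_retrieval, score_without_retrieval):
    """Retrieval utility as the reduction in semantic uncertainty:
    the SePer gain from conditioning generation on retrieved context."""
    return score_with_retrieval - score_without_retrieval
```

Under this sketch, a retrieval result is useful when it shifts the model's sampled-answer mass toward the reference answer's semantic cluster.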