SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction
Authors: Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that SePer not only aligns closely with human preferences but also offers a more precise and efficient evaluation of retrieval utility across diverse RAG scenarios. We conduct theoretical analysis and extensive experiments to demonstrate that SePer provides a more accurate, fine-grained, and efficient evaluation of retrieval utility. It is also generalizable across a broad range of RAG scenarios. |
| Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China 2The Hong Kong University of Science and Technology, Hong Kong SAR, China 3Carnegie Mellon University, Pittsburgh, USA |
| Pseudocode | Yes | Algorithm 1 SePer & ΔSePer Require: Model M, reference answer a, entailment model E, threshold τ, number of samples N. |
| Open Source Code | Yes | Furthermore, we augment the evaluation of various RAG systems with our SePer metric for the reference of future research, which is maintained in https://github.com/sepermetric/seper. |
| Open Datasets | Yes | We use EVOUNA Wang et al. (2024), a Question Answering (QA) benchmark, to evaluate the reliability of QA evaluators. Based on NATURAL QUESTIONS (NQ) Kwiatkowski et al. (2019) and TRIVIAQA Joshi et al. (2017), EVOUNA augmented the datasets with LLM-generated responses and asked humans to annotate whether a response is semantically equivalent to the golden answer. ... In the simple open QA setting, we use three representative datasets: NQ, MS MARCO Bajaj et al. (2016), and SQUAD Rajpurkar et al. (2016). ... In the reasoning-involved QA setting, we use four typical Multihop-QA datasets: 2WIKIMULTIHOPQA Ho et al. (2020), HOTPOTQA Yang et al. (2018), IIRC Ferguson et al. (2020), and MUSIQUE Trivedi et al. (2022b). |
| Dataset Splits | Yes | Wherever possible, we perform inference on the test set; if the test set is unavailable, we use the dev set instead. We re-sample the datasets, and for datasets with more than 1000 instances, we randomly select 1000 examples for inference. |
| Hardware Specification | Yes | Our experiments were conducted on high-performance servers, each equipped with either an Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz or an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz, 1TB of RAM, and 4/6 NVIDIA A800 GPUs with 80GB memory. |
| Software Dependencies | Yes | The software environment included Python 3.11, PyTorch 2.4, and NCCL 2.21.5 for reproducibility. |
| Experiment Setup | Yes | We conduct experiments using the Llama 2 model series Touvron et al. (2023) from the Meta Llama family, specifically Llama-2-7b-chat-hf, Llama-2-13b-chat-hf, and Llama-2-70b-chat-hf. Considering that the task involves instruction-following generation, we choose the chat versions of these models. To generate diverse and complete answers for SePer computation, we set the temperature parameter of each model to 1.0, enabled do_sample, and set the maximum tokens for generation to 512. For the retrieval corpus, we use the DPR version of the Wikipedia December 2018 dataset, following the configuration we utilize in the RAG framework FlashRAG Jin et al. (2024). We experiment with the set of top-k values for retrieval being {1, 5, 10}, and follow each method's official implementation for the hyper-parameters of different prompt compression methods. For reranker usage, we set the reranker model to BAAI/bge-reranker-large. We set the initial top-k value for retrieval to 20 and then apply the set {1, 5, 10} for the reranker to choose items, leveraging the reranker's ability to both rank and filter out irrelevant content. We enable half precision when calculating SePer. |
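The sampling protocol in the setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the `GEN_KWARGS` names follow the Hugging Face `generate()` convention, and the `resample` helper is a hypothetical name for the "randomly select 1000 examples for inference" step.

```python
import random

# Sampling hyper-parameters reported in the setup (keyword names assume
# the Hugging Face generate() API; the actual call in the repo may differ).
GEN_KWARGS = dict(do_sample=True, temperature=1.0, max_new_tokens=512)

def resample(dataset, cap=1000, seed=0):
    """Randomly downsample datasets with more than `cap` instances,
    as described in the dataset-splits row; smaller sets pass through."""
    if len(dataset) <= cap:
        return list(dataset)
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return rng.sample(list(dataset), cap)
```

A fixed seed is used here only so the illustrative subset is reproducible; the paper does not report its seeding scheme.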
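To make the metric concrete, here is a simplified sketch of a SePer-style score under one reading of the pseudocode row: sample N answers, use an entailment model to judge which entail the reference answer, and take the normalized probability mass of the entailing samples; retrieval utility is then the gain in this score with versus without retrieved context. The function names `seper` and `delta_seper` are hypothetical, and the exact definition (clustering, thresholding with τ) lives in the paper and repository.

```python
def seper(sample_probs, entails_reference):
    """Normalized probability mass of sampled answers that an
    entailment model judged to entail the reference answer."""
    total = sum(sample_probs)
    hit = sum(p for p, ok in zip(sample_probs, entails_reference) if ok)
    return hit / total

def delta_seper(score_with_retrieval, score_without_retrieval):
    """Retrieval utility as the reduction in semantic uncertainty:
    the SePer gain from conditioning generation on retrieved context."""
    return score_with_retrieval - score_without_retrieval
```

Under this sketch, a retrieval result is useful when it shifts the model's sampled-answer mass toward the reference answer's semantic cluster.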