A Theory for Token-Level Harmonization in Retrieval-Augmented Generation
Authors: Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the effectiveness of our method and support our theoretical findings. Code is available. |
| Researcher Affiliation | Academia | Shicheng Xu1,2, Liang Pang1, Huawei Shen1, Xueqi Cheng1. 1CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences |
| Pseudocode | No | The paper describes the proposed method, Tok-RAG, through conceptual frameworks, theoretical derivations, and figures (e.g., Figures 1 and 2), but presents no explicit pseudocode or algorithm blocks; the steps are described only in prose. |
| Open Source Code | Yes | Code is available: https://github.com/xsc1234/Tok-RAG |
| Open Datasets | Yes | We use TriviaQA Joshi et al. (2017), SQuAD Rajpurkar et al. (2016), and WebQuestions (WebQ) as the datasets. We use a knowledge-intensive dataset, T-REx Elsahar et al. (2018). We use ELI5 Fan et al. (2019), a knowledge-intensive dataset for LFQA. We use Wizard of Wikipedia Dinan et al. (2018) (WoW), a knowledge-powered dialogue dataset. We use WikiText-103 Merity (2016), a popular dataset for language modeling. We use Java and Python in CodeXGLUE Iyer et al. (2018) for this task. |
| Dataset Splits | No | The paper describes how test data and ground truth for the benefit-detriment comparison experiment are constructed by traversing sentences in the datasets, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the datasets used in the experiments (TriviaQA, SQuAD, etc.). |
| Hardware Specification | Yes | All models are run on a V100 GPU with PyTorch (Paszke et al., 2019) and accelerated by DeepSpeed. ... Experiments are performed on three Q&A datasets (TriviaQA, WebQ, SQuAD) with a V100 GPU; the LLM is LLaMA-2-7B. |
| Software Dependencies | Yes | All models are run on a V100 GPU with PyTorch (Paszke et al., 2019) and accelerated by DeepSpeed. |
| Experiment Setup | Yes | We use OPT-6.7B, LLaMA-2-7B, and Mistral-7B-v0.1 as LLMs in the benefit-detriment comparison experiment and use a greedy-decoding strategy for generation. As for retrieval in RAG, we follow (Xu et al., 2023) to use ColBERTv2 (Santhanam et al., 2021) as the retriever, and use Wikipedia consisting of 21,015,324 passages (Karpukhin et al., 2020) as the retrieval database. All baselines and Tok-RAG share the same retrieval setup and input. ... For all the above tasks, we give the Top-5 retrieved passages to each example. |
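The experiment-setup row states that each example receives the Top-5 retrieved passages as input before greedy decoding. A minimal sketch of that input assembly is shown below; the function name, prompt layout, and passage labels are assumptions for illustration, not the paper's actual Tok-RAG implementation.

```python
# Hypothetical sketch of the RAG input format described in the setup:
# prepend the top-k retrieved passages to the question before greedy
# decoding. Prompt layout and names are assumptions, not Tok-RAG's code.

def build_rag_prompt(question: str, passages: list[str], top_k: int = 5) -> str:
    """Prepend the top-k retrieved passages to the question."""
    context = "\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(passages[:top_k])
    )
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy by William Shakespeare.",
     "Shakespeare wrote Hamlet around 1600."],
)
```

The resulting string would then be fed to the LLM (OPT, LLaMA-2, or Mistral in the paper) with sampling disabled, so every baseline sees an identical retrieval context.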