A Theory for Token-Level Harmonization in Retrieval-Augmented Generation
Authors: Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the effectiveness of our method and support our theoretical findings. Code is available. |
| Researcher Affiliation | Academia | Shicheng Xu1,2, Liang Pang1, Huawei Shen1, Xueqi Cheng1. 1CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences |
| Pseudocode | No | The paper describes the proposed method, Tok-RAG, through conceptual frameworks, theoretical derivations, and figures (e.g., Figures 1 and 2), but presents no explicit pseudocode or algorithm blocks; the steps are described only in prose. |
| Open Source Code | Yes | Code is available: https://github.com/xsc1234/Tok-RAG |
| Open Datasets | Yes | We use TriviaQA Joshi et al. (2017), SQuAD Rajpurkar et al. (2016), and WebQuestions (WebQ) as the datasets. We use a knowledge-intensive dataset, T-REx Elsahar et al. (2018). We use ELI5 Fan et al. (2019), a knowledge-intensive dataset for LFQA. We use Wizard of Wikipedia Dinan et al. (2018) (WoW), a knowledge-powered dialogue dataset. We use WikiText-103 Merity (2016), a popular dataset for language modeling. We use Java and Python in CodeXGLUE Iyer et al. (2018) for this task. |
| Dataset Splits | No | The paper describes how test data and ground truth for the benefit-detriment comparison experiment are constructed by traversing sentences in the datasets, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the datasets used in the experiments (TriviaQA, SQuAD, etc.). |
| Hardware Specification | Yes | All models are run on a V100 GPU with PyTorch (Paszke et al., 2019) and accelerated by DeepSpeed. ... Experiments are performed on three Q&A datasets (TriviaQA, WebQ, SQuAD) with a V100 GPU; the LLM is LLaMA-2-7B. |
| Software Dependencies | Yes | All models are run on a V100 GPU with PyTorch (Paszke et al., 2019) and accelerated by DeepSpeed. |
| Experiment Setup | Yes | We use OPT-6.7B, LLaMA-2-7B, and Mistral-7B-v0.1 as LLMs in the benefit-detriment comparison experiment and use a greedy-decoding strategy for generation. As for retrieval in RAG, we follow (Xu et al., 2023) to use ColBERTv2 (Santhanam et al., 2021) as the retriever, and use Wikipedia consisting of 21,015,324 passages (Karpukhin et al., 2020) as the retrieval database. All baselines and Tok-RAG share the same retrieval setup and input. ... For all the above tasks, we give the Top-5 retrieved passages to each example. |
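The experiment-setup row states that each example receives the Top-5 retrieved passages as input before greedy decoding. A minimal sketch of that input assembly is shown below; the function name, prompt layout, and passage labels are assumptions for illustration, not the paper's actual Tok-RAG implementation.

```python
# Hypothetical sketch of the RAG input format described in the setup:
# prepend the top-k retrieved passages to the question before greedy
# decoding. Prompt layout and names are assumptions, not Tok-RAG's code.

def build_rag_prompt(question: str, passages: list[str], top_k: int = 5) -> str:
    """Prepend the top-k retrieved passages to the question."""
    context = "\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(passages[:top_k])
    )
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy by William Shakespeare.",
     "Shakespeare wrote Hamlet around 1600."],
)
```

The resulting string would then be fed to the LLM (OPT, LLaMA-2, or Mistral in the paper) with sampling disabled, so every baseline sees an identical retrieval context.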