RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards
Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes the generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. |
| Researcher Affiliation | Academia | 1Northeastern University 2Tsinghua University 3Carnegie Mellon University |
| Pseudocode | No | The paper describes its methods using prose, mathematical equations, and a figure, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is available at https://github.com/OpenMatch/RAG-DDR. |
| Open Datasets | Yes | For all datasets and all baselines, we use bge-large (Xiao et al., 2023) to retrieve documents from MS MARCO 2.1 (Bajaj et al., 2016). The open-domain QA tasks consist of NQ (Kwiatkowski et al., 2019), MARCO QA (Bajaj et al., 2016) and TriviaQA (Joshi et al., 2017), which require models to retrieve factual knowledge to help LLMs answer the given question. For more complex tasks, such as multi-hop QA and dialogue, we use the HotpotQA dataset (Yang et al., 2018) and Wizard of Wikipedia (WoW) (Dinan et al., 2019) for evaluation. Besides, we also employ T-REx (Elsahar et al., 2018) to measure one-hop fact look-up abilities of models. |
| Dataset Splits | Yes | During the training of DDR, we collect ten datasets covering two tasks, open-domain QA and reasoning. Specifically, we randomly sample 32,805 samples for the training set and 2,000 samples for the development set in our experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as exact GPU or CPU models, or memory amounts. It mentions using models like Llama3-8B and MiniCPM-2.4B, which are software, not hardware specifications. |
| Software Dependencies | No | The paper mentions several software components and models such as bge-large (Xiao et al., 2023), MiniCPM-2.4B-sft (Hu et al., 2024), Llama3-8B-Instruct (Touvron et al., 2023), LoRA (Hu et al., 2022), DPO (Rafailov et al., 2024), and LangChain. However, it does not provide specific version numbers for these software components or any other libraries or runtime environments, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For DDR training, we use automatic metrics such as Rouge-L and Accuracy to calculate the reward and set β = 0.1. The learning rate is set to 5e-5, and each model is trained for one epoch. For the generation module, we feed 5 retrieved passages as external knowledge for augmenting the generation process. |
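The setup row above states that DDR's reward is computed with automatic metrics such as Rouge-L, but the paper does not reproduce the metric implementation. A minimal LCS-based Rouge-L F-score sketch is below; whitespace tokenization and the recall-weighted beta = 1.2 are assumptions (and this beta is unrelated to the DPO beta = 0.1 quoted above).

```python
def lcs_len(a, b):
    # Longest common subsequence length via standard dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate, reference, beta=1.2):
    # Rouge-L F-score between a generated answer and a gold answer.
    # Tokenization here is plain whitespace splitting (an assumption).
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    # F-measure weighted toward recall by beta, as in the original ROUGE formulation.
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Under DDR, a score like this would be computed for each sampled response and used as its reward when building preference pairs for DPO-style optimization.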