RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards
Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes the generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. |
| Researcher Affiliation | Academia | 1Northeastern University 2Tsinghua University 3Carnegie Mellon University |
| Pseudocode | No | The paper describes its methods using prose, mathematical equations, and a figure, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is available at https://github.com/OpenMatch/RAG-DDR. |
| Open Datasets | Yes | For all datasets and all baselines, we use bge-large (Xiao et al., 2023) to retrieve documents from MS MARCO 2.1 (Bajaj et al., 2016). The open-domain QA tasks consist of NQ (Kwiatkowski et al., 2019), MARCO QA (Bajaj et al., 2016) and TriviaQA (Joshi et al., 2017), which require models to retrieve factual knowledge to help LLMs answer the given question. For more complex tasks, such as multi-hop QA and dialogue, we use the HotpotQA dataset (Yang et al., 2018) and Wizard of Wikipedia (WoW) (Dinan et al., 2019) for evaluation. Besides, we also employ T-REx (Elsahar et al., 2018) to measure one-hop fact look-up abilities of models. |
| Dataset Splits | Yes | During the training of DDR, we collect ten datasets covering two tasks, open-domain QA and reasoning. Specifically, we randomly sample 32,805 samples for the training set and 2,000 samples for the development set in our experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as exact GPU or CPU models, or memory amounts. It mentions using models like Llama3-8B and MiniCPM-2.4B, which are software, not hardware specifications. |
| Software Dependencies | No | The paper mentions several software components and models such as bge-large (Xiao et al., 2023), MiniCPM-2.4B-sft (Hu et al., 2024), Llama3-8B-Instruct (Touvron et al., 2023), LoRA (Hu et al., 2022), DPO (Rafailov et al., 2024), and LangChain. However, it does not provide specific version numbers for these software components or any other libraries or runtime environments, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For DDR training, we use automatic metrics such as Rouge-L and Accuracy to calculate the reward and set β = 0.1. The learning rate is set to 5e-5, and each model is trained for one epoch. For the generation module, we feed 5 retrieved passages as external knowledge for augmenting the generation process. |
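The setup row above states that DDR's reward is computed with automatic metrics such as Rouge-L, but the paper does not reproduce the metric implementation. A minimal LCS-based Rouge-L F-score sketch is below; whitespace tokenization and the recall-weighted beta = 1.2 are assumptions (and this beta is unrelated to the DPO beta = 0.1 quoted above).

```python
def lcs_len(a, b):
    # Longest common subsequence length via standard dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate, reference, beta=1.2):
    # Rouge-L F-score between a generated answer and a gold answer.
    # Tokenization here is plain whitespace splitting (an assumption).
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    # F-measure weighted toward recall by beta, as in the original ROUGE formulation.
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Under DDR, a score like this would be computed for each sampled response and used as its reward when building preference pairs for DPO-style optimization.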