Improving Retrieval Augmented Language Model with Self-Reasoning
Authors: Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, Haifeng Huang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our framework across four public datasets (two short-form QA datasets, one long-form QA dataset, and one fact verification dataset) to demonstrate its superiority. Our method can outperform existing state-of-the-art models and achieve performance comparable with GPT-4, using only 2,000 training samples. |
| Researcher Affiliation | Industry | Yuan Xia (1), Jingbo Zhou (2,*), Zhenhui Shi (1), Jun Chen (1), Haifeng Huang (1); (1) Baidu Inc., China; (2) Baidu Research, China |
| Pseudocode | Yes | More details and pseudo-codes can be found in the Appendix. |
| Open Source Code | No | The paper does not state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct an extensive experimental evaluation on two short-form QA datasets (Natural Question (Kwiatkowski et al. 2019) and Pop QA (Mallen et al. 2023)), one long-form QA dataset (ASQA (Stelmakh et al. 2022)), and one fact verification dataset (FEVER (Thorne et al. 2018)). |
| Dataset Splits | No | The paper describes generating its own training data ("We totally generate 10,000 training samples by GPT-4, after the filtering strategy by quality control, we finally keep 2,000 training samples with high quality"), but it does not specify train/test/validation splits for the public datasets used in the experiments (Natural Question, Pop QA, ASQA, FEVER). |
| Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions tools like DPR (Karpukhin et al. 2020) and Contriever (Izacard et al. 2021) and models like LLaMA2, but it does not specify version numbers for any key software components or libraries used. |
| Experiment Setup | No | The paper states that "Hyper-parameters for training are described in the Appendix." and refers to learning rates r_a, r_b, r_c without giving their specific values in the main text; the concrete experimental setup is thus deferred to the appendix. |