Improving Retrieval Augmented Language Model with Self-Reasoning
Authors: Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, Haifeng Huang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our framework across four public datasets (two short-form QA datasets, one long-form QA dataset, and one fact verification dataset) to demonstrate its superiority. Our method can outperform existing state-of-the-art models and achieve performance comparable with GPT-4, using only 2,000 training samples. |
| Researcher Affiliation | Industry | Yuan Xia (1), Jingbo Zhou (2,*), Zhenhui Shi (1), Jun Chen (1), Haifeng Huang (1); (1) Baidu Inc., China; (2) Baidu Research, China |
| Pseudocode | Yes | More details and pseudo-codes can be found in the Appendix. |
| Open Source Code | No | The paper does not state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct an extensive experimental evaluation on two short-form QA datasets (Natural Question (Kwiatkowski et al. 2019) and Pop QA (Mallen et al. 2023)), one long-form QA dataset (ASQA (Stelmakh et al. 2022)), and one fact verification dataset (FEVER (Thorne et al. 2018)). |
| Dataset Splits | No | The paper describes generating its own training data ("We totally generate 10,000 training samples by GPT-4, after the filtering strategy by quality control, we finally keep 2,000 training samples with high quality"), but it does not specify train/test/validation splits for the public datasets used in the experiments (Natural Question, Pop QA, ASQA, FEVER). |
| Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions tools like DPR (Karpukhin et al. 2020) and Contriever (Izacard et al. 2021) and models like LLaMA2, but it does not specify version numbers for any key software components or libraries used. |
| Experiment Setup | No | The paper states that "Hyper-parameters for training are described in the Appendix." and refers to learning rates r_a, r_b, r_c without giving their specific values in the main text; the concrete experimental setup is thus deferred to the appendix. |