MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
Authors: Yujing Wang, Hainan Zhang, Liang Pang, Binghui Guo, Hongwei Zheng, Zhiming Zheng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines. Further analysis shows that multi-aspect dense rewards yield a more stable training process and better generation results than a single reward, validating the stability and transferability of MaFeRw. |
| Researcher Affiliation | Academia | Yujing Wang¹², Hainan Zhang¹²*, Liang Pang⁴, Binghui Guo², Hongwei Zheng³, Zhiming Zheng¹² — ¹Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; ²School of Artificial Intelligence, Beihang University, China; ³Beijing Academy of Blockchain and Edge Computing, China; ⁴Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China |
| Pseudocode | No | The paper describes the methodology using prose and mathematical formulations but does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code: https://github.com/TAP-LLM/MaFeRw |
| Open Datasets | Yes | We conduct main experiments on two multi-turn dialogue RAG datasets, QReCC (Anantha et al. 2020) and TopiOCQA (Adlakha et al. 2022), and conduct the transferability experiment on the WSDM@24 Multi-Doc QA dataset (https://sites.google.com/view/wsdm24-docqa). |
| Dataset Splits | No | The paper mentions using 'test sets' for reward model accuracy but does not provide specific details on how the datasets (QReCC, TopiOCQA, WSDM@24 Multi-Doc QA) were split into training, validation, or test sets with percentages or sample counts. |
| Hardware Specification | No | The paper mentions the use of specific models like T5-base and Llama-2-13b-chat, but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'pre-trained T5-base model', 'FAISS (Johnson, Douze, and Jegou 2021)', 'msmarco-roberta-base-ance-firstp (Reimers and Gurevych 2019)', and 'Llama-2-13b-chat model (Touvron et al. 2023)'. However, it does not provide specific version numbers for these software libraries, models, or any underlying programming languages or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | No | The paper states only that details on hyperparameter determination are provided in the Appendix; the main text does not specify hyperparameter values or training configurations. |