Toward Verifiable Instruction-Following Alignment for Retrieval Augmented Generation

Authors: Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, Ji-Rong Wen

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental Setup. Datasets. We evaluate on 10+ benchmarks to comprehensively evaluate VIF-RAG. For instruction-following tasks in RAG scenarios, we use the FollowRAG benchmark introduced in Section 5, which covers 4 question-answering (QA) datasets. For general IF evaluation, we select two commonly used complex IF datasets, IFEval and FollowBench, along with the natural instruction dataset MT-Bench (Zheng et al. 2024) and the challenging chatbot IF benchmark Arena-Hard (Li et al. 2024c). Additionally, to measure the foundational abilities of LLMs, we further evaluate on two widely used general-ability evaluation sets, C-Eval (Huang et al. 2023) and MMLU (Hendrycks et al. 2021), as well as the mathematical reasoning dataset GSM8K (Cobbe et al. 2021) and the code evaluation benchmark HumanEval (Chen et al. 2021). Baselines. We select Mistral-7B (Jiang et al. 2023), Llama3-8B (Meta 2024), Qwen1.5-7B, and Qwen1.5-14B (Yang et al. 2024) as our backbone models, fine-tuned on ShareGPT and four QA training sets as the SFT version. Main Result. Our primary findings are presented in Table 1. Overall, VIF-RAG consistently surpasses all baselines on FollowRAG across multiple setups, highlighting the advantages of our method. Additionally, we have several key insights. Cross-Domain Validation. To explore the transferability of VIF-RAG, we conduct cross-domain validation on four natural instruction-following datasets and four foundational-ability benchmarks for LLMs in Table 2. Quantitative Analysis. Ablation Study. To examine the effects of various components in VIF-RAG, we conduct an ablation study in Table 3.
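The evaluation suite quoted above can be summarized as a small config sketch; the grouping keys below are my own organization (an assumption), not the paper's, with FollowRAG expanded into its four QA components.

```python
# Hypothetical grouping of the evaluation suite described above;
# the category keys are assumptions made for organization only.
EVAL_SUITE = {
    # FollowRAG's four knowledge-intensive QA components
    "rag_instruction_following": ["NQ", "TriviaQA", "HotpotQA", "WebQSP"],
    "general_instruction_following": ["IFEval", "FollowBench", "MT-Bench", "Arena-Hard"],
    "foundational_abilities": ["C-Eval", "MMLU", "GSM8K", "HumanEval"],
}

# Count the benchmarks across all categories.
n_benchmarks = sum(len(v) for v in EVAL_SUITE.values())
print(n_benchmarks)  # 12, consistent with the paper's "10+ benchmarks"
```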
Researcher Affiliation Academia 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Pseudocode No The paper describes methods and processes in paragraph form (e.g., "Instruction Synthesis from Scratch", "Instruction Composition & Verification"), but it does not contain any structured pseudocode or algorithm blocks with numbered steps, special formatting, or keywords like "Algorithm" or "Pseudocode".
Open Source Code Yes Code https://github.com/dongguanting/FollowRAG
Open Datasets Yes 2We use the training sets from Natural Questions, TriviaQA, HotpotQA, and WebQuestionsSP as mixed QA sources. ... 1) Open-Domain QA: Natural Questions (NQ) (Kwiatkowski et al. 2019) and TriviaQA (TQA) (Joshi et al. 2017); 2) Multi-Hop QA: HotpotQA (HQA) (Yang et al. 2018); and 3) Knowledge Base QA: WebQuestionsSP (WebQSP) (Yih et al. 2016). ... ShareGPT (Chiang et al. 2023), which provides authentic multi-turn human dialogue data, is our natural choice.
Dataset Splits Yes 2We use the training sets from Natural Questions, TriviaQA, HotpotQA, and WebQuestionsSP as mixed QA sources. ... FollowRAG is the first instruction-following evaluation dataset under the RAG scenario, comprising 2.8K samples and covering 22 fine-grained atomic instructions across 6 categories. ... FollowRAG includes 0.9K samples of single and dual atomic instructions, as well as 0.5K complex multi-instruction samples containing 3 and 4 atomic instructions, respectively. ... FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets.
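The quoted size breakdown is consistent with the stated 2.8K total if we read "0.9K samples of single and dual atomic instructions" and "0.5K ... containing 3 and 4 atomic instructions, respectively" as per-bucket counts (an assumption on my part). A quick sanity check:

```python
# Hedged sanity check of the FollowRAG composition quoted above, treating
# the 0.9K and 0.5K figures as per-bucket counts (my assumption, not
# confirmed by the paper's exact wording).
bucket_sizes = {1: 900, 2: 900, 3: 500, 4: 500}  # atomic instructions per sample -> sample count

total = sum(bucket_sizes.values())
print(total)  # 2800, matching the stated 2.8K samples
```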
Hardware Specification No The paper does not explicitly describe any specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments.
Software Dependencies No The paper mentions using a "Python executor" and "Python compiler" for verification, but no specific version numbers for Python or any libraries are provided. It also mentions "GPT-4-turbo-2024-04-09" and "GPT-4o" as supervised models, which are specific model instances rather than general software dependencies with version numbers.
Experiment Setup Yes We use a supervised model to iteratively rewrite instructions from the D_seed set in batches of 50 for K rounds, generating an augmented set D_aug. ... For the supervised model, we use GPT-4-turbo-2024-04-09. ... Samples scoring below 8 are excluded to refine our high-quality complex instruction set D_seed^complex. ... Based on the above cross metrics, we require that at least one of Acc_func and Acc_case of each instruction must exceed 0.5 ... we use the supervision model to evaluate the alignment between queries and instructions on a scale of 1 to 10, discarding samples that receive a score below 8. ... we use GPT-4o to evaluate whether the model's outputs correctly address the questions. The scoring criteria are as follows: Completely correct (1), Partially correct (0.5), Completely incorrect (0).
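The two filtering rules quoted above can be sketched as a small predicate. This is a minimal sketch, not the authors' code: the field names ("acc_func", "acc_case", "alignment_score") are assumptions, and I read the first rule as "for each instruction, at least one of Acc_func and Acc_case must exceed 0.5".

```python
# Hypothetical sketch of the quoted filtering rules; field names and the
# exact reading of the Acc_func/Acc_case requirement are assumptions.

def keep_sample(sample, acc_threshold=0.5, align_threshold=8):
    """Keep a synthesized sample only if (1) every instruction has at least
    one of Acc_func / Acc_case above the 0.5 threshold, and (2) the
    query-instruction alignment score (1-10 scale) is at least 8."""
    for inst in sample["instructions"]:
        if not (inst["acc_func"] > acc_threshold or inst["acc_case"] > acc_threshold):
            return False  # neither verification accuracy clears the bar
    return sample["alignment_score"] >= align_threshold

sample = {
    "instructions": [
        {"acc_func": 0.8, "acc_case": 0.2},  # passes via Acc_func
        {"acc_func": 0.4, "acc_case": 0.6},  # passes via Acc_case
    ],
    "alignment_score": 9,
}
print(keep_sample(sample))  # True
```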