Towards Verifiable Text Generation with Generative Agent

Authors: Bin Ji, Huijun Liu, Mingzhe Du, Shasha Li, Xiaodong Liu, Jun Ma, Jie Yu, See-Kiong Ng

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate R2-MGA across five LLMs on the ALCE benchmark. The results reveal R2-MGA's exceptional capabilities in text generation with citations. In particular, compared to the selected baselines, it delivers up to +58.8% and +154.7% relative performance gains on answer correctness and citation quality, respectively. Extensive analyses strongly support the motivations of R2-MGA.
Researcher Affiliation | Academia | 1 College of Computer Science and Technology, National University of Defense Technology; 2 Nanyang Technological University; 3 National University of Singapore. EMAIL, EMAIL
Pseudocode | No | The paper describes the R2-MGA framework and its modules (Memory, Initialization, Assessment, Planning & Action) in detail in natural language and with a diagram (Figure 2), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing its own source code, nor does it include a link to a code repository for the R2-MGA methodology.
Open Datasets | Yes | The ALCE benchmark collects three datasets, i.e., ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019), and pre-defines automatic evaluation metrics.
Dataset Splits | Yes | For fair comparisons, we follow the task formalization presented in ALCE (Gao et al. 2023), as shown below: Given a question Q and retrieved documents (i.e., D = {d1, d2, ..., dm}) that contain the knowledge to answer Q, a generation system is required to generate an answer A by synthesizing di ∈ D. ... The ALCE benchmark collects three datasets, i.e., ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019), and pre-defines automatic evaluation metrics. We report more benchmark details in Appendix A.
Hardware Specification | Yes | We use four NVIDIA A100 40GB GPUs to run R2-MGA.
Software Dependencies | No | We build R2-MGA upon five LLMs including closed-source ChatGPT (gpt-3.5-turbo-0301) and GPT-4 (gpt-4-0613), and open-source LLaMA-2-70B-Chat (LLaMA-70B for short), Vicuna-13B, and LLaMA-2-7B-Chat (LLaMA-7B for short). ... to generate an in-depth rationale r explaining the reason for the decision-making process included in M. We combine the retrieved memory M with the reasoning rationale r as the best-matched demonstration in the Initialization, Assessment, and Planning & Action modules.
Experiment Setup | Yes | For R2-MGA built upon open-source LLMs, we evaluate it by setting the LLM temperature to 0.001, 0.1, 0.3, 0.5, 0.7, 0.9, and 1, respectively, and report the averaged performance. Limited by the API costs of ChatGPT and GPT-4, we solely set the temperature to 1 for them.
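The temperature-averaging protocol described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate_at_temperature` is a hypothetical placeholder for one full ALCE benchmark run at a given sampling temperature, and its dummy score exists only so the sketch runs.

```python
# Sketch of the temperature-sweep evaluation: run the same benchmark at each
# temperature setting and report the mean score across all runs.

TEMPERATURES = [0.001, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]


def evaluate_at_temperature(temperature: float) -> float:
    """Hypothetical placeholder for one ALCE evaluation run.

    A real run would sample answers from the LLM at this temperature and
    score answer correctness / citation quality with ALCE's metrics.
    """
    return 50.0 + temperature  # dummy score, for illustration only


def averaged_performance(temps=TEMPERATURES) -> float:
    """Average the benchmark score over all temperature settings."""
    scores = [evaluate_at_temperature(t) for t in temps]
    return sum(scores) / len(scores)
```

For the closed-source models, the same loop would simply collapse to a single run with `temps=[1.0]`, matching the cost-limited setting quoted above.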