Towards Verifiable Text Generation with Generative Agent

Authors: Bin Ji, Huijun Liu, Mingzhe Du, Shasha Li, Xiaodong Liu, Jun Ma, Jie Yu, See-Kiong Ng

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate R2-MGA across five LLMs on the ALCE benchmark. The results reveal R2-MGA's exceptional capabilities in text generation with citations. In particular, compared to the selected baselines, it delivers up to +58.8% and +154.7% relative performance gains on answer correctness and citation quality, respectively. Extensive analyses strongly support the motivations of R2-MGA.
Researcher Affiliation | Academia | 1 College of Computer Science and Technology, National University of Defense Technology; 2 Nanyang Technological University; 3 National University of Singapore. EMAIL, EMAIL
Pseudocode | No | The paper describes the R2-MGA framework and its modules (Memory, Initialization, Assessment, Planning & Action) in detail in natural language and with a diagram (Figure 2), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing its own source code, nor does it include a link to a code repository for the R2-MGA methodology.
Open Datasets | Yes | The ALCE benchmark collects three datasets, i.e., ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019), and pre-defines automatic evaluation metrics.
Dataset Splits | Yes | For fair comparisons, we follow the task formalization presented in ALCE (Gao et al. 2023), as shown below: Given a question Q and retrieved documents (i.e., D = {d1, d2, ..., dm}) that contain the knowledge to answer Q, a generation system is required to generate an answer A by synthesizing di ∈ D. ... The ALCE benchmark collects three datasets, i.e., ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019), and pre-defines automatic evaluation metrics. We report more benchmark details in Appendix A.
Hardware Specification | Yes | We use four NVIDIA A100 40GB GPUs to run R2-MGA.
Software Dependencies | No | We build R2-MGA upon five LLMs including closed-source ChatGPT (gpt-3.5-turbo-0301) and GPT-4 (gpt-4-0613), and open-source LLaMA-2-70B-Chat (LLaMA-70B for short), Vicuna-13B, and LLaMA-2-7B-Chat (LLaMA-7B for short). ... to generate an in-depth rationale r explaining the reason for the decision-making process included in M. We combine the retrieved memory M with the reasoning rationale r as the best-matched demonstration in the Initialization, Assessment, and Planning & Action modules.
Experiment Setup | Yes | For R2-MGA built upon open-source LLMs, we evaluate it by setting the LLM temperature to 0.001, 0.1, 0.3, 0.5, 0.7, 0.9, and 1, respectively, and report the averaged performance. Limited by the API costs of ChatGPT and GPT-4, we solely set the temperature to 1 for them.
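The temperature-averaging protocol described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate_at_temperature` is a hypothetical placeholder for one full ALCE benchmark run at a given sampling temperature, and its dummy score exists only so the sketch runs.

```python
# Sketch of the temperature-sweep evaluation: run the same benchmark at each
# temperature setting and report the mean score across all runs.

TEMPERATURES = [0.001, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]


def evaluate_at_temperature(temperature: float) -> float:
    """Hypothetical placeholder for one ALCE evaluation run.

    A real run would sample answers from the LLM at this temperature and
    score answer correctness / citation quality with ALCE's metrics.
    """
    return 50.0 + temperature  # dummy score, for illustration only


def averaged_performance(temps=TEMPERATURES) -> float:
    """Average the benchmark score over all temperature settings."""
    scores = [evaluate_at_temperature(t) for t in temps]
    return sum(scores) / len(scores)
```

For the closed-source models, the same loop would simply collapse to a single run with `temps=[1.0]`, matching the cost-limited setting quoted above.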