Citations and Trust in LLM Generated Responses

Authors: Yifan Ding, Matthew Facciani, Ellen Joyce, Amrit Poudel, Sanmitra Bhattacharya, Balaji Veeramani, Sal Aguinaga, Tim Weninger

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We tested this hypothesis with a live question-answering experiment that presented text responses generated by a commercial chatbot along with varying numbers of citations (zero, one, or five), both relevant and random, and recorded whether participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations.
Researcher Affiliation | Collaboration | (1) Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556; (2) AI Center for Excellence, Deloitte & Touche LLP, New York City, NY 10112
Pseudocode | No | The paper describes the methodology of the experiment in narrative form and with diagrams, but does not include any specific pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/yifding/Trust_Citation_LLM
Open Datasets | Yes | All participant questions, their responses, and their ratings are available in an Excel file, publicly available online at https://osf.io/yqm8z/
Dataset Splits | Yes | The study had 303 total participants who were randomly assigned to the experimental groups (i.e., a between-subjects design). Participants saw either zero (N=108), one (N=96), or five (N=101) citations. Of the two groups who saw citations (N=197), a random split of the participants received a random citation (N=87) or a valid citation (N=110).
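The two-stage random assignment described in that row (first to a citation-count group, then, for participants who see citations, to the valid/random condition) can be sketched as follows. This is a hypothetical illustration, not the authors' code; the function name and group labels are assumptions, and only the overall cohort size (303) comes from the paper.

```python
import random

# Citation-count groups from the paper's between-subjects design.
CITATION_GROUPS = ["zero", "one", "five"]


def assign_participant(rng: random.Random) -> dict:
    """Assign one participant to a citation-count group and, if they
    will see citations, to the valid/random citation condition."""
    group = rng.choice(CITATION_GROUPS)
    # Only participants who see at least one citation get a
    # valid-vs-random condition; the zero-citation group does not.
    condition = rng.choice(["valid", "random"]) if group != "zero" else None
    return {"citations": group, "condition": condition}


# Simulate a cohort the size of the study's sample (N=303).
rng = random.Random(0)
cohort = [assign_participant(rng) for _ in range(303)]
```

With uniform `rng.choice` the three groups come out roughly equal in expectation, consistent with the reported Ns (108/96/101).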
Hardware Specification | No | The paper mentions using a commercial chatbot (ChatGPT4) and a Web search API, but does not specify any hardware details (e.g., GPU models, CPU types, memory amounts) used for conducting the experiments or analysis.
Software Dependencies | No | The authors used Stata software for data and statistical analysis. The paper also mentions using ChatGPT4, but no specific version for Stata or ChatGPT4 (beyond the model name "GPT4") is provided, nor are other software dependencies listed with version numbers.
Experiment Setup | Yes | The experiment was a randomized controlled trial (RCT) with a between-subjects 3 × 2 factorial design (see Fig. 2). The first factor was the number of citations shown: zero, one, or five; the second factor was the nature of those citations: valid or random. Each question was fed directly to ChatGPT4, the response was collected, and it was truncated to three sentences if necessary before being shown to the participant; in the no-citation condition, this truncated response was presented on its own. In the one-citation condition, the top citation was provided to the participant as a numeral (e.g., [1]); in the five-citation condition, all five citations were provided as a list of numerals (e.g., [1,2,3,4,5]). In the random-citation condition, the actual citations were recorded, but the citation URL(s) shown to the participant were randomly selected from citations of previous participants' questions. Participants were asked to enter ten questions and rate each response.
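The two mechanical steps in that setup, truncating a chatbot response to three sentences and rendering the citation markers shown to participants, can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the regex-based sentence splitter are hypothetical, not the authors' implementation; only the three-sentence limit and the marker formats ([1] and [1,2,3,4,5]) come from the paper.

```python
import re


def truncate_to_three_sentences(text: str) -> str:
    """Keep at most the first three sentences of a chatbot response.
    A simple split on sentence-ending punctuation followed by whitespace
    stands in for whatever method the authors actually used."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:3])


def format_citation_markers(n_citations: int) -> str:
    """Render citation markers as shown to participants:
    '' for zero, '[1]' for one, '[1,2,3,4,5]' for five."""
    if n_citations == 0:
        return ""
    return "[" + ",".join(str(i + 1) for i in range(n_citations)) + "]"
```

For example, `format_citation_markers(5)` yields `[1,2,3,4,5]`, matching the five-citation condition described above.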