Shared Imagination: LLMs Hallucinate Alike

Authors: Yilun Zhou, Caiming Xiong, Silvio Savarese, Chien-Sheng Wu

TMLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
  Evidence: "We conduct a series of investigations into this phenomenon and discuss the implications of such model homogeneity on hallucination detection and computational creativity. We study 13 models from four model families... Fig. 2 shows the correctness and answering rates, along with the respective averages."
Researcher Affiliation: Industry
  Evidence: Yilun Zhou (EMAIL, Salesforce AI Research); Caiming Xiong (EMAIL, Salesforce AI Research); Silvio Savarese (EMAIL, Salesforce AI Research); Chien-Sheng Wu (EMAIL, Salesforce AI Research)
Pseudocode: No
  Evidence: The paper describes the experimental procedures and prompts in natural language and tables (e.g., Tables 2, 3, 4) and uses figures (e.g., Figure 1) to illustrate the framework, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes
  Evidence: "We will release and maintain code and data on a public website." Project website: https://yilunzhou.github.io/shared-imagination/
Open Datasets: Yes
  Evidence: "We will release and maintain code and data on a public website." Project website: https://yilunzhou.github.io/shared-imagination/
Dataset Splits: No
  Evidence: The paper describes generating 20 direct questions and 20 context questions for each of 17 topics per question model (QM), totaling 8840 questions, which are then used for evaluation. However, it does not specify traditional training, validation, or test splits, since the experiment evaluates pre-trained LLMs on newly generated content rather than training a model.
Hardware Specification: No
  Evidence: The paper does not provide specific details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments, such as generating questions or evaluating answer models.
Software Dependencies: No
  Evidence: The paper mentions specific LLM APIs and models, such as gpt-3.5-turbo-0125 and Mistral-7B-Instruct-v0.2, and refers to OpenAI's text-embedding-3-large, but it does not specify version numbers for ancillary software dependencies or libraries (e.g., Python, PyTorch, or CUDA versions) used in the experiments.
Experiment Setup: Yes
  Evidence: "QMs use temperature 1 to balance output quality and stochasticity, and AMs use temperature 0 for greedy answer selection. We select 17 topics on common college subjects... Each QM in Tab. 5 generates 20 direct questions and 20 context questions for each topic... The setup for the IQA procedure is summarized in Fig. 1. In the direct question generation mode (Tab. 2), the QM is asked to generate a standalone question. In the context-based question generation mode (Tab. 3), the model first writes a paragraph on a fictional concept, and then generates a question based on it. To elicit answers from the AM, we use the prompt in Tab. 4."
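The setup described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's released implementation: `call_llm` is a placeholder for a real chat-completion API call, the topic list is truncated (the paper uses 17 college subjects), and the "unanswerable" decline convention in `rates` is an assumed response format used here only to show how the answering and correctness rates in Fig. 2 could be computed.

```python
TOPICS = ["physics", "history"]   # illustrative; the paper uses 17 college subjects
N_QUESTIONS = 20                  # questions per topic, per generation mode

def call_llm(model, prompt, temperature):
    """Placeholder for a real chat-completion API call (hypothetical stub).
    The paper queries actual LLM APIs such as gpt-3.5-turbo-0125."""
    return f"[{model}|T={temperature}] {prompt}"

def run_iqa(question_models, answer_models):
    """Imaginary question answering (IQA) loop: each question model (QM)
    generates questions at temperature 1 (stochastic), and each answer
    model (AM) answers them at temperature 0 (greedy)."""
    records = []
    for qm in question_models:
        for topic in TOPICS:
            for mode in ("direct", "context"):
                for _ in range(N_QUESTIONS):
                    question = call_llm(
                        qm, f"Write a {mode} question about {topic}",
                        temperature=1)              # stochastic generation
                    for am in answer_models:
                        answer = call_llm(am, question,
                                          temperature=0)  # greedy answering
                        records.append({"qm": qm, "am": am, "topic": topic,
                                        "mode": mode, "answer": answer})
    return records

def rates(answers, correct):
    """Answering rate (fraction of questions not declined) and correctness
    rate, assuming an AM can decline with 'unanswerable' (an illustrative
    convention, not the paper's exact prompt format)."""
    answering = sum(a != "unanswerable" for a in answers) / len(answers)
    correctness = sum(a == c for a, c in zip(answers, correct)) / len(answers)
    return answering, correctness

records = run_iqa(["qm-1"], ["am-1", "am-2"])
print(len(records))  # 2 topics x 2 modes x 20 questions x 2 AMs = 160

# 3 of 4 questions answered, 2 answered correctly:
print(rates(["B", "unanswerable", "A", "D"], ["B", "C", "A", "C"]))  # (0.75, 0.5)
```

With 13 QMs, 17 topics, and both generation modes, the same loop yields the 8840 questions (13 x 17 x 40) reported in the Dataset Splits row.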