Shared Imagination: LLMs Hallucinate Alike

Authors: Yilun Zhou, Caiming Xiong, Silvio Savarese, Chien-Sheng Wu

TMLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
  Evidence: "We conduct a series of investigations into this phenomenon and discuss the implications of such model homogeneity on hallucination detection and computational creativity. We study 13 models from four model families... Fig. 2 shows the correctness and answering rates, along with the respective averages."
Researcher Affiliation: Industry
  Evidence: Yilun Zhou (EMAIL, Salesforce AI Research); Caiming Xiong (EMAIL, Salesforce AI Research); Silvio Savarese (EMAIL, Salesforce AI Research); Chien-Sheng Wu (EMAIL, Salesforce AI Research)
Pseudocode: No
  Evidence: The paper describes the experimental procedures and prompts in natural language and tables (e.g., Tables 2, 3, 4) and uses figures (e.g., Figure 1) to illustrate the framework, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes
  Evidence: "We will release and maintain code and data on a public website." Project website: https://yilunzhou.github.io/shared-imagination/
Open Datasets: Yes
  Evidence: "We will release and maintain code and data on a public website." Project website: https://yilunzhou.github.io/shared-imagination/
Dataset Splits: No
  Evidence: The paper describes generating 20 direct questions and 20 context questions for each of 17 topics per question model (QM), totaling 8840 questions, which are then used for evaluation. However, it does not specify traditional training, validation, or test splits, since the experiment evaluates pre-trained LLMs on newly generated content rather than training a model.
Hardware Specification: No
  Evidence: The paper does not provide specific details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments, such as generating questions or evaluating answer models.
Software Dependencies: No
  Evidence: The paper mentions specific LLM APIs and models, such as gpt-3.5-turbo-0125 and Mistral-7B-Instruct-v0.2, and refers to OpenAI's text-embedding-3-large, but it does not specify version numbers for ancillary software dependencies or libraries (e.g., Python, PyTorch, or CUDA versions) used in the experiments.
Experiment Setup: Yes
  Evidence: "QMs use temperature 1 to balance output quality and stochasticity, and AMs use temperature 0 for greedy answer selection. We select 17 topics on common college subjects... Each QM in Tab. 5 generates 20 direct questions and 20 context questions for each topic... The setup for the IQA procedure is summarized in Fig. 1. In the direct question generation mode (Tab. 2), the QM is asked to generate a standalone question. In the context-based question generation mode (Tab. 3), the model first writes a paragraph on a fictional concept, and then generates a question based on it. To elicit answers from the AM, we use the prompt in Tab. 4."
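The setup described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's released implementation: `call_llm` is a placeholder for a real chat-completion API call, the topic list is truncated (the paper uses 17 college subjects), and the "unanswerable" decline convention in `rates` is an assumed response format used here only to show how the answering and correctness rates in Fig. 2 could be computed.

```python
TOPICS = ["physics", "history"]   # illustrative; the paper uses 17 college subjects
N_QUESTIONS = 20                  # questions per topic, per generation mode

def call_llm(model, prompt, temperature):
    """Placeholder for a real chat-completion API call (hypothetical stub).
    The paper queries actual LLM APIs such as gpt-3.5-turbo-0125."""
    return f"[{model}|T={temperature}] {prompt}"

def run_iqa(question_models, answer_models):
    """Imaginary question answering (IQA) loop: each question model (QM)
    generates questions at temperature 1 (stochastic), and each answer
    model (AM) answers them at temperature 0 (greedy)."""
    records = []
    for qm in question_models:
        for topic in TOPICS:
            for mode in ("direct", "context"):
                for _ in range(N_QUESTIONS):
                    question = call_llm(
                        qm, f"Write a {mode} question about {topic}",
                        temperature=1)              # stochastic generation
                    for am in answer_models:
                        answer = call_llm(am, question,
                                          temperature=0)  # greedy answering
                        records.append({"qm": qm, "am": am, "topic": topic,
                                        "mode": mode, "answer": answer})
    return records

def rates(answers, correct):
    """Answering rate (fraction of questions not declined) and correctness
    rate, assuming an AM can decline with 'unanswerable' (an illustrative
    convention, not the paper's exact prompt format)."""
    answering = sum(a != "unanswerable" for a in answers) / len(answers)
    correctness = sum(a == c for a, c in zip(answers, correct)) / len(answers)
    return answering, correctness

records = run_iqa(["qm-1"], ["am-1", "am-2"])
print(len(records))  # 2 topics x 2 modes x 20 questions x 2 AMs = 160

# 3 of 4 questions answered, 2 answered correctly:
print(rates(["B", "unanswerable", "A", "D"], ["B", "C", "A", "C"]))  # (0.75, 0.5)
```

With 13 QMs, 17 topics, and both generation modes, the same loop yields the 8840 questions (13 x 17 x 40) reported in the Dataset Splits row.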