Shared Imagination: LLMs Hallucinate Alike
Authors: Yilun Zhou, Caiming Xiong, Silvio Savarese, Chien-Sheng Wu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a series of investigations into this phenomenon and discuss the implications of such model homogeneity on hallucination detection and computational creativity. We study 13 models from four model families... Fig. 2 shows the correctness and answering rates, along with the respective averages. |
| Researcher Affiliation | Industry | Yilun Zhou (Salesforce AI Research); Caiming Xiong (Salesforce AI Research); Silvio Savarese (Salesforce AI Research); Chien-Sheng Wu (Salesforce AI Research) |
| Pseudocode | No | The paper describes the experimental procedures and prompts in natural language and tables (e.g., Tables 2, 3, 4) and uses figures (e.g., Figure 1) to illustrate the framework, but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will release and maintain code and data on a public website. Project website: https://yilunzhou.github.io/shared-imagination/ |
| Open Datasets | Yes | We will release and maintain code and data on a public website. Project website: https://yilunzhou.github.io/shared-imagination/ |
| Dataset Splits | No | The paper describes the generation of 20 direct questions and 20 context questions for each of 17 topics per Question Model (QM), totaling 8840 questions across the 13 models, which are then used for evaluation. However, it does not specify traditional training/validation/test splits, since the experiment evaluates pre-trained LLMs on newly generated content rather than training a model. |
| Hardware Specification | No | The paper does not provide specific details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments, such as generating questions or evaluating answer models. |
| Software Dependencies | No | The paper mentions specific LLM APIs and models, such as 'gpt-3.5-turbo-0125' and 'Mistral-7B-Instruct-v0.2', and refers to OpenAI's text-embedding-3-large, but it does not specify version numbers for ancillary software dependencies or libraries (e.g., Python, PyTorch, or CUDA) used for the experiments. |
| Experiment Setup | Yes | QMs use temperature 1 to balance output quality and stochasticity, and AMs use temperature 0 for greedy answer selection. We select 17 topics on common college subjects... Each QM in Tab. 5 generates 20 direct questions and 20 context questions for each topic... The setup for the IQA procedure is summarized in Fig. 1. In the direct question generation mode (Tab. 2), the QM is asked to generate a standalone question. In the context-based question generation mode (Tab. 3), the model first writes a paragraph on a fictional concept, and then generates a question based on it. To elicit answers from the AM, we use the prompt in Tab. 4. |
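The Experiment Setup row describes a two-role procedure: a Question Model (QM) sampled at temperature 1 invents questions about fictional concepts, and an Answer Model (AM) decoded at temperature 0 answers them. A minimal sketch of that loop is below; `chat` is a hypothetical stand-in for any chat-completion API, and the topic list and prompt wording are illustrative placeholders, not the paper's actual prompts (those are in its Tables 2-4).

```python
# Sketch of the IQA (imaginary question answering) loop, assuming a
# generic chat-completion function. Not the authors' code.

def chat(model: str, prompt: str, temperature: float) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return f"[{model} @ T={temperature}] response to: {prompt[:40]}"

TOPICS = ["linear algebra", "organic chemistry"]  # the paper uses 17 topics
N_QUESTIONS = 20  # per topic, per generation mode (direct / context-based)

def generate_direct_questions(qm: str, topic: str) -> list[str]:
    # QMs use temperature 1 "to balance output quality and stochasticity".
    return [chat(qm, f"Write a multiple-choice question about a fictional "
                     f"concept in {topic}.", temperature=1.0)
            for _ in range(N_QUESTIONS)]

def answer_question(am: str, question: str) -> str:
    # AMs use temperature 0 for greedy (deterministic) answer selection.
    return chat(am, f"Answer the following question:\n{question}",
                temperature=0.0)

if __name__ == "__main__":
    questions = {t: generate_direct_questions("qm-model", t) for t in TOPICS}
    answers = {t: [answer_question("am-model", q) for q in qs]
               for t, qs in questions.items()}
    print(sum(len(qs) for qs in questions.values()))  # 2 topics x 20 = 40
```

With the paper's full setup (17 topics, two generation modes, 13 QMs), the same loop structure yields the 8840 questions reported in the Dataset Splits row.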