MP: Endowing Large Language Models with Lateral Thinking
Authors: Tian Bai, Yongwang Cao, Yan Ge, Haitao Yu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results with five base LLMs across three lateral thinking datasets demonstrate that all LLMs armed with MP consistently outperform the representative baseline methods. For example, MP demonstrates superior performance over CoT prompting across Sentence Puzzle (+5.00%), Word Puzzle (+10.07%), BiRdQA (+6.48%), and RiddleSense (+2.65%) with the GPT-3.5-turbo model. |
| Researcher Affiliation | Academia | (1) College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University; (2) Graduate School of Comprehensive Human Sciences, University of Tsukuba; (3) Institute of Library, Information and Media Science, University of Tsukuba. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper includes 'Figure 2: An overview of metacognitive prompting.' which visually represents the three steps (Strategy Formulating, Information Processing, Deep Reflection) with prompt examples. It also provides mathematical representations of processes such as y = LLM(Q, C, S, Prompt_identify). However, it does not contain structured pseudocode blocks or algorithms with typical programming constructs like loops or conditional statements. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to a code repository. |
| Open Datasets | Yes | To evaluate the efficacy of MP, we conducted experiments on three datasets: BRAINTEASER (Jiang et al. 2023), BiRdQA (Zhang and Wan 2022), and RiddleSense (Lin et al. 2021) |
| Dataset Splits | No | The paper states 'As adopted in datasets BRAINTEASER (Jiang et al. 2023), BiRdQA (Zhang and Wan 2022), and RiddleSense (RS) (Lin et al. 2021), we evaluate the model performance with accuracy.' and mentions 'RiddleSense (Dev)' in Table 1. While these imply the use of standard or development splits from the cited benchmarks, explicit details such as percentages or sample counts for training/validation/test splits are not provided in the main text for all datasets. |
| Hardware Specification | No | The paper mentions using 'closed-source GPT-4, GPT-3.5-turbo and Qwen-max [...] accessed via API invocations' and 'open-source models in our experiments: LLaMA3-8B [...], Qwen1.5-14B, and Qwen1.5-110B'. However, it does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments with the open-source models or for the API calls. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). It only mentions the LLM models used and that 'default settings, including temperature, top-k, and top-p' were utilized. |
| Experiment Setup | Yes | In the context of the few-shot setting, the number of demonstrations utilized is consistently set to 4 across all three baseline methods as well as in the proposed approach. For all models, the default settings, including temperature, top-k, and top-p, were utilized to maintain consistency and reproducibility. |
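The three-step metacognitive prompting loop described in the Pseudocode row (Strategy Formulating, Information Processing, Deep Reflection) can be sketched as a chain of LLM calls. This is a hedged illustration, not the authors' released code (none is available): the `call_llm` function and all prompt templates here are hypothetical placeholders, and only the three-step structure and the role of Q (question), C (choices), and S (strategy) follow the paper's description.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for an API call (e.g., GPT-3.5-turbo).
    # A real implementation would invoke a model endpoint here.
    return f"<model response to: {prompt[:40]}...>"

def metacognitive_prompting(question: str, choices: list[str]) -> str:
    # Step 1: Strategy Formulating -- elicit a lateral-thinking strategy S
    # for the puzzle (Q, C).
    strategy = call_llm(
        f"Question: {question}\nChoices: {choices}\n"
        "Which reasoning strategy best fits this puzzle?"
    )
    # Step 2: Information Processing -- apply S to pick a candidate answer,
    # loosely mirroring y = LLM(Q, C, S, Prompt_identify) from the paper.
    answer = call_llm(
        f"Question: {question}\nChoices: {choices}\nStrategy: {strategy}\n"
        "Apply the strategy and identify an answer."
    )
    # Step 3: Deep Reflection -- ask the model to verify or revise the
    # candidate answer before committing to it.
    final = call_llm(
        f"Question: {question}\nProposed answer: {answer}\n"
        "Reflect on whether this answer is consistent with the strategy "
        "and return the final answer."
    )
    return final

result = metacognitive_prompting(
    "What has keys but can't open locks?",
    ["a piano", "a map", "a door"],
)
print(type(result).__name__)  # str
```

With a real model behind `call_llm`, each step would consume the previous step's output, which is what distinguishes MP from a single-pass CoT prompt.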