MP: Endowing Large Language Models with Lateral Thinking

Authors: Tian Bai, Yongwang Cao, Yan Ge, Haitao Yu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results with five base LLMs across three lateral-thinking datasets demonstrate that all LLMs armed with MP consistently outperform the representative baseline methods. For example, with the GPT-3.5-turbo model, MP outperforms CoT prompting on Sentence Puzzle (+5.00%), Word Puzzle (+10.07%), BiRdQA (+6.48%), and RiddleSense (+2.65%).
Researcher Affiliation | Academia | (1) College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University; (2) Graduate School of Comprehensive Human Sciences, University of Tsukuba; (3) Institute of Library, Information and Media Science, University of Tsukuba. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper includes 'Figure 2: An overview of metacognitive prompting.', which visually represents the three steps (Strategy Formulating, Information Processing, Deep Reflection) with prompt examples. It also provides mathematical representations of processes, such as y = LLM(Q, C, S, Prompt_identify). However, it does not contain structured pseudocode blocks or algorithms with typical programming constructs such as loops or conditional statements.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to a code repository.
Open Datasets | Yes | To evaluate the efficacy of MP, we conducted experiments on three datasets: BRAINTEASER (Jiang et al. 2023), BiRdQA (Zhang and Wan 2022), and RiddleSense (Lin et al. 2021).
Dataset Splits | No | The paper states 'As adopted in datasets BRAINTEASER (Jiang et al. 2023), BiRdQA (Zhang and Wan 2022), and RiddleSense (RS) (Lin et al. 2021), we evaluate the model performance with accuracy.' and mentions 'RiddleSense (Dev)' in Table 1. While these imply the use of standard or development splits from the cited benchmarks, explicit details such as percentages or sample counts for training/validation/test splits are not provided in the main text for all datasets.
Hardware Specification | No | The paper mentions using 'closed-source GPT-4, GPT-3.5-turbo and Qwen-max [...] accessed via API invocations' and 'open-source models in our experiments: LLaMA3-8B [...], Qwen1.5-14B, and Qwen1.5-110B'. However, it does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments with the open-source models or for the API calls.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). It only mentions the LLM models used and that 'default settings, including temperature, top-k, and top-p' were utilized.
Experiment Setup | Yes | In the few-shot setting, the number of demonstrations is consistently set to 4 across all three baseline methods as well as in the proposed approach. For all models, the default sampling settings, including temperature, top-k, and top-p, were used to maintain consistency and reproducibility.
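The three MP steps named in the Pseudocode row (Strategy Formulating, Information Processing, Deep Reflection) can be sketched as a simple pipeline. This is a hypothetical illustration, not the paper's released code: `call_llm` is a stub standing in for any chat-completion API, and the prompt texts are invented placeholders.

```python
def call_llm(prompt: str) -> str:
    """Stub LLM call; replace with a real API invocation (GPT-3.5-turbo, etc.)."""
    return f"<response to: {prompt[:30]}>"

def metacognitive_prompting(question: str, choices: list[str]) -> str:
    # Step 1: Strategy Formulating -- elicit a lateral-thinking strategy S.
    strategy = call_llm(
        f"Question: {question}\nFormulate a strategy for solving this puzzle."
    )
    # Step 2: Information Processing -- apply S to pick a candidate,
    # mirroring the paper's y = LLM(Q, C, S, Prompt_identify).
    candidate = call_llm(
        f"Question: {question}\nChoices: {choices}\n"
        f"Strategy: {strategy}\nSelect the best answer."
    )
    # Step 3: Deep Reflection -- have the model re-examine its own answer.
    return call_llm(
        f"Question: {question}\nInitial answer: {candidate}\n"
        f"Reflect on this answer and give a final answer."
    )

answer = metacognitive_prompting(
    "What has keys but can't open locks?", ["a piano", "a map"]
)
```

With a real API behind `call_llm`, each step consumes the previous step's output, which is the chained structure Figure 2 of the paper depicts.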
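The 4-demonstration few-shot setup described in the Experiment Setup row can be sketched as prompt assembly. This is a minimal illustration under assumed conventions; the demonstration texts and Q/A layout are placeholders, not the paper's actual prompts.

```python
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Concatenate demonstrations and the test query into one prompt."""
    # The paper fixes the number of demonstrations at 4 for all methods.
    assert len(demos) == 4
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

demos = [(f"demo question {i}", f"demo answer {i}") for i in range(4)]
prompt = build_few_shot_prompt(demos, "Which option reflects lateral thinking?")
# Sampling parameters (temperature, top-k, top-p) are left at each model's
# defaults when this prompt is sent, matching the paper's setup.
```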