MP: Endowing Large Language Models with Lateral Thinking

Authors: Tian Bai, Yongwang Cao, Yan Ge, Haitao Yu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results with five base LLMs across three lateral-thinking datasets demonstrate that all LLMs armed with MP consistently outperform the representative baseline methods. For example, with the GPT-3.5-turbo model, MP outperforms CoT prompting on Sentence Puzzle (+5.00%), Word Puzzle (+10.07%), BiRdQA (+6.48%), and RiddleSense (+2.65%).
Researcher Affiliation | Academia | (1) College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University; (2) Graduate School of Comprehensive Human Sciences, University of Tsukuba; (3) Institute of Library, Information and Media Science, University of Tsukuba. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper includes 'Figure 2: An overview of metacognitive prompting.', which visually represents the three steps (Strategy Formulating, Information Processing, Deep Reflection) with prompt examples. It also provides mathematical representations of processes, such as y = LLM(Q, C, S, Prompt_identify). However, it does not contain structured pseudocode blocks or algorithms with typical programming constructs such as loops or conditional statements.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to a code repository.
Open Datasets | Yes | To evaluate the efficacy of MP, we conducted experiments on three datasets: BRAINTEASER (Jiang et al. 2023), BiRdQA (Zhang and Wan 2022), and RiddleSense (Lin et al. 2021).
Dataset Splits | No | The paper states 'As adopted in datasets BRAINTEASER (Jiang et al. 2023), BiRdQA (Zhang and Wan 2022), and RiddleSense (RS) (Lin et al. 2021), we evaluate the model performance with accuracy.' and mentions 'RiddleSense (Dev)' in Table 1. While these imply the use of standard or development splits from the cited benchmarks, explicit details such as percentages or sample counts for training/validation/test splits are not provided in the main text for all datasets.
Hardware Specification | No | The paper mentions using 'closed-source GPT-4, GPT-3.5-turbo and Qwen-max [...] accessed via API invocations' and 'open-source models in our experiments: LLaMA3-8B [...], Qwen1.5-14B, and Qwen1.5-110B'. However, it does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments with the open-source models or for the API calls.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). It only mentions the LLM models used and that 'default settings, including temperature, top-k, and top-p' were utilized.
Experiment Setup | Yes | In the few-shot setting, the number of demonstrations is consistently set to 4 across all three baseline methods as well as in the proposed approach. For all models, the default sampling settings, including temperature, top-k, and top-p, were used to maintain consistency and reproducibility.
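The three MP steps named in the Pseudocode row (Strategy Formulating, Information Processing, Deep Reflection) can be sketched as a simple pipeline. This is a hypothetical illustration, not the paper's released code: `call_llm` is a stub standing in for any chat-completion API, and the prompt texts are invented placeholders.

```python
def call_llm(prompt: str) -> str:
    """Stub LLM call; replace with a real API invocation (GPT-3.5-turbo, etc.)."""
    return f"<response to: {prompt[:30]}>"

def metacognitive_prompting(question: str, choices: list[str]) -> str:
    # Step 1: Strategy Formulating -- elicit a lateral-thinking strategy S.
    strategy = call_llm(
        f"Question: {question}\nFormulate a strategy for solving this puzzle."
    )
    # Step 2: Information Processing -- apply S to pick a candidate,
    # mirroring the paper's y = LLM(Q, C, S, Prompt_identify).
    candidate = call_llm(
        f"Question: {question}\nChoices: {choices}\n"
        f"Strategy: {strategy}\nSelect the best answer."
    )
    # Step 3: Deep Reflection -- have the model re-examine its own answer.
    return call_llm(
        f"Question: {question}\nInitial answer: {candidate}\n"
        f"Reflect on this answer and give a final answer."
    )

answer = metacognitive_prompting(
    "What has keys but can't open locks?", ["a piano", "a map"]
)
```

With a real API behind `call_llm`, each step consumes the previous step's output, which is the chained structure Figure 2 of the paper depicts.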
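The 4-demonstration few-shot setup described in the Experiment Setup row can be sketched as prompt assembly. This is a minimal illustration under assumed conventions; the demonstration texts and Q/A layout are placeholders, not the paper's actual prompts.

```python
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Concatenate demonstrations and the test query into one prompt."""
    # The paper fixes the number of demonstrations at 4 for all methods.
    assert len(demos) == 4
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

demos = [(f"demo question {i}", f"demo answer {i}") for i in range(4)]
prompt = build_few_shot_prompt(demos, "Which option reflects lateral thinking?")
# Sampling parameters (temperature, top-k, top-p) are left at each model's
# defaults when this prompt is sent, matching the paper's setup.
```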