Prompt Engineering Techniques for Language Model Reasoning Lack Replicability
Authors: Laurène Vaugrante, Mathias Niepert, Thilo Hagendorff
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, and BLOOM with six prompt engineering techniques: Chain-of-Thought, Sandbagging, Emotion Prompting, Re-Reading, Rephrase-and-Respond (RaR), and Expert Prompting. We applied them to manually double-checked subsets of reasoning benchmarks including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested, highlighting, among others, several methodological weaknesses in previous research. |
| Researcher Affiliation | Academia | Laurène Vaugrante, Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart; Mathias Niepert, Institute for Artificial Intelligence, University of Stuttgart; Thilo Hagendorff, Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart |
| Pseudocode | No | The paper describes methodologies and experimental steps in prose, such as in section 2.3 'Experiments' and sections 3.1-3.6 discussing each prompt engineering technique. It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is accessible here: https://github.com/Laurene-v/replicating_PET. By lowering the practical barrier to replication and promoting iterative experimentation, we aim to foster a culture of methodological transparency and empirical verification in prompt-engineering research. The datasets and code generated during this study are available in the Replication Crisis In LLM Evaluation repository on the Open Science Framework (OSF) at https://osf.io/hcygf/?view_only=fe25a85157734f68882777404aeb655c and at https://github.com/Laurene-v/replicating_PET. |
| Open Datasets | Yes | To replicate the claimed impact of the selected prompt engineering techniques on LLM reasoning abilities, we selected five different benchmarks, each measuring a different type of reasoning: CommonsenseQA (Talmor et al., 2019), CRT (Hagendorff et al., 2023), NumGLUE (Mishra et al., 2022), ScienceQA (Lu et al., 2022) and StrategyQA (Geva et al., 2021). The datasets and code generated during this study are available in the Replication Crisis In LLM Evaluation repository on the Open Science Framework (OSF) at https://osf.io/hcygf/?view_only=fe25a85157734f68882777404aeb655c and at https://github.com/Laurene-v/replicating_PET. |
| Dataset Splits | Yes | Therefore, we chose to hand-pick (through rule-based filtering and manual checks) 150 faultless questions out of a random sample of 200 questions per benchmark, with a total of n = 750, preferring accuracy over large sample sizes. |
| Hardware Specification | No | The paper mentions the LLMs tested (GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, BLOOM) and their temperature settings, but it does not specify any particular hardware (e.g., GPU, CPU models, memory) used for running the experiments. |
| Software Dependencies | Yes | All statistical analyses were performed using Python (version 3.11.4). The SciPy library (version 1.13.1) was used for statistical computations, while visualizations were created with Matplotlib (version 3.7.1) and Seaborn (version 0.12.2). |
| Experiment Setup | Yes | For all experiments, LLM temperature parameters were set to 0, or 0.00001 when 0 was not permitted. ... When the studies used several pre- or suffixes as a basis for their claim, such as in the Emotion Prompting study where 11 different emotional stimuli were used, we randomly selected one of them for each task using a seed. |
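The paper reports using SciPy 1.13.1 for its statistical computations when testing whether a prompt engineering technique significantly changes accuracy. A minimal sketch of such a comparison is below; the specific test (Fisher's exact test on a 2x2 contingency table) and the counts are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: comparing baseline vs. prompted accuracy on the n = 150
# hand-picked questions per benchmark. Counts are hypothetical.
from scipy.stats import fisher_exact

n = 150                 # questions per benchmark, as reported in the paper
baseline_correct = 112  # hypothetical: correct answers without the technique
prompted_correct = 118  # hypothetical: correct answers with the technique

# 2x2 contingency table: rows = condition, columns = correct / incorrect.
table = [
    [baseline_correct, n - baseline_correct],
    [prompted_correct, n - prompted_correct],
]
odds_ratio, p_value = fisher_exact(table)
significant = p_value < 0.05
```

With small accuracy deltas on 150 items, such a test typically fails to reach significance, which is consistent with the paper's overall finding.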
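The seeded selection of one stimulus per task described in the setup can be sketched as follows; the `stimuli` list and the `pick_stimulus` helper are hypothetical stand-ins, since the actual 11 emotional stimuli come from the original Emotion Prompting study.

```python
import random

# Hypothetical placeholders for the 11 emotional stimuli used in the
# Emotion Prompting study; the real phrasings are in that paper.
stimuli = [f"stimulus_{i}" for i in range(11)]

def pick_stimulus(task_id: str, seed: int = 42) -> str:
    """Deterministically pick one stimulus for a given task.

    Seeding a fresh Random instance per task makes the choice
    reproducible across runs without affecting global random state.
    """
    rng = random.Random(f"{seed}-{task_id}")
    return rng.choice(stimuli)
```

Because the RNG is re-seeded from the task identifier, re-running the experiment assigns each question the same stimulus, which is what makes the selection replicable.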