Prompt Engineering Techniques for Language Model Reasoning Lack Replicability
Authors: Laurène Vaugrante, Mathias Niepert, Thilo Hagendorff
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, and BLOOM with six prompt engineering techniques: Chain-of-Thought, Sandbagging, Emotion Prompting, Re-Reading, Rephrase-and-Respond (RaR), and Expert Prompting. We applied them to manually double-checked subsets of reasoning benchmarks including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested, highlighting, among others, several methodological weaknesses in previous research. |
| Researcher Affiliation | Academia | Laurène Vaugrante, Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart; Mathias Niepert, Institute for Artificial Intelligence, University of Stuttgart; Thilo Hagendorff, Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart |
| Pseudocode | No | The paper describes methodologies and experimental steps in prose, such as in section 2.3 'Experiments' and sections 3.1-3.6 discussing each prompt engineering technique. It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is accessible here: https://github.com/Laurene-v/replicating_PET. By lowering the practical barrier to replication and promoting iterative experimentation, we aim to foster a culture of methodological transparency and empirical verification in prompt-engineering research. The datasets and code generated during this study are available in the Replication Crisis In LLM Evaluation repository on the Open Science Framework (OSF) at https://osf.io/hcygf/?view_only=fe25a85157734f68882777404aeb655c and at https://github.com/Laurene-v/replicating_PET. |
| Open Datasets | Yes | To replicate the claimed impact of the selected prompt engineering techniques on LLM reasoning abilities, we selected five different benchmarks, each measuring a different type of reasoning: CommonsenseQA (Talmor et al., 2019), CRT (Hagendorff et al., 2023), NumGLUE (Mishra et al., 2022), ScienceQA (Lu et al., 2022) and StrategyQA (Geva et al., 2021). The datasets and code generated during this study are available in the Replication Crisis In LLM Evaluation repository on the Open Science Framework (OSF) at https://osf.io/hcygf/?view_only=fe25a85157734f68882777404aeb655c and at https://github.com/Laurene-v/replicating_PET. |
| Dataset Splits | Yes | Therefore, we chose to hand-pick (through rule-based filtering and manual checks) 150 faultless questions out of a random sample of 200 questions per benchmark, with a total of n = 750, preferring accuracy over large sample sizes. |
| Hardware Specification | No | The paper mentions the LLMs tested (GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, BLOOM) and their temperature settings, but it does not specify any particular hardware (e.g., GPU, CPU models, memory) used for running the experiments. |
| Software Dependencies | Yes | All statistical analyses were performed using Python (version 3.11.4). The SciPy library (version 1.13.1) was used for statistical computations, while visualizations were created with Matplotlib (version 3.7.1) and Seaborn (version 0.12.2). |
| Experiment Setup | Yes | For all experiments, LLM temperature parameters were set to 0, or 0.00001 when 0 was not permitted. ... When the studies used several pre- or suffixes as a basis for their claim, such as in the Emotion Prompting study where 11 different emotional stimuli were used, we randomly selected one of them for each task using a seed. |
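The paper reports using SciPy 1.13.1 for its statistical computations when testing whether a prompt engineering technique significantly changes accuracy. A minimal sketch of such a comparison is below; the specific test (Fisher's exact test on a 2x2 contingency table) and the counts are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: comparing baseline vs. prompted accuracy on the n = 150
# hand-picked questions per benchmark. Counts are hypothetical.
from scipy.stats import fisher_exact

n = 150                 # questions per benchmark, as reported in the paper
baseline_correct = 112  # hypothetical: correct answers without the technique
prompted_correct = 118  # hypothetical: correct answers with the technique

# 2x2 contingency table: rows = condition, columns = correct / incorrect.
table = [
    [baseline_correct, n - baseline_correct],
    [prompted_correct, n - prompted_correct],
]
odds_ratio, p_value = fisher_exact(table)
significant = p_value < 0.05
```

With small accuracy deltas on 150 items, such a test typically fails to reach significance, which is consistent with the paper's overall finding.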
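The seeded selection of one stimulus per task described in the setup can be sketched as follows; the `stimuli` list and the `pick_stimulus` helper are hypothetical stand-ins, since the actual 11 emotional stimuli come from the original Emotion Prompting study.

```python
import random

# Hypothetical placeholders for the 11 emotional stimuli used in the
# Emotion Prompting study; the real phrasings are in that paper.
stimuli = [f"stimulus_{i}" for i in range(11)]

def pick_stimulus(task_id: str, seed: int = 42) -> str:
    """Deterministically pick one stimulus for a given task.

    Seeding a fresh Random instance per task makes the choice
    reproducible across runs without affecting global random state.
    """
    rng = random.Random(f"{seed}-{task_id}")
    return rng.choice(stimuli)
```

Because the RNG is re-seeded from the task identifier, re-running the experiment assigns each question the same stimulus, which is what makes the selection replicable.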