Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability
Authors: Michael M. Jerge, David Evans
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Motivated by these findings, we conduct an experiment to comprehensively evaluate a few methods with a group of representative benchmarks and a diverse set of models (Section 4). Our main findings are that claims about the effectiveness of inference-time LLM methods are fragile to both the models and benchmarks used. Our work combines an analysis of evaluation practices in the literature with an empirical study designed to highlight pitfalls in current evaluation methodologies. |
| Researcher Affiliation | Academia | Michael Jerge (EMAIL), Department of Computer Science, University of Virginia; David Evans (EMAIL), Department of Computer Science, University of Virginia |
| Pseudocode | No | The paper provides descriptions of the methods and examples of prompts used (Appendix C) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our automated system, along with all of the data and code needed to reproduce our work, is available under an open-source license in this public repository: https://github.com/mmjerge/LLM-Evaluation-Framework. |
| Open Datasets | Yes | For our analysis, we selected five of the most commonly used benchmarks in the literature: GSM8K, MMLU, AQuA, SVAMP, and TruthfulQA. These benchmarks are described in Section 3.2, and include three popular mathematical reasoning benchmarks (GSM8K, AQuA, and SVAMP) and two broad language understanding benchmarks (MMLU and TruthfulQA). We also include GSM-Symbolic (Mirzadeh et al., 2024), a relatively new benchmark... We include two domain-focused benchmarks, MedQA (Jin et al., 2021) and LegalBench (Guha et al., 2023; Koreeda & Manning, 2021; Hendrycks et al., 2021b; Wang et al., 2023b; Wilson et al., 2016; Zheng et al., 2021; Zimmeck et al., 2019; Ravichander et al., 2019; Holzenberger & Van Durme, 2021; Lippi et al., 2019). We include two additional benchmarks that were not commonly used, but were chosen to further evaluate generalization. Sorting 032 (Besta et al., 2024) evaluates a model's ability to sort a sequence of numbers in ascending order. Document Merging was introduced in the Graph of Thoughts paper (Besta et al., 2024). |
| Dataset Splits | Yes | Using 100 templates from GSM8K, it generates 50 samples per template, resulting in 5,000 total examples for each benchmark variant. The dataset includes different difficulty levels, from simpler versions with clauses removed (GSM-Symbolic-M1) to more complex versions with additional clauses (GSM-Symbolic-P1, P2), and a special variant (GSM-NoOp) that tests models' ability to identify relevant information. For the Document Merging benchmark... The final score is the harmonic mean of these two values, and the average rating across multiple document sets is used as the evaluation metric. Due to the high cost of running some of the methods (which we discuss in Section 5.3), for each of the methods we randomly sample 150 data points from each of the benchmarks. |
| Hardware Specification | No | The paper discusses the 'high cost of running some of the methods' and uses API calls as a measure of cost, indicating use of external LLM APIs (e.g., OpenAI, Anthropic) rather than specific local hardware for the experiments. No specific GPU, CPU, or other hardware details are provided. |
| Software Dependencies | No | The paper mentions using a 'custom LangChain agent' for ReAct (Chase, 2022) and refers to various original method repositories for implementations. However, it does not specify version numbers for programming languages (e.g., Python), libraries (e.g., LangChain), or frameworks (e.g., PyTorch, TensorFlow) used in the experimental setup. |
| Experiment Setup | Yes | For evaluating models on multiple-choice benchmarks, we used a Chain of Thought implementation based on the approach outlined in the referenced repository (Yao et al., 2023a). For more generative benchmarks, we used the method outlined in Besta et al. (2024). Each of the models was configured with a temperature of 0.7 and a maximum token limit of 1024 to allow for more elaborate reasoning chains. The prompt included multiple stages, with the model first analyzing the problem, laying out intermediate thought processes, and then computing or inferring the final result. We used "Let's think step by step" as a leading instruction to guide the model in decomposing tasks into manageable chunks. The method was applied by generating 3, 5, and 10 diverse reasoning paths for each task. For our experiments, ReAct was implemented using a custom LangChain agent based on the original reference repository (Chase, 2022). Format errors were corrected during execution to ensure the output followed the required structure. The model was configured with a temperature of 0.5, a maximum token limit of 512, and up to two retries in case of errors. For benchmarks based on multiple-choice question answering solutions, we used a Tree of Thought implementation based on the approach outlined in the referenced repository (Yao et al., 2023c). The default parameters prompted the model to generate sequences 10 times and evaluate each generated state five times, while the BFS algorithm was configured to retain the top three states at each step to explore further. For our experiments, we used the open-source repository (Blach et al., 2023). For our experiments we used the repositories (Du et al., 2023) for proprietary models and (Gauss5930, 2023) for open-source models. Furthermore, we utilized three agents, all based on the same model, and conducted two rounds of debate per task. |
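The self-consistency setup quoted in the Experiment Setup row (sampling 3, 5, or 10 diverse reasoning paths at temperature 0.7 with a 1024-token limit, then aggregating the final answers) can be sketched as below. This is a minimal illustration, not the paper's actual implementation: the function name, the configuration dictionary, and the majority-vote aggregation step are assumptions based on the standard self-consistency recipe.

```python
from collections import Counter

# Sampling configuration reported in the paper's setup (illustrative constants):
# temperature 0.7, maximum of 1024 tokens, and 3, 5, or 10 reasoning paths per task.
COT_CONFIG = {"temperature": 0.7, "max_tokens": 1024}
NUM_PATHS = (3, 5, 10)

def majority_vote(final_answers):
    """Self-consistency aggregation: each reasoning path yields one final
    answer; the most frequent answer across paths is returned."""
    counts = Counter(answer.strip() for answer in final_answers)
    answer, _ = counts.most_common(1)[0]
    return answer
```

In use, each of the N reasoning paths would be generated by a separate sampled completion under `COT_CONFIG`, and only the extracted final answers are passed to `majority_vote`.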