SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Authors: Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, Daniel Khashabi

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses."
Researcher Affiliation | Academia | Johns Hopkins University, EMAIL
Pseudocode | No | The paper describes its methodology in text and illustrates the phases in figures such as Figure 1, but it contains no dedicated pseudocode or algorithm block.
Open Source Code | No | The paper references third-party tools and their repositories (e.g., lm-evaluation-harness, llm judge) but does not provide specific access information or an explicit statement about releasing its own source code for the methodology described.
Open Datasets | Yes | "We assess our hypothesis on a diverse set of tasks including GSM8K (Cobbe et al. 2021) for math, TriviaQA (Joshi et al. 2017) for world knowledge, TruthfulQA (Lin, Hilton, and Evans 2022) for truthfulness in question answering, and MT-Bench (Zheng et al. 2023a) for instruction following."
Dataset Splits | Yes | Table 1 ("Configuration of experimental tasks") specifies the split each subset originates from and the number of evaluation instances (#Eval) per task: GSM8K, Test, 1319; TriviaQA, Val, 17944; MT-Bench, Test, 160; TruthfulQA, Val, 817.
Hardware Specification | No | "The GPUs for conducting experiments were provided by the DSAI cluster." This statement is too general and names no specific hardware models or detailed specifications.
Software Dependencies | No | The paper mentions 'lm-evaluation-harness', 'llm judge', and model versions such as 'GPT-3.5-turbo-0125', but does not specify version numbers for the programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used in its own implementation.
Experiment Setup | Yes | "During the generation phase, we use the default hyperparameter specified in lm-eval-harness for all tasks, except for temperature, which we have adjusted to 0.7. We use an above 0 temperature to obtain distinct generations upon multiple rounds of sampling. At the same time, during the discrimination phase, we set the temperature to 0 to avoid any randomness."
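The two-phase decoding setup quoted above can be sketched in a few lines; this is a minimal illustration of the reported temperature settings, and the function and constant names are assumptions, not the authors' actual code.

```python
# Sketch of the paper's reported two-phase decoding configuration.
# Names below are illustrative; only the temperature values come from
# the paper (0.7 for generation, 0 for discrimination).

GENERATION_CONFIG = {"temperature": 0.7}      # >0 so repeated sampling yields distinct generations
DISCRIMINATION_CONFIG = {"temperature": 0.0}  # 0 to avoid any randomness when discriminating

def sampling_config(phase: str) -> dict:
    """Return the decoding settings used for a given experimental phase."""
    if phase == "generation":
        return GENERATION_CONFIG
    if phase == "discrimination":
        return DISCRIMINATION_CONFIG
    raise ValueError(f"unknown phase: {phase!r}")
```

All other decoding hyperparameters are left at the lm-eval-harness defaults, per the quoted setup.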