SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Authors: Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, Daniel Khashabi

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses."
Researcher Affiliation | Academia | Johns Hopkins University, EMAIL
Pseudocode | No | The paper describes its methodology in text and illustrates the phases in figures such as Figure 1, but it contains no dedicated pseudocode or algorithm block.
Open Source Code | No | The paper references third-party tools and their repositories (e.g., lm-evaluation-harness, llm judge) but does not provide specific access information or an explicit statement about releasing its own source code for the methodology described.
Open Datasets | Yes | "We assess our hypothesis on a diverse set of tasks including GSM8K (Cobbe et al. 2021) for math, TriviaQA (Joshi et al. 2017) for world knowledge, TruthfulQA (Lin, Hilton, and Evans 2022) for truthfulness in question answering, and MT-Bench (Zheng et al. 2023a) for instruction following."
Dataset Splits | Yes | Table 1 ("Configuration of experimental tasks") specifies the split each subset originates from and the number of evaluation instances (#Eval) per task: GSM8K, Test, 1319; TriviaQA, Val, 17944; MT-Bench, Test, 160; TruthfulQA, Val, 817.
Hardware Specification | No | "The GPUs for conducting experiments were provided by the DSAI cluster." This statement is too general and names no specific hardware models or detailed specifications.
Software Dependencies | No | The paper mentions 'lm-evaluation-harness', 'llm judge', and model versions such as 'GPT-3.5-turbo-0125', but does not specify version numbers for the programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used in its own implementation.
Experiment Setup | Yes | "During the generation phase, we use the default hyperparameter specified in lm-eval-harness for all tasks, except for temperature, which we have adjusted to 0.7. We use an above 0 temperature to obtain distinct generations upon multiple rounds of sampling. At the same time, during the discrimination phase, we set the temperature to 0 to avoid any randomness."
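The two-phase decoding setup quoted above can be sketched in a few lines; this is a minimal illustration of the reported temperature settings, and the function and constant names are assumptions, not the authors' actual code.

```python
# Sketch of the paper's reported two-phase decoding configuration.
# Names below are illustrative; only the temperature values come from
# the paper (0.7 for generation, 0 for discrimination).

GENERATION_CONFIG = {"temperature": 0.7}      # >0 so repeated sampling yields distinct generations
DISCRIMINATION_CONFIG = {"temperature": 0.0}  # 0 to avoid any randomness when discriminating

def sampling_config(phase: str) -> dict:
    """Return the decoding settings used for a given experimental phase."""
    if phase == "generation":
        return GENERATION_CONFIG
    if phase == "discrimination":
        return DISCRIMINATION_CONFIG
    raise ValueError(f"unknown phase: {phase!r}")
```

All other decoding hyperparameters are left at the lm-eval-harness defaults, per the quoted setup.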