SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Authors: Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, Daniel Khashabi
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. |
| Researcher Affiliation | Academia | Johns Hopkins University |
| Pseudocode | No | The paper describes methodologies in text and provides figures like Figure 1 to illustrate phases, but it does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper references third-party tools and their repositories (e.g., lm-evaluation-harness, llm judge) but does not provide specific access information or an explicit statement about releasing its own source code for the methodology described. |
| Open Datasets | Yes | We assess our hypothesis on a diverse set of tasks including GSM8K (Cobbe et al. 2021) for math, Trivia QA (Joshi et al. 2017) for world knowledge, Truthful QA (Lin, Hilton, and Evans 2022) for truthfulness in question answering, and MT-Bench (Zheng et al. 2023a) for instruction following. |
| Dataset Splits | Yes | Table 1: Configuration of experimental tasks. Split specifies which subset the data originates from. #Eval indicates the number of instances used for evaluation. GSM8K: Test, 1319; Trivia QA: Val, 17944; MT-Bench: Test, 160; Truthful QA: Val, 817. |
| Hardware Specification | No | The GPUs for conducting experiments were provided by the DSAI cluster. This statement is too general and does not provide specific hardware models or detailed specifications. |
| Software Dependencies | No | The paper mentions using 'lm-evaluation-harness' and 'llm judge' and model versions like 'GPT-3.5-turbo-0125' but does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used for their own implementation. |
| Experiment Setup | Yes | During the generation phase, we use the default hyperparameters specified in lm-eval-harness for all tasks, except for temperature, which we have adjusted to 0.7. We use a temperature above 0 to obtain distinct generations across multiple rounds of sampling. During the discrimination phase, we set the temperature to 0 to avoid any randomness. |
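The temperature settings in the setup row follow standard temperature-scaled sampling: a temperature of 0.7 flattens the token distribution enough to yield distinct generations across sampling rounds, while a temperature of 0 reduces to deterministic greedy decoding. As a hedged illustration (not the paper's code; `sample_with_temperature` and the example logits are hypothetical), the mechanism can be sketched as:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Sample an index from unnormalized logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        # Discrimination-phase setting: no randomness, always the top-scoring option.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Scale logits by 1/temperature, then softmax into a probability distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical logits over three candidate tokens.
logits = [2.0, 1.5, 0.5]

# Generation phase (temperature 0.7): repeated sampling yields varied choices.
samples = {sample_with_temperature(logits, 0.7, random.Random(seed)) for seed in range(20)}

# Discrimination phase (temperature 0): deterministic greedy decoding.
greedy = sample_with_temperature(logits, 0)
```

With temperature 0.7 the repeated draws land on more than one candidate, while the temperature-0 call always returns the argmax, matching the paper's rationale for the two phases.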