Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only
Authors: Jihan Yao, Wenxuan Ding, Shangbin Feng, Lucy Lu Wang, Yulia Tsvetkov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with such wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while improving overall model calibration. |
| Researcher Affiliation | Collaboration | Jihan Yao¹, Wenxuan Ding², Shangbin Feng¹, Lucy Lu Wang¹˒³, Yulia Tsvetkov¹; ¹University of Washington, ²The University of Texas at Austin, ³Allen Institute for AI |
| Pseudocode | Yes | Full details of wrong-over-wrong dataset construction are available in Algorithm 1. ... Algorithm 1: D_WoW generation pipeline |
| Open Source Code | Yes | Code and data are publicly available at https://github.com/yaojh18/Varying-Shades-of-Wrong. |
| Open Datasets | Yes | Knowledge Crosswords (KC) (Ding et al., 2023) is a multiple-choice structured knowledge reasoning benchmark... NLGraph (NLG) (Wang et al., 2023a) is a graph reasoning benchmark... Bio Generation (BG): LLMs are asked to generate a biography... COM2 (Fang et al., 2024) is a multiple-choice commonsense reasoning benchmark... HellaSwag (Zellers et al., 2019)... Chess Puzzle (Lichess Team, 2023)... SciBench (Wang et al., 2024b)... MedMCQA (Pal et al., 2022) |
| Dataset Splits | Yes | We sample 625, 625, 625, and 380 questions from each dataset, each split into training sets Dtrain, validation sets Dval, and test sets Dtest with an approximately 8:1:1 ratio. ... We sample 125 questions from the official validation split and split them into validation, test sets with a 1:1 ratio. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware specifications like GPU models, CPU models, or memory details used for running experiments. |
| Software Dependencies | No | We employ the Unsloth and Transformers libraries for preference optimization. ... We employ open and proprietary LLMs for experiments spanning different scales and access levels: LLAMA3-8B (Dubey et al., 2024), GPT-3.5, GPT-4O (Achiam et al., 2023), MISTRAL-7B (Jiang et al., 2023), GEMINI-FLASH, GEMINI-PRO (Team et al., 2023), and GEMMA-7B (Team et al., 2024). (Library and model names are given, but no version numbers are pinned.) |
| Experiment Setup | Yes | We employ a temperature of 1.0 and a max generation length of 1024. ... We conduct QLoRA fine-tuning (Dettmers et al., 2023) on LLAMA3-8B using the collected wrong-over-wrong preferences through DPO. ... We apply grid search on learning rate (1e-4, 5e-5, 1e-5), learning rate scheduler (cosine, cosine with restarts, reduce-lr-on-plateau), weight decay (0, 1e-5, 1e-3), and number of train epochs (1, 3, 5) for main experiments and right-over-wrong alignment experiments. We use random seed = 42 for all of our experiments. ... In Table 1, we use batch size = 5 for all score methods due to optimal empirical results. |
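The reported dataset splits (sampling 625 or 380 questions per dataset, then dividing them into train/validation/test at an approximate 8:1:1 ratio with seed 42) can be sketched as below. This is an illustrative reconstruction under those stated numbers, not the authors' released code; the function name and ratio handling are assumptions.

```python
import random

def split_dataset(questions, ratio=(8, 1, 1), seed=42):
    """Shuffle and split items into train/val/test at an approximate ratio.

    Hypothetical helper mirroring the paper's reported ~8:1:1 split
    with random seed 42; not the authors' actual pipeline.
    """
    rng = random.Random(seed)
    items = list(questions)
    rng.shuffle(items)
    total = sum(ratio)
    n_train = len(items) * ratio[0] // total
    n_val = len(items) * ratio[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder absorbs rounding
    return train, val, test

# e.g. one of the 625-question datasets
train, val, test = split_dataset(range(625))
print(len(train), len(val), len(test))  # 500 62 63
```

Because integer division cannot hit 8:1:1 exactly for 625 items, the split is only approximate, which matches the paper's "approximately 8:1:1 ratio" wording.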