Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only

Authors: Jihan Yao, Wenxuan Ding, Shangbin Feng, Lucy Lu Wang, Yulia Tsvetkov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with such wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while improving overall model calibration.
Researcher Affiliation | Collaboration | Jihan Yao 1, Wenxuan Ding 2, Shangbin Feng 1, Lucy Lu Wang 1,3, Yulia Tsvetkov 1 — 1 University of Washington, 2 The University of Texas at Austin, 3 Allen Institute for AI
Pseudocode | Yes | Full details of wrong-over-wrong dataset construction are available in Algorithm 1. ... Algorithm 1 DWoW generation pipeline
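The quoted Algorithm 1 is not reproduced in this report; the following is a minimal, hypothetical sketch of what a DWoW-style generation pipeline could look like. Here `answer_fn` (an LLM sampler) and `wrongness_fn` (a wrongness scorer) are illustrative stand-ins, not the authors' actual implementation.

```python
from itertools import combinations

def build_wrong_over_wrong_pairs(questions, answer_fn, wrongness_fn, n_samples=5):
    """Sketch of a DWoW-style pipeline: sample several answers per question,
    keep the wrong ones, and pair them so the less-wrong answer is preferred.
    `answer_fn` and `wrongness_fn` are hypothetical stand-ins for the paper's
    LLM sampling and wrongness-scoring steps."""
    pairs = []
    for q in questions:
        # Keep only wrong answers (here, wrongness > 0 means "not fully correct").
        wrong = [a for a in (answer_fn(q) for _ in range(n_samples))
                 if wrongness_fn(q, a) > 0]
        # Every pair of wrong answers with unequal wrongness yields a preference.
        for a, b in combinations(wrong, 2):
            wa, wb = wrongness_fn(q, a), wrongness_fn(q, b)
            if wa == wb:
                continue  # ties carry no preference signal
            chosen, rejected = (a, b) if wa < wb else (b, a)
            pairs.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting `{prompt, chosen, rejected}` triples match the input format commonly used for DPO-style preference optimization.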
Open Source Code | Yes | Code and data are publicly available at https://github.com/yaojh18/Varying-Shades-of-Wrong.
Open Datasets | Yes | Knowledge Crosswords (KC) (Ding et al., 2023) is a multiple-choice structured knowledge reasoning benchmark... NLGraph (NLG) (Wang et al., 2023a) is a graph reasoning benchmark... Bio Generation (BG) LLMs are asked to generate a biography... COM2 (Fang et al., 2024) is a multiple-choice commonsense reasoning benchmark... HellaSwag (Zellers et al., 2019)... Chess Puzzle (Lichess Team, 2023)... SciBench (Wang et al., 2024b)... MedMCQA (Pal et al., 2022)
Dataset Splits | Yes | We sample 625, 625, 625, and 380 questions from each dataset, each split into training sets Dtrain, validation sets Dval, and test sets Dtest with an approximately 8:1:1 ratio. ... We sample 125 questions from the official validation split and split them into validation and test sets with a 1:1 ratio.
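The 8:1:1 split described above can be sketched as follows; this assumes a simple seeded shuffle, which the paper does not specify.

```python
import random

def split_8_1_1(items, seed=42):
    """Shuffle and split items into train/val/test with an approximately
    8:1:1 ratio. The shuffling strategy is an assumption; the paper only
    states the ratio."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = round(n * 0.8)
    n_val = round(n * 0.1)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

For the 625-question datasets this yields roughly 500/62/63 examples, consistent with an "approximately 8:1:1" split.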
Hardware Specification | No | The paper does not explicitly mention specific hardware specifications like GPU models, CPU models, or memory details used for running experiments.
Software Dependencies | No | We employ Unsloth and Transformers libraries for preference optimization. ... We employ three open and proprietary LLMs for experiments spanning different scales and access levels. First, we use LLAMA3-8B (Dubey et al., 2024), GPT-3.5, and GPT-4O (Achiam et al., 2023) ... MISTRAL-7B (Jiang et al., 2023), GEMINI-FLASH, GEMINI-PRO (Team et al., 2023), GEMMA-7B (Team et al., 2024).
Experiment Setup | Yes | We employ a temperature of 1.0 and a max generation length of 1024. ... We conduct QLoRA fine-tuning (Dettmers et al., 2023) on LLAMA3-8B using the collected wrong-over-wrong preferences through DPO. ... We apply grid search on learning rate (1e-4, 5e-5, 1e-5), learning rate scheduler (cosine, cosine with restarts, and reduce lr on plateau), weight decay (0, 1e-5, 1e-3), and number of train epochs (1, 3, 5) for main experiments and right-over-wrong alignment experiments. We use random seed = 42 for all of our experiments. ... In Table 1, we use batch size = 5 for all score methods due to optimal empirical results.
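The grid search described above spans 3 x 3 x 3 x 3 = 81 configurations. A sketch of enumerating that grid is shown below; the actual QLoRA/DPO training calls (via Unsloth and Transformers) are omitted, and the dictionary keys are illustrative names, not the authors' configuration schema.

```python
from itertools import product

# Hyperparameter grid as reported in the paper's experiment setup.
# Key names are hypothetical; only the values come from the paper.
GRID = {
    "learning_rate": [1e-4, 5e-5, 1e-5],
    "lr_scheduler": ["cosine", "cosine_with_restarts", "reduce_lr_on_plateau"],
    "weight_decay": [0.0, 1e-5, 1e-3],
    "num_train_epochs": [1, 3, 5],
}

def grid_configs(grid):
    """Yield every configuration in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Each yielded dict would be passed to a single training run, with the best configuration selected on the validation split.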