Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only
Authors: Jihan Yao, Wenxuan Ding, Shangbin Feng, Lucy Lu Wang, Yulia Tsvetkov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with such wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while improving overall model calibration. |
| Researcher Affiliation | Collaboration | Jihan Yao¹, Wenxuan Ding², Shangbin Feng¹, Lucy Lu Wang¹˒³, Yulia Tsvetkov¹; ¹University of Washington, ²The University of Texas at Austin, ³Allen Institute for AI |
| Pseudocode | Yes | Full details of wrong-over-wrong dataset construction are available in Algorithm 1. ... Algorithm 1: D_WoW generation pipeline |
| Open Source Code | Yes | Code and data are publicly available at https://github.com/yaojh18/Varying-Shades-of-Wrong. |
| Open Datasets | Yes | Knowledge Crosswords (KC) (Ding et al., 2023) is a multiple-choice structured knowledge reasoning benchmark... NLGraph (NLG) (Wang et al., 2023a) is a graph reasoning benchmark... Bio Generation (BG): LLMs are asked to generate a biography... COM2 (Fang et al., 2024) is a multiple-choice commonsense reasoning benchmark... HellaSwag (Zellers et al., 2019)... Chess Puzzle (Lichess Team, 2023)... SciBench (Wang et al., 2024b)... MedMCQA (Pal et al., 2022) |
| Dataset Splits | Yes | We sample 625, 625, 625, and 380 questions from each dataset, each split into training sets Dtrain, validation sets Dval, and test sets Dtest with an approximately 8:1:1 ratio. ... We sample 125 questions from the official validation split and split them into validation, test sets with a 1:1 ratio. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware specifications like GPU models, CPU models, or memory details used for running experiments. |
| Software Dependencies | No | We employ the Unsloth and Transformers libraries for preference optimization. ... We employ open and proprietary LLMs for experiments spanning different scales and access levels: LLAMA3-8B (Dubey et al., 2024), GPT-3.5, GPT-4O (Achiam et al., 2023), MISTRAL-7B (Jiang et al., 2023), GEMINI-FLASH, GEMINI-PRO (Team et al., 2023), and GEMMA-7B (Team et al., 2024). (Library and model names are given, but no version numbers are pinned.) |
| Experiment Setup | Yes | We employ a temperature of 1.0 and a max generation length of 1024. ... We conduct QLoRA fine-tuning (Dettmers et al., 2023) on LLAMA3-8B using the collected wrong-over-wrong preferences through DPO. ... We apply grid search on learning rate (1e-4, 5e-5, 1e-5), learning rate scheduler (cosine, cosine with restarts, reduce-lr-on-plateau), weight decay (0, 1e-5, 1e-3), and number of train epochs (1, 3, 5) for main experiments and right-over-wrong alignment experiments. We use random seed = 42 for all of our experiments. ... In Table 1, we use batch size = 5 for all score methods due to optimal empirical results. |
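The reported dataset splits (sampling 625 or 380 questions per dataset, then dividing them into train/validation/test at an approximate 8:1:1 ratio with seed 42) can be sketched as below. This is an illustrative reconstruction under those stated numbers, not the authors' released code; the function name and ratio handling are assumptions.

```python
import random

def split_dataset(questions, ratio=(8, 1, 1), seed=42):
    """Shuffle and split items into train/val/test at an approximate ratio.

    Hypothetical helper mirroring the paper's reported ~8:1:1 split
    with random seed 42; not the authors' actual pipeline.
    """
    rng = random.Random(seed)
    items = list(questions)
    rng.shuffle(items)
    total = sum(ratio)
    n_train = len(items) * ratio[0] // total
    n_val = len(items) * ratio[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder absorbs rounding
    return train, val, test

# e.g. one of the 625-question datasets
train, val, test = split_dataset(range(625))
print(len(train), len(val), len(test))  # 500 62 63
```

Because integer division cannot hit 8:1:1 exactly for 625 items, the split is only approximate, which matches the paper's "approximately 8:1:1 ratio" wording.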