Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision

Authors: Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To test this, we simulate unreliable demonstrations and comparison feedback using small LMs and time-constrained humans. We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT. To address this, we propose iterative label refinement (ILR) as an alternative to RLHF. ILR improves the SFT data by using comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrains the model via SFT on the updated data. SFT+ILR outperforms SFT+DPO on several tasks with unreliable supervision (math, coding, and safe instruction-following).
Researcher Affiliation Academia Yaowen Ye (The University of Hong Kong); Cassidy Laidlaw, Jacob Steinhardt (University of California, Berkeley)
Pseudocode Yes We present the pseudo-code of ILR in Algorithm 1.
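The abstract describes ILR at a high level: use comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrain via SFT on the updated data. A minimal Python sketch of that loop is below; it is an illustration of the described procedure, not the paper's Algorithm 1. The callables `train_sft`, `generate`, and `prefer_candidate`, and the use of `alpha` as the fraction of labels proposed for replacement per round, are our assumptions.

```python
import random

def iterative_label_refinement(dataset, train_sft, generate, prefer_candidate,
                               num_rounds=3, alpha=0.15, seed=0):
    """Sketch of ILR (hypothetical interface, not the paper's Algorithm 1).

    dataset:          list of (prompt, label) pairs, labels possibly unreliable
    train_sft:        callable that fits a model via SFT on the current dataset
    generate:         callable (model, prompt) -> candidate label
    prefer_candidate: comparison feedback; True if the model-generated
                      candidate should replace the existing label
    alpha:            assumed here to be the fraction of labels proposed for
                      replacement each round
    """
    rng = random.Random(seed)
    dataset = list(dataset)
    model = train_sft(dataset)                 # initial SFT on noisy labels
    for _ in range(num_rounds):
        n_propose = max(1, int(alpha * len(dataset)))
        for i in rng.sample(range(len(dataset)), n_propose):
            prompt, label = dataset[i]
            candidate = generate(model, prompt)
            if prefer_candidate(prompt, candidate, label):
                dataset[i] = (prompt, candidate)   # refine the label
        model = train_sft(dataset)             # retrain via SFT on updated data
    return model, dataset
```

Note the contrast with DPO: feedback here edits the SFT dataset rather than shaping a preference-optimization objective, which is the paper's central claim about robustness under weak supervision.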
Open Source Code Yes Our code and data are available at https://github.com/helloelwin/iterative-label-refinement.
Open Datasets Yes Datasets. We test SFT+DPO with unreliable feedback on three text generation tasks: mathematical problem-solving using GSM8K (Cobbe et al., 2021), SQL code generation with BIRD (Li et al., 2024), and safe instruction following with SaferPaca (Bianchi et al., 2023), which is a mix of the Alpaca dataset (Taori et al., 2023) and refusal demonstrations to unsafe instructions.
Dataset Splits Yes For GSM8K, we parse the numerical answer following #### at the end of each response and compute exact match accuracy by comparing it with the ground truth. For BIRD, we measure execution accuracy by running the generated code on corresponding test databases, following Li et al. (2024). For SaferPaca, we follow Li et al. (2023) and use GPT-4o (OpenAI, 2024) to compute win rate against reference answers.
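The GSM8K evaluation described above (parse the number after `####`, then compare against ground truth) can be sketched as follows. The paper does not give its exact parser, so the regex and function names here are our assumptions.

```python
import re

# Matches the final answer marker used by GSM8K, e.g. "... #### 1,234"
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")

def extract_gsm8k_answer(response: str):
    """Parse the numerical answer following '####' in a GSM8K response.

    Returns the answer as a float, or None if no answer marker is found.
    """
    match = _ANSWER_RE.search(response)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

def exact_match(pred_response: str, gold_response: str) -> bool:
    """Exact-match accuracy for one example: both answers parse and agree."""
    pred = extract_gsm8k_answer(pred_response)
    gold = extract_gsm8k_answer(gold_response)
    return pred is not None and pred == gold
```

A response with no `####` marker counts as incorrect rather than raising an error, which is the usual convention for this metric.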
Hardware Specification No No specific hardware details (GPU models, CPU models, etc.) are provided in the paper. It mentions training various language models (Gemma 2B, Mistral 7B, Meta Llama 3 70B) and using vLLM for sampling but not the underlying hardware.
Software Dependencies No The paper mentions using DPOTrainer in Hugging Face Transformers (Wolf et al., 2020) but does not specify a version number for the library. It also mentions Adam and AdaFactor optimizers, and LoRA, which are techniques rather than specific software packages with versions.
Experiment Setup Yes Table 1: Training Hyperparameters for Each Dataset (GSM8K: Epoch 2, Batch Size 32, Max Answer Token 256; BIRD: Epoch 2, Batch Size 32, Max Answer Token 256; SaferPaca: Epoch 4, Batch Size 32, Max Answer Token 512). We set learning rate to 5e-4 for Gemma 2B and 1e-4 for Mistral 7B and Meta Llama 3 70B across all tasks. We use the Adam (Kingma, 2014) optimizer for Gemma 2B and Mistral 7B and AdaFactor (Shazeer & Stern, 2018) for Meta Llama 3 70B. We enable gradient checkpointing and use gradient accumulation with a mini-batch size of 1 for all models. In all experiments we set α = 0.15.
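The quoted hyperparameters can be collected into a small config helper. This is a hedged transcription of Table 1 and the surrounding text; the dictionary keys, the function `training_config`, and the model identifiers are our naming, not the paper's.

```python
# Per-task hyperparameters from Table 1 (key names are ours).
TASK_HPARAMS = {
    "gsm8k":     {"epochs": 2, "batch_size": 32, "max_answer_tokens": 256},
    "bird":      {"epochs": 2, "batch_size": 32, "max_answer_tokens": 256},
    "saferpaca": {"epochs": 4, "batch_size": 32, "max_answer_tokens": 512},
}

# Per-model learning rate and optimizer as stated in the text.
MODEL_HPARAMS = {
    "gemma-2b":    {"lr": 5e-4, "optimizer": "adam"},
    "mistral-7b":  {"lr": 1e-4, "optimizer": "adam"},
    "llama-3-70b": {"lr": 1e-4, "optimizer": "adafactor"},
}

ALPHA = 0.15  # reported value of α; its precise role is defined in the paper

def training_config(task: str, model: str) -> dict:
    """Merge task and model hyperparameters with the shared training settings
    (gradient checkpointing, gradient accumulation with mini-batch size 1)."""
    cfg = {**TASK_HPARAMS[task], **MODEL_HPARAMS[model]}
    cfg.update({"gradient_checkpointing": True, "micro_batch_size": 1})
    return cfg
```

With gradient accumulation, the effective batch size of 32 is reached by accumulating 32 mini-batches of size 1 before each optimizer step.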