Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Authors: Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, Zhiqiang Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through systematic experimentation, we validate this principle with three key findings... Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. These results together illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO
Researcher Affiliation | Collaboration | ¹MBZUAI, ²Tencent Inc., ³HKUST (Guangzhou), ⁴SJTU. Correspondence to: Liu Liu <EMAIL>, Peilin Zhao <EMAIL>, Zhiqiang Xu <EMAIL>.
Pseudocode | Yes | Appendix A: Pseudocode for the Instantiated Algorithm: Selective DPO (Algorithm 1).
Open Source Code | Yes | Code is available at https://github.com/glorgao/SelectiveDPO
Open Datasets | Yes | We use UltraFeedback-binarized², where darker colors indicate more training steps needed for model comprehension. Results from 10 runs show consistent learning order across different models (Jiang et al., 2023; AI@Meta, 2024; Team et al., 2024) varying in size (2B–9B), training stage, and data sampling. This consistency confirms that examples vary in difficulty, allowing us to discuss difficult examples without debating various definitions of difficulty. ²https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized. We use UltraFeedback-binarized, a widely adopted alignment dataset (Tunstall et al., 2023; Meng et al., 2024; Zhou et al., 2024; Pattnaik et al., 2024), and Argilla dpo-mix-7k³, a small but high-quality dataset. ³https://huggingface.co/datasets/argilla/dpo-mix-7k
Dataset Splits | Yes | To compute the validation loss, we partition D equally into D̂ and D \ D̂, train on one partition, evaluate on the other, and finally output average results over three runs. ... The training dataset is randomly split into two partitions. ... The easiest examples, comprising the lowest τ percent of validation losses, are selected for alignment training. ... For the evaluation in the next section, we set τ = 50 for the UltraFeedback-binarized dataset, based on insights from Figure 3.
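The selection step quoted above (keep the lowest τ percent of examples by validation loss) can be sketched as follows. This is a minimal illustration; the function name `select_easiest` and its interface are assumptions, not the authors' released code.

```python
import numpy as np

def select_easiest(val_losses, tau=50):
    """Return indices of the easiest tau percent of examples,
    i.e. those with the lowest cross-partition validation loss."""
    losses = np.asarray(val_losses, dtype=float)
    k = int(len(losses) * tau / 100)   # number of examples to keep
    order = np.argsort(losses)          # easiest (lowest loss) first
    return np.sort(order[:k])           # keep original dataset order

# with tau = 50, half of the dataset survives for alignment training
kept = select_easiest([0.9, 0.1, 0.5, 0.3], tau=50)
```

In the paper's setup the losses themselves come from training on one random half of the dataset and evaluating on the other, averaged over three runs, before this percentile filter is applied.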
Hardware Specification | Yes | All training experiments in this paper were conducted on compute nodes equipped with 8 H100 GPUs. To facilitate reproduction with limited computational resources, we also provide key benchmarking results for selected models trained using 4 A100 40G GPUs with LoRA.
Software Dependencies | No | The paper does not provide specific software names with version numbers, such as Python or specific library versions.
Experiment Setup | Yes | Following prior work, we set β = 0.01 (Zhou et al., 2024). The learning rate is swept for DPO with random ordering and directly applied to DPO with other settings. We conduct the alignment with one epoch following Meng et al. (2024). ... Appendix C.2 (SFT Hyper-Parameters) and C.3 (Key Hyper-Parameters for Alignment) provide detailed tables (Tables 2–6) listing Batch Size, Learning Rate, Epoch, and Optimizer for various models and datasets. For example, Table 2 lists 'Batch Size 128', 'Learning Rate 2e-5', 'Epoch 1', 'Optimizer Adam'. Appendix C.4 (LoRA Configuration for Alignment) details 'lora_alpha 16', 'lora_dropout 0.05', 'lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj'.
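For context on the β = 0.01 setting, the standard DPO objective (Rafailov et al., 2023) for a single preference pair can be sketched as below. The log-probabilities are sequence log-likelihoods under the trained policy and the frozen reference model; this is an illustrative sketch, not the paper's training code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.01):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin is the implicit reward gap between the chosen
    and rejected responses relative to the reference model."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With β = 0.01 the margin is scaled down heavily, so the policy is only weakly pushed away from the reference model; a zero margin gives the chance-level loss log 2.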