Leveraging Sparsity for Sample-Efficient Preference Learning: A Theoretical Perspective

Authors: Yunzhen Yao, Lie He, Michael Gastpar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy. Our experimental evaluations demonstrate that sparsity-aware estimators outperform widely used baselines in reward modeling, evaluated on both synthetic datasets and LLM alignment datasets using popular language models.
Researcher Affiliation | Academia | 1 LINX, EPFL, Lausanne, Switzerland; 2 Key Laboratory of Interdisciplinary Research of Computation and Economics (Shanghai University of Finance and Economics), Ministry of Education, China; 3 School of Computing and Artificial Intelligence, Shanghai University of Finance and Economics, Shanghai, China.
Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code can be found at this link: https://github.com/yaoyzh/SparsePreferenceLearning
Open Datasets | Yes | We train reward models using the rm-static dataset (Bai et al., 2022; https://huggingface.co/datasets/Dahoas/rm-static) and the SHP dataset (Ethayarajh et al., 2022; https://huggingface.co/datasets/stanfordnlp/SHP).
Dataset Splits | No | The paper mentions using datasets for training and evaluating test accuracy but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit standard splits).
Hardware Specification | No | The paper mentions the use of pretrained language models (e.g., Pythia-70M, Llama-3.2-1B) but does not specify any hardware details, such as the GPU or CPU models used to run the experiments.
Software Dependencies | No | The paper mentions using the SciPy package (Virtanen et al., 2020) and that the code is based on DeepSpeed-Chat (Yao et al., 2023), but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | The learning rate is set to 1e-5, and the weight decay is set to 0.1. The batch size is 8 for Pythia-70M, 16 for Llama-3.2-1B, and 32 for Llama-3.2-3B, and the training runs for 1 epoch. The regularization hyperparameter β for the ℓ1-regularized method is selected from the range 10^[-4.5:0.5:0] ∪ {2, 4, 8}. Each β value, including β = 0, is evaluated across 5 trials with random seeds in {0, 1, 2, 3, 4} for Pythia-70M and Llama-3.2-1B, and 3 trials with random seeds in {0, 1, 2} for Llama-3.2-3B.
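For concreteness, the reported setup can be sketched as a small configuration script. This is a hypothetical illustration, not the authors' code: the dataset identifiers, learning rate, batch sizes, and seeds are taken from the report, while the reading of the β grid as 10^k for k = -4.5, -4.0, ..., 0.0 plus the extra points {2, 4, 8} is an assumption about the garbled range notation.

```python
# Hypothetical sketch of the experiment grid described in the report.
# Assumption: "10^[-4.5:0.5:0]" denotes 10**k for k = -4.5, -4.0, ..., 0.0.

# Dataset identifiers from the report's footnotes, usable with
# `datasets.load_dataset` from the Hugging Face `datasets` library.
DATASETS = {
    "rm-static": "Dahoas/rm-static",  # Bai et al., 2022
    "SHP": "stanfordnlp/SHP",         # Ethayarajh et al., 2022
}

LEARNING_RATE = 1e-5
WEIGHT_DECAY = 0.1
BATCH_SIZES = {"Pythia-70M": 8, "Llama-3.2-1B": 16, "Llama-3.2-3B": 32}
SEEDS = {
    "Pythia-70M": [0, 1, 2, 3, 4],
    "Llama-3.2-1B": [0, 1, 2, 3, 4],
    "Llama-3.2-3B": [0, 1, 2],
}

# Candidate values for the l1-regularization strength beta: beta = 0,
# a log-spaced grid from 10**-4.5 up to 10**0, and the points {2, 4, 8}.
log_grid = [10 ** (k / 2) for k in range(-9, 1)]
BETAS = [0.0] + log_grid + [2.0, 4.0, 8.0]

# One training run per (model, beta, seed) combination.
runs = [(model, beta, seed)
        for model, model_seeds in SEEDS.items()
        for beta in BETAS
        for seed in model_seeds]
```

Under this reading, each model is swept over 14 β values (β = 0, ten log-spaced values, and 2, 4, 8), giving 5 × 14 runs each for Pythia-70M and Llama-3.2-1B and 3 × 14 runs for Llama-3.2-3B.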