Leveraging Sparsity for Sample-Efficient Preference Learning: A Theoretical Perspective

Authors: Yunzhen Yao, Lie He, Michael Gastpar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy. Our experimental evaluations demonstrate that sparsity-aware estimators outperform widely used baselines in reward modeling, evaluated on both synthetic datasets and LLM alignment datasets using popular language models.
Researcher Affiliation | Academia | 1 LINX, EPFL, Lausanne, Switzerland; 2 Key Laboratory of Interdisciplinary Research of Computation and Economics (Shanghai University of Finance and Economics), Ministry of Education, China; 3 School of Computing and Artificial Intelligence, Shanghai University of Finance and Economics, Shanghai, China.
Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code can be found at this link: https://github.com/yaoyzh/SparsePreferenceLearning
Open Datasets | Yes | We train reward models using the rm-static dataset (Bai et al., 2022; https://huggingface.co/datasets/Dahoas/rm-static) and the SHP dataset (Ethayarajh et al., 2022; https://huggingface.co/datasets/stanfordnlp/SHP).
Dataset Splits | No | The paper mentions using datasets for training and evaluating test accuracy but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit standard splits).
Hardware Specification | No | The paper mentions the use of pretrained language models (e.g., Pythia-70M, Llama-3.2-1B) but does not specify any hardware details, such as the GPU or CPU models used to run the experiments.
Software Dependencies | No | The paper mentions using the SciPy package (Virtanen et al., 2020) and that the code is based on DeepSpeed-Chat (Yao et al., 2023), but it does not provide specific version numbers for these software components.
Experiment Setup | Yes | The learning rate is set to 1e-5, and the weight decay is set to 0.1. The batch size is 8 for Pythia-70M, 16 for Llama-3.2-1B, and 32 for Llama-3.2-3B, and the training runs for 1 epoch. The regularization hyperparameter β for the ℓ1-regularized method is selected from the range 10^[-4.5:0.5:0] ∪ {2, 4, 8}. Each β value, including β = 0, is evaluated across 5 trials with random seeds in {0, 1, 2, 3, 4} for Pythia-70M and Llama-3.2-1B, and 3 trials with random seeds in {0, 1, 2} for Llama-3.2-3B.
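For concreteness, the reported setup can be sketched as a small configuration script. This is a hypothetical illustration, not the authors' code: the dataset identifiers, learning rate, batch sizes, and seeds are taken from the report, while the reading of the β grid as 10^k for k = -4.5, -4.0, ..., 0.0 plus the extra points {2, 4, 8} is an assumption about the garbled range notation.

```python
# Hypothetical sketch of the experiment grid described in the report.
# Assumption: "10^[-4.5:0.5:0]" denotes 10**k for k = -4.5, -4.0, ..., 0.0.

# Dataset identifiers from the report's footnotes, usable with
# `datasets.load_dataset` from the Hugging Face `datasets` library.
DATASETS = {
    "rm-static": "Dahoas/rm-static",  # Bai et al., 2022
    "SHP": "stanfordnlp/SHP",         # Ethayarajh et al., 2022
}

LEARNING_RATE = 1e-5
WEIGHT_DECAY = 0.1
BATCH_SIZES = {"Pythia-70M": 8, "Llama-3.2-1B": 16, "Llama-3.2-3B": 32}
SEEDS = {
    "Pythia-70M": [0, 1, 2, 3, 4],
    "Llama-3.2-1B": [0, 1, 2, 3, 4],
    "Llama-3.2-3B": [0, 1, 2],
}

# Candidate values for the l1-regularization strength beta: beta = 0,
# a log-spaced grid from 10**-4.5 up to 10**0, and the points {2, 4, 8}.
log_grid = [10 ** (k / 2) for k in range(-9, 1)]
BETAS = [0.0] + log_grid + [2.0, 4.0, 8.0]

# One training run per (model, beta, seed) combination.
runs = [(model, beta, seed)
        for model, model_seeds in SEEDS.items()
        for beta in BETAS
        for seed in model_seeds]
```

Under this reading, each model is swept over 14 β values (β = 0, ten log-spaced values, and 2, 4, 8), giving 5 × 14 runs each for Pythia-70M and Llama-3.2-1B and 3 × 14 runs for Llama-3.2-3B.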