Self-Consistency Preference Optimization

Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason E Weston, Jane Yu

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments using Llama-3 8B models (Dubey et al., 2024), we show that even without access to any gold answers during training, two iterations of unsupervised SCPO improve zero-shot accuracy of the base model by 22.74% and 5.26% (absolute) on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) respectively, closely matching the performance (< 1% difference) of the supervised baseline from Pang et al. (2024).
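The row above refers to SCPO training without gold answers, relying instead on self-consistency across sampled responses. As a rough illustration only (not the authors' code; function and variable names are hypothetical), a majority vote over k sampled answers can rank responses by agreement and yield a chosen/rejected pair:

```python
from collections import Counter

def vote_preference_pair(responses, final_answers):
    """Given k sampled responses and their extracted final answers,
    return (chosen, rejected) by self-consistency: a response whose
    answer is most frequent vs. one whose answer is least frequent."""
    counts = Counter(final_answers)
    # Score each response by how often its final answer appears overall.
    scores = [counts[a] for a in final_answers]
    chosen = responses[scores.index(max(scores))]
    rejected = responses[scores.index(min(scores))]
    return chosen, rejected

# Toy example with k = 4 sampled responses:
resps = ["r1 -> 42", "r2 -> 42", "r3 -> 17", "r4 -> 42"]
answers = ["42", "42", "17", "42"]
print(vote_preference_pair(resps, answers))  # ('r1 -> 42', 'r3 -> 17')
```

Tie-breaking and how answers are extracted from free-form responses are assumptions here; the paper's Figure 1 describes the actual pipeline.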
Researcher Affiliation | Collaboration | 1Meta FAIR, 2UNC Chapel Hill, 3New York University. Correspondence to: Archiki Prasad <EMAIL>.
Pseudocode | No | The paper describes the method's steps in paragraph text and uses a diagram (Figure 1) to illustrate the process, but it does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We evaluate the effectiveness of SCPO on a range of math and logical reasoning datasets: GSM8K (Cobbe et al., 2021) contains a train/test split of 7.5K/1.3K grade school math word problems. MATH (Hendrycks et al., 2021) is a dataset of challenging high-school math competitions that contains a train/test split of 7.5K/5K problems, respectively. Zebra Logic (Dziri et al., 2024) is a logical reasoning benchmark.
Dataset Splits | Yes | GSM8K (Cobbe et al., 2021) contains a train/test split of 7.5K/1.3K grade school math word problems. For the purpose of this work, we split the train set into a train/dev split with 6.7K/0.8K problems respectively. The overall data split becomes 6.7K/0.8K/1.3K in the train/dev/test set, respectively. MATH (Hendrycks et al., 2021) ... we reserve 10% of samples from the train set to create a held-out dev set for model selection and hyperparameter tuning, resulting in our final train/dev/test splits with 6.7K/0.8K/5K problems, respectively.
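The split described above (reserve ~10% of the 7.5K training problems as a held-out dev set, leaving roughly 6.7K/0.8K) can be sketched as below. The seed and rounding behavior are assumptions, since the paper does not specify them:

```python
import random

def make_dev_split(train_items, dev_fraction=0.1, seed=0):
    """Reserve a fraction of the training set as a held-out dev split,
    e.g. 7.5K problems -> ~6.7K train / ~0.8K dev as in the paper."""
    items = list(train_items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_dev = round(len(items) * dev_fraction)
    return items[n_dev:], items[:n_dev]

train, dev = make_dev_split(range(7500))
print(len(train), len(dev))  # 6750 750
```

The dev set is then used only for model selection and hyperparameter tuning, never for training.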
Hardware Specification | No | The paper mentions the use of 'Llama-3 8B models' but does not specify any details regarding the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | When generating multiple responses or new problems from the LLM, we sample with temperature 0.7 and top-p = 0.9. For GSM8K and MATH, we set k = 8. ... All models are trained for 10 epochs with a learning rate of 5e-6 (cosine scheduling) and an effective batch size of 16. Lastly, we set the DPO loss hyperparameter β = 0.5 and the NLL regularization coefficient α = 1.
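The β and α values above plug into a DPO objective with an added NLL term on the chosen response, in the style of Pang et al. (2024). A minimal scalar sketch, assuming the standard DPO formulation; this is illustrative, not the authors' implementation (which would also handle length normalization and batching):

```python
import math

BETA = 0.5   # DPO loss hyperparameter, per the paper
ALPHA = 1.0  # NLL regularization coefficient, per the paper

def dpo_nll_loss(pi_w, pi_l, ref_w, ref_l, beta=BETA, alpha=ALPHA):
    """Scalar DPO loss with NLL regularization on the chosen response.

    pi_w, pi_l: policy log-probs of the chosen / rejected responses.
    ref_w, ref_l: reference-model log-probs of the same responses.
    """
    # Reward margin between chosen and rejected, relative to the reference.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    nll = -pi_w  # negative log-likelihood of the chosen response
    return dpo + alpha * nll

# A wider chosen-over-rejected margin lowers the loss:
print(dpo_nll_loss(-10.0, -12.0, -11.0, -11.0))
print(dpo_nll_loss(-12.0, -10.0, -11.0, -11.0))
```

The NLL term keeps the policy's likelihood of the chosen response from degrading while the DPO term widens the chosen/rejected margin.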