Self-Consistency Preference Optimization
Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason E Weston, Jane Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments using Llama-3 8B models (Dubey et al., 2024), we show that even without access to any gold answers during training, two iterations of unsupervised SCPO improves zero-shot accuracy of the base model by 22.74% and 5.26% (absolute) on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) respectively, closely matching the performance (< 1% difference) of the supervised baseline from Pang et al. (2024). |
| Researcher Affiliation | Collaboration | 1Meta FAIR 2UNC Chapel Hill 3New York University. Correspondence to: Archiki Prasad <EMAIL>. |
| Pseudocode | No | No. The paper describes the method's steps in paragraph text and uses a diagram (Figure 1) to illustrate the process, but it does not include a structured pseudocode or algorithm block. |
| Open Source Code | No | No. The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate the effectiveness of SCPO on a range of math and logical reasoning datasets: GSM8K (Cobbe et al., 2021) contains a train/test split of 7.5K/1.3K grade school math word problems. MATH (Hendrycks et al., 2021) is a dataset of challenging high-school math competitions that contains a train/test split of 7.5K/5K problems, respectively. Zebra Logic (Dziri et al., 2024) is a logical reasoning benchmark. |
| Dataset Splits | Yes | GSM8K (Cobbe et al., 2021) contains a train/test split of 7.5K/1.3K grade school math word problems. For the purpose of this work, we split the train set into a train/dev split with 6.7K/0.8K problems respectively. The overall data split becomes 6.7K/0.8K/1.3K in the train/dev/test set, respectively. MATH (Hendrycks et al., 2021) ... we reserve 10% of samples from the train set to create a held-out dev set for model selection and hyperparameter tuning, resulting in our final train/dev/test splits with 6.7K/0.8K/5K problems, respectively. |
| Hardware Specification | No | No. The paper mentions the use of 'Llama-3 8B models' but does not specify any details regarding the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments. |
| Software Dependencies | No | No. The paper does not list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | When generating multiple responses or new problems from the LLM, we sample with temperature of 0.7 and top-p = 0.9. For GSM8K and MATH, we set k = 8. ... All models are trained for 10 epochs with a learning rate of 5e-6 (cosine scheduling), and effective batch size of 16. Lastly, we set DPO loss term hyperparameter β = 0.5 and NLL regularization coefficient α = 1. |
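The setup quoted above (sampling k = 8 responses, then DPO training with β = 0.5 and NLL coefficient α = 1) can be sketched in two pieces: a self-consistency vote that picks the most- and least-frequent final answers as the chosen/rejected pair, and the DPO loss with an NLL regularizer on the chosen response. This is a minimal sketch under stated assumptions: the function names, the tie-breaking behavior, and the exact NLL normalization are illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def vote_preference(final_answers):
    """Self-consistency vote over k sampled final answers.

    Returns the most frequent answer (chosen) and the least frequent
    answer (rejected). Tie-breaking here follows Counter ordering and
    is an assumption of this sketch.
    """
    ranked = Counter(final_answers).most_common()
    chosen, _ = ranked[0]
    rejected, _ = ranked[-1]
    return chosen, rejected

def dpo_nll_loss(pc, pr, rc, rr, beta=0.5, alpha=1.0):
    """DPO loss plus NLL regularization on the chosen response.

    pc / pr: policy log-probs of the chosen / rejected response;
    rc / rr: reference-model log-probs of the same responses.
    Uses the hyperparameters quoted above (beta = 0.5, alpha = 1).
    """
    margin = beta * ((pc - rc) - (pr - rr))
    dpo = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    nll = -pc                            # NLL on the chosen sequence
    return dpo + alpha * nll
```

For example, with k = 8 sampled answers `["7", "7", "7", "5", "5", "3", "7", "5"]`, the vote selects `"7"` as chosen and `"3"` as rejected; the resulting pair's log-probabilities would then feed `dpo_nll_loss` during training.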