Self-Consistency Preference Optimization

Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason E Weston, Jane Yu

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments using Llama-3 8B models (Dubey et al., 2024), we show that even without access to any gold answers during training, two iterations of unsupervised SCPO improve zero-shot accuracy of the base model by 22.74% and 5.26% (absolute) on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) respectively, closely matching the performance (< 1% difference) of the supervised baseline from Pang et al. (2024).
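The row above refers to SCPO training without gold answers, relying instead on self-consistency across sampled responses. As a rough illustration only (not the authors' code; function and variable names are hypothetical), a majority vote over k sampled answers can rank responses by agreement and yield a chosen/rejected pair:

```python
from collections import Counter

def vote_preference_pair(responses, final_answers):
    """Given k sampled responses and their extracted final answers,
    return (chosen, rejected) by self-consistency: a response whose
    answer is most frequent vs. one whose answer is least frequent."""
    counts = Counter(final_answers)
    # Score each response by how often its final answer appears overall.
    scores = [counts[a] for a in final_answers]
    chosen = responses[scores.index(max(scores))]
    rejected = responses[scores.index(min(scores))]
    return chosen, rejected

# Toy example with k = 4 sampled responses:
resps = ["r1 -> 42", "r2 -> 42", "r3 -> 17", "r4 -> 42"]
answers = ["42", "42", "17", "42"]
print(vote_preference_pair(resps, answers))  # ('r1 -> 42', 'r3 -> 17')
```

Tie-breaking and how answers are extracted from free-form responses are assumptions here; the paper's Figure 1 describes the actual pipeline.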
Researcher Affiliation | Collaboration | 1Meta FAIR, 2UNC Chapel Hill, 3New York University. Correspondence to: Archiki Prasad <EMAIL>.
Pseudocode | No | The paper describes the method's steps in paragraph text and uses a diagram (Figure 1) to illustrate the process, but it does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We evaluate the effectiveness of SCPO on a range of math and logical reasoning datasets: GSM8K (Cobbe et al., 2021) contains a train/test split of 7.5K/1.3K grade school math word problems. MATH (Hendrycks et al., 2021) is a dataset of challenging high-school math competitions that contains a train/test split of 7.5K/5K problems, respectively. Zebra Logic (Dziri et al., 2024) is a logical reasoning benchmark.
Dataset Splits | Yes | GSM8K (Cobbe et al., 2021) contains a train/test split of 7.5K/1.3K grade school math word problems. For the purpose of this work, we split the train set into a train/dev split with 6.7K/0.8K problems respectively. The overall data split becomes 6.7K/0.8K/1.3K in the train/dev/test set, respectively. MATH (Hendrycks et al., 2021) ... we reserve 10% of samples from the train set to create a held-out dev set for model selection and hyperparameter tuning, resulting in our final train/dev/test splits with 6.7K/0.8K/5K problems, respectively.
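The split described above (reserve ~10% of the 7.5K training problems as a held-out dev set, leaving roughly 6.7K/0.8K) can be sketched as below. The seed and rounding behavior are assumptions, since the paper does not specify them:

```python
import random

def make_dev_split(train_items, dev_fraction=0.1, seed=0):
    """Reserve a fraction of the training set as a held-out dev split,
    e.g. 7.5K problems -> ~6.7K train / ~0.8K dev as in the paper."""
    items = list(train_items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_dev = round(len(items) * dev_fraction)
    return items[n_dev:], items[:n_dev]

train, dev = make_dev_split(range(7500))
print(len(train), len(dev))  # 6750 750
```

The dev set is then used only for model selection and hyperparameter tuning, never for training.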
Hardware Specification | No | The paper mentions the use of 'Llama-3 8B models' but does not specify any details regarding the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | When generating multiple responses or new problems from the LLM, we sample with temperature 0.7 and top-p = 0.9. For GSM8K and MATH, we set k = 8. ... All models are trained for 10 epochs with a learning rate of 5e-6 (cosine scheduling) and an effective batch size of 16. Lastly, we set the DPO loss hyperparameter β = 0.5 and the NLL regularization coefficient α = 1.
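The β and α values above plug into a DPO objective with an added NLL term on the chosen response, in the style of Pang et al. (2024). A minimal scalar sketch, assuming the standard DPO formulation; this is illustrative, not the authors' implementation (which would also handle length normalization and batching):

```python
import math

BETA = 0.5   # DPO loss hyperparameter, per the paper
ALPHA = 1.0  # NLL regularization coefficient, per the paper

def dpo_nll_loss(pi_w, pi_l, ref_w, ref_l, beta=BETA, alpha=ALPHA):
    """Scalar DPO loss with NLL regularization on the chosen response.

    pi_w, pi_l: policy log-probs of the chosen / rejected responses.
    ref_w, ref_l: reference-model log-probs of the same responses.
    """
    # Reward margin between chosen and rejected, relative to the reference.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    nll = -pi_w  # negative log-likelihood of the chosen response
    return dpo + alpha * nll

# A wider chosen-over-rejected margin lowers the loss:
print(dpo_nll_loss(-10.0, -12.0, -11.0, -11.0))
print(dpo_nll_loss(-12.0, -10.0, -11.0, -11.0))
```

The NLL term keeps the policy's likelihood of the chosen response from degrading while the DPO term widens the chosen/rejected margin.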