Preference Optimization for Reasoning with Pseudo Feedback

Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F Chen, Shafiq Joty, Furu Wei

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. On GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on DeepSeek-Coder-7B-v1.5, we achieve a score of 24.3 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
Researcher Affiliation Collaboration Nanyang Technological University; Microsoft Research; I2R, A*STAR; Georgia Institute of Technology; Salesforce Research
Pseudocode No No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described in prose and illustrated with diagrams like Figure 1.
Open Source Code Yes The code is released at: https://github.com/microsoft/unilm/tree/master/PFPO
Open Datasets Yes For mathematical reasoning, we followed Tang et al. (2024) to create 800K prompts... NuminaMath (https://huggingface.co/datasets/AI-MO/NuminaMath-CoT). For code generation, we have collected the problems from the training set of APPS (Hendrycks et al., 2021a), Magicoder (Wei et al., 2024) and xCodeEval (Khan et al., 2024b)...
Dataset Splits Yes For validation, we randomly sampled 2,000 question-solution pairs from the training set of MWPBench (Tang et al., 2024)... We randomly sampled 500 questions from the training set of APPS for validation. The prompts are divided into non-overlapping splits for iterative training. (Appendix A.2) For NuminaMath-790K, collecting all prompts for a single iteration of DPO training can make it more challenging to avoid policy shifting... we split the whole dataset into several parts for iterative training. During each iteration, we use around 160K prompts to collect solutions, construct pseudo feedback, and optimize the policy model.
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup Yes All hyper-parameters are listed in Table 5.
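Since the paper itself provides no pseudocode, the following minimal Python sketch illustrates one way the pseudo-feedback idea described above could work for math prompts: sampled solutions are labeled by agreement with the self-consistency majority answer (standing in for gold feedback), then agreeing and disagreeing solutions are paired into DPO-style preference data. All function names and the solution-dictionary schema are assumptions for illustration, not taken from the paper or its released code.

```python
import random
from collections import Counter

def pseudo_label(solutions):
    """Label each sampled solution by whether its final answer agrees
    with the self-consistency majority vote over all samples -- a
    pseudo-feedback signal usable when no gold answer is available."""
    majority, _ = Counter(s["answer"] for s in solutions).most_common(1)[0]
    return [(s, s["answer"] == majority) for s in solutions]

def build_preference_pairs(labeled, max_pairs=4, rng=random):
    """Pair agreeing (chosen) with disagreeing (rejected) solutions
    to form DPO-style (chosen, rejected) preference tuples."""
    chosen = [s for s, ok in labeled if ok]
    rejected = [s for s, ok in labeled if not ok]
    rng.shuffle(chosen)
    rng.shuffle(rejected)
    return list(zip(chosen, rejected))[:max_pairs]
```

For coding prompts, the same pairing step would apply, with the labeling step replaced by pass/fail execution against model-generated test cases rather than majority voting.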