Preference Optimization for Reasoning with Pseudo Feedback

Authors: Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F Chen, Shafiq Joty, Furu Wei

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. On GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on DeepSeek-Coder-7B-v1.5, we achieve a score of 24.3 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
Researcher Affiliation Collaboration Nanyang Technological University; Microsoft Research; I2R, A*STAR; Georgia Institute of Technology; Salesforce Research
Pseudocode No No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described in prose and illustrated with diagrams like Figure 1.
Open Source Code Yes The code is released at: https://github.com/microsoft/unilm/tree/master/PFPO
Open Datasets Yes For mathematical reasoning, we followed Tang et al. (2024) to create 800K prompts... NuminaMath (https://huggingface.co/datasets/AI-MO/NuminaMath-CoT). For code generation, we have collected the problems from the training set of APPS (Hendrycks et al., 2021a), Magicoder (Wei et al., 2024) and xCodeEval (Khan et al., 2024b)...
Dataset Splits Yes For validation, we randomly sampled 2,000 question-solution pairs from the training set of MWPBench (Tang et al., 2024)... We randomly sampled 500 questions from the training set of APPS for validation. The prompts are divided into non-overlapping splits for iterative training. (Appendix A.2) For NuminaMath-790K, collecting all prompts for a single iteration of DPO training can make it more challenging to avoid policy shifting... we split the whole dataset into several parts for iterative training. During each iteration, we use around 160K prompts to collect solutions, construct pseudo feedback, and optimize the policy model.
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup Yes All hyper-parameters are listed in Table 5.
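Since the paper itself provides no pseudocode, the following minimal Python sketch illustrates one way the pseudo-feedback idea described above could work for math prompts: sampled solutions are labeled by agreement with the self-consistency majority answer (standing in for gold feedback), then agreeing and disagreeing solutions are paired into DPO-style preference data. All function names and the solution-dictionary schema are assumptions for illustration, not taken from the paper or its released code.

```python
import random
from collections import Counter

def pseudo_label(solutions):
    """Label each sampled solution by whether its final answer agrees
    with the self-consistency majority vote over all samples -- a
    pseudo-feedback signal usable when no gold answer is available."""
    majority, _ = Counter(s["answer"] for s in solutions).most_common(1)[0]
    return [(s, s["answer"] == majority) for s in solutions]

def build_preference_pairs(labeled, max_pairs=4, rng=random):
    """Pair agreeing (chosen) with disagreeing (rejected) solutions
    to form DPO-style (chosen, rejected) preference tuples."""
    chosen = [s for s, ok in labeled if ok]
    rejected = [s for s, ok in labeled if not ok]
    rng.shuffle(chosen)
    rng.shuffle(rejected)
    return list(zip(chosen, rejected))[:max_pairs]
```

For coding prompts, the same pairing step would apply, with the labeling step replaced by pass/fail execution against model-generated test cases rather than majority voting.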