Advancing LLM Reasoning Generalists with Preference Trees
Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Boji Shan, Zeyuan Liu, Jia Deng, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce EURUS, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B, Llama-3-8B, and Mixtral-8x22B, EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, EURUX-8X22B outperforms GPT-3.5 Turbo in reasoning across a comprehensive benchmark of 12 test sets covering five tasks. The strong performance of EURUS can be primarily attributed to ULTRAINTERACT, our newly-curated large-scale, high-quality training dataset specifically designed for complex reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Section 4 EVALUATION OF EURUS MODELS. Section 5 EVALUATION OF EURUS-RM-7B. Section 6 ABLATION STUDY. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2University of Illinois Urbana-Champaign 3Peking University 4Northeastern University 5ModelBest Inc. 6Renmin University of China 7Tencent |
| Pseudocode | No | The paper describes methods with steps (e.g., in Section 2.2 'DECOMPOSITION AND INTERACTION AT EACH TURN' and Figure 3 with code snippets), but it does not present any formal pseudocode or algorithm blocks. Figure 3 shows an example of an ULTRAINTERACT trajectory with generated text and code, but not a general algorithm block. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology or models (EURUS-series LLMs) described in the paper is released or available, nor does it provide a direct link to a code repository. It mentions 'releasing a high-quality multi-turn reasoning dataset ULTRAINTERACT' in the conclusion, which refers to data, not code. |
| Open Datasets | Yes | ULTRAINTERACT, our newly-curated large-scale, high-quality training dataset specifically designed for complex reasoning tasks. In short, we compiled this work by first synthesizing both SFT and preference datasets to improve the reasoning ability of open-source models (Section 2). (1) releasing a high-quality multi-turn reasoning dataset ULTRAINTERACT with preference trees |
| Dataset Splits | No | The paper describes the datasets used for evaluation (e.g., Human Eval, MBPP, Leet Code for coding), and also mentions ULTRAINTERACT as a training dataset. It states, 'We evaluate with pass@1 accuracy.' and 'All test sets except MATH are out-of-distribution to our models and most baselines.' It also describes a decontamination process for ULTRAINTERACT against test sets. However, it does not provide specific train/validation/test splits (percentages or counts) for its own ULTRAINTERACT dataset or how it was used across different stages of model training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running its experiments. |
| Software Dependencies | No | The paper mentions a 'Python interpreter' and 'Jupyter notebook environment' in Section 2.2 and Figure 3, but does not provide specific version numbers for Python or any other software libraries or dependencies used in their experiments. |
| Experiment Setup | Yes | Supervised Fine-Tuning. We finetune base models for 1 epoch with a 2e-5 learning rate and 0.1 warmup ratio using a cosine scheduler. Preference Learning. For hyperparameters, β is set to 0.1 for all algorithms, and λ+/λ− in KTO is set to 1.33 as recommended. We finetune models for 1 epoch with a 5e-7 learning rate and 0.1 warmup ratio using a cosine scheduler. Reward Modeling. We train the RM for 1 epoch with a 1e-5 learning rate. We also use a cosine scheduler with a warmup ratio of 0.1. |
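The reported setup can be collected into a minimal sketch. The config dict names are illustrative (the paper does not publish a training script), and `lr_at_step` is a standard linear-warmup + cosine-decay schedule assumed to match the paper's description of "0.1 warmup ratio using a cosine scheduler":

```python
import math

# Hyperparameters as reported in the paper; the dict structure is ours.
SFT_CONFIG = {"epochs": 1, "peak_lr": 2e-5, "warmup_ratio": 0.1, "scheduler": "cosine"}
PREFERENCE_CONFIG = {
    "epochs": 1, "peak_lr": 5e-7, "warmup_ratio": 0.1, "scheduler": "cosine",
    "beta": 0.1,               # shared beta across preference-learning algorithms
    "kto_lambda_ratio": 1.33,  # lambda_+ / lambda_- in KTO, "as recommended"
}
REWARD_MODEL_CONFIG = {"epochs": 1, "peak_lr": 1e-5, "warmup_ratio": 0.1, "scheduler": "cosine"}

def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` and the SFT config, the learning rate ramps linearly over the first 100 steps, peaks at 2e-5, and decays to zero by the final step.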