Advancing LLM Reasoning Generalists with Preference Trees
Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Boji Shan, Zeyuan Liu, Jia Deng, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce EURUS, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B, Llama-3-8B, and Mixtral-8x22B, EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, EURUX-8X22B outperforms GPT-3.5 Turbo in reasoning across a comprehensive benchmark of 12 test sets covering five tasks. The strong performance of EURUS can be primarily attributed to ULTRAINTERACT, our newly-curated large-scale, high-quality training dataset specifically designed for complex reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Section 4 EVALUATION OF EURUS MODELS. Section 5 EVALUATION OF EURUS-RM-7B. Section 6 ABLATION STUDY. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2University of Illinois Urbana-Champaign 3Peking University 4Northeastern University 5ModelBest Inc. 6Renmin University of China 7Tencent |
| Pseudocode | No | The paper describes methods with steps (e.g., in Section 2.2 'DECOMPOSITION AND INTERACTION AT EACH TURN' and Figure 3 with code snippets), but it does not present any formal pseudocode or algorithm blocks. Figure 3 shows an example of an ULTRAINTERACT trajectory with generated text and code, but not a general algorithm block. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology or models (EURUS-series LLMs) described in the paper is released or available, nor does it provide a direct link to a code repository. It mentions 'releasing a high-quality multi-turn reasoning dataset ULTRAINTERACT' in the conclusion, which refers to data, not code. |
| Open Datasets | Yes | ULTRAINTERACT, our newly-curated large-scale, high-quality training dataset specifically designed for complex reasoning tasks. In short, we compiled this work by first synthesizing both SFT and preference datasets to improve the reasoning ability of open-source models (Section 2). (1) releasing a high-quality multi-turn reasoning dataset ULTRAINTERACT with preference trees |
| Dataset Splits | No | The paper describes the datasets used for evaluation (e.g., Human Eval, MBPP, Leet Code for coding), and also mentions ULTRAINTERACT as a training dataset. It states, 'We evaluate with pass@1 accuracy.' and 'All test sets except MATH are out-of-distribution to our models and most baselines.' It also describes a decontamination process for ULTRAINTERACT against test sets. However, it does not provide specific train/validation/test splits (percentages or counts) for its own ULTRAINTERACT dataset or how it was used across different stages of model training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running its experiments. |
| Software Dependencies | No | The paper mentions a 'Python interpreter' and 'Jupyter notebook environment' in Section 2.2 and Figure 3, but does not provide specific version numbers for Python or any other software libraries or dependencies used in their experiments. |
| Experiment Setup | Yes | Supervised Fine-Tuning. We finetune base models for 1 epoch with a 2e-5 learning rate and 0.1 warmup ratio using a cosine scheduler. Preference Learning. For hyperparameters, β is set to 0.1 for all algorithms, and λ+/λ− in KTO is set to 1.33 as recommended. We finetune models for 1 epoch with a 5e-7 learning rate and 0.1 warmup ratio using a cosine scheduler. Reward Modeling. We train the RM for 1 epoch with a 1e-5 learning rate. We also use a cosine scheduler with a warmup ratio of 0.1. |
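The reported setup can be collected into a minimal sketch. The config dict names are illustrative (the paper does not publish a training script), and `lr_at_step` is a standard linear-warmup + cosine-decay schedule assumed to match the paper's description of "0.1 warmup ratio using a cosine scheduler":

```python
import math

# Hyperparameters as reported in the paper; the dict structure is ours.
SFT_CONFIG = {"epochs": 1, "peak_lr": 2e-5, "warmup_ratio": 0.1, "scheduler": "cosine"}
PREFERENCE_CONFIG = {
    "epochs": 1, "peak_lr": 5e-7, "warmup_ratio": 0.1, "scheduler": "cosine",
    "beta": 0.1,               # shared beta across preference-learning algorithms
    "kto_lambda_ratio": 1.33,  # lambda_+ / lambda_- in KTO, "as recommended"
}
REWARD_MODEL_CONFIG = {"epochs": 1, "peak_lr": 1e-5, "warmup_ratio": 0.1, "scheduler": "cosine"}

def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` and the SFT config, the learning rate ramps linearly over the first 100 steps, peaks at 2e-5, and decays to zero by the final step.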