3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Authors: Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our findings are supported by experiments on both a controlled toy model and real-world LLM tasks, including mathematical problem-solving and instruction following. [...] Our experimental approach begins with the design of a toy model to quickly validate our hypotheses, followed by a rigorous test of the actual performance of real LLMs on tasks such as mathematical problem solving and instruction following.
Researcher Affiliation Collaboration Yuzi Yan^{1,3}, Yibo Miao^{2,3}, Jialian Li^3, Yipin Zhang^3, Jian Xie^3, Zhijie Deng^2, Dong Yan^3; 1 Department of Electronic Engineering, Tsinghua University; 2 Shanghai Jiao Tong University; 3 Baichuan AI. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes mathematical formulations and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes The code is provided in the supplementary material. [...] We provide the evaluation prompt in our code in the supplementary material. [...] We also provide the scoring and evaluation rule-based criteria in our code in the supplementary material.
Open Datasets Yes For mathematical reasoning, we used MATH (Hendrycks et al., 2021) as the main dataset for both training and testing. [...] we selected SuperCLUE-Math (Xu et al., 2020) [...] sourced from HH-RLHF (Bai et al., 2022) and UltraFeedback (Cui et al., 2024). [...] We provide part of the in-house datasets in the supplementary material to clarify the format and the content.
Dataset Splits Yes We compiled the MATH dataset, which contains 5,826 pairs {x, a+, a−}. We randomly selected 2,000 samples from the original test set to serve as the test set for MATH. [...] Table 6: Statistics of the datasets used. MATH: train set 5,826; test set 2,000.
Hardware Specification Yes All experiments were conducted on a cluster consisting of 40 A100 GPUs.
Software Dependencies No The paper mentions using the Adam optimizer and other DPO variants (IPO, SLiC) but does not specify any software libraries or their version numbers, apart from the general optimizer name.
Experiment Setup Yes Following the implementation of Rafailov et al. (2024), we use the Adam optimizer with the learning rate set to 5e-7. The most sensitive parameter in the DPO algorithm is β (the learning rate also matters, but less so). Here we use the default setting aligned with the original DPO paper. The β is set to 0.1 and the learning rate is set to 5e-6, which is the best hyperparameter set as far as we explored. We set the batch size to 80 and the number of gradient accumulation steps to 2. The number of training epochs was set to 1. In IPO training, we set η to 0.1. In SLiC training, we set δ = 5 and η = 0.1.
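To make the quoted setup concrete, the per-example DPO objective that these hyperparameters control can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' released code: the function name and the summed log-probability inputs are hypothetical, and only β = 0.1 comes from the quoted settings.

```python
import math

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are the summed log-probabilities of the chosen (a+) and rejected (a-)
    responses under the trained policy and the frozen reference model.
    beta=0.1 follows the setting quoted from the paper; everything else is a sketch.
    """
    logits = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    # -log sigmoid(logits) = softplus(-logits), written stably for large |logits|
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Sanity check: when the policy equals the reference, the margin is zero
# and the loss is log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))  # 0.6931...
```

Note how β scales the implicit reward margin: with the paper's β = 0.1, the loss gradient is gentle, which is consistent with β being reported as the most sensitive hyperparameter.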