3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Authors: Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our findings are supported by experiments on both a controlled toy model and real-world LLM tasks, including mathematical problem-solving and instruction following. [...] Our experimental approach begins with the design of a toy model to quickly validate our hypotheses, followed by a rigorous test of the actual performance of real LLMs on tasks such as mathematical problem solving and instruction following.
Researcher Affiliation Collaboration Yuzi Yan^{1,3}, Yibo Miao^{2,3}, Jialian Li^3, Yipin Zhang^3, Jian Xie^3, Zhijie Deng^2, Dong Yan^3; 1 Department of Electronic Engineering, Tsinghua University; 2 Shanghai Jiao Tong University; 3 Baichuan AI. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes mathematical formulations and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes The code is provided in the supplementary material. [...] We provide the evaluation prompt in our code in the supplementary material. [...] We also provide the scoring and evaluation rule-based criteria in our code in the supplementary material.
Open Datasets Yes For mathematical reasoning, we used MATH (Hendrycks et al., 2021) as the main dataset for both training and testing. [...] we selected SuperCLUE-Math (Xu et al., 2020) [...] sourced from HH-RLHF (Bai et al., 2022) and UltraFeedback (Cui et al., 2024). [...] We provide part of the in-house datasets in the supplementary material to clarify the format and the content.
Dataset Splits Yes We compiled the MATH dataset, which contains 5,826 pairs {x, a+, a−}. We randomly selected 2,000 samples from the original test set to serve as the test set for MATH. [...] Table 6: Statistics of the datasets used. MATH: train set 5,826; test set 2,000.
Hardware Specification Yes All experiments were conducted on a cluster consisting of 40 A100 GPUs.
Software Dependencies No The paper mentions using the Adam optimizer and other DPO variants (IPO, SLiC) but does not specify any software libraries or their version numbers, apart from the general optimizer name.
Experiment Setup Yes Following the implementation of Rafailov et al. (2024), we use the Adam optimizer with the learning rate set to 5e-7. The most sensitive parameter in the DPO algorithm is β (the learning rate also matters, but less so). Here we use the default setting aligned with the original DPO paper. The β is set to 0.1 and the learning rate is set to 5e-6, which is the best hyperparameter set as far as we explored. We set the batch size to 80 and the number of gradient accumulation steps to 2. The number of training epochs was set to 1. In IPO training, we set η to 0.1. In SLiC training, we set δ = 5 and η = 0.1.
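To make the quoted setup concrete, the per-example DPO objective that these hyperparameters control can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' released code: the function name and the summed log-probability inputs are hypothetical, and only β = 0.1 comes from the quoted settings.

```python
import math

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are the summed log-probabilities of the chosen (a+) and rejected (a-)
    responses under the trained policy and the frozen reference model.
    beta=0.1 follows the setting quoted from the paper; everything else is a sketch.
    """
    logits = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    # -log sigmoid(logits) = softplus(-logits), written stably for large |logits|
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Sanity check: when the policy equals the reference, the margin is zero
# and the loss is log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))  # 0.6931...
```

Note how β scales the implicit reward margin: with the paper's β = 0.1, the loss gradient is gentle, which is consistent with β being reported as the most sensitive hyperparameter.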