TODO: Enhancing LLM Alignment with Ternary Preferences
Authors: Yuxiang Guo, Lu Yin, Bo Jiang, Jiaqi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments using MT Bench and benchmarks such as Piqa, ARC-c, and MMLU further demonstrate TODO's superior alignment performance. [...] We use Mistral-7B and Llama 3-8B to conduct experimental validation. |
| Researcher Affiliation | Collaboration | Yuxiang Guo (Meituan Inc., Beihang University), Lu Yin (Meituan Inc., University of Surrey), Bo Jiang (Beihang University), Jiaqi Zhang (Meituan Inc.) |
| Pseudocode | No | The paper provides mathematical derivations and gradient update equations (e.g., Equations 18 and 19), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The implementation details and datasets can be found in https://github.com/XXares/TODO. |
| Open Datasets | Yes | Table 1 shows a sample from the Ultrafeedback-binarized dataset (Tunstall et al.), available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized. [...] Reward Bench (Lambert et al.). [...] MT Bench (Zheng et al., 2023) and other popular benchmarks such as Piqa (Bisk et al., 2019), ARC-c, ARC-e (Clark et al., 2018), Hellaswag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021) and Winogrande (Sakaguchi et al., 2019). [...] Chatarena (lms, 2024) dataset, which encompasses multiple language pairs. [...] lmsys-chatbot-arena-conversations, 2024. URL https://huggingface.co/datasets/agie-ai/lmsys-chatbot_arena_conversations. |
| Dataset Splits | Yes | For the datasets used in the preference alignment process, we construct 20k-size datasets with different tie data proportions from Ultrafeedback (Cui et al., 2023). [...] Each sampled dataset exhibits a tie data ratio that varies within the set {0, 0.1, 0.2, 0.3}. [...] we curate an in-distribution test set containing 1500 non-tied samples and select the Reward Bench (Lambert et al.) as an out-of-distribution dataset. [...] In our experiments, we used a training set of 20,000 pairs with tie data ratios of 0 and 0.17 (the natural tie ratio of this dataset) and a test set of 1,500 randomly selected samples. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | We use the Adam optimizer and the weight decay is set to 0. We use a cosine learning rate scheduler... The paper mentions software components like the Adam optimizer and a cosine learning rate scheduler but does not provide specific version numbers for these or for any other libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | All comparative results are derived from training each model for 3 epochs on their respective training datasets. As demonstrated in the analysis presented in Appendix A.3, model performance is relatively insensitive to variations in α within the range (0.1, 0.8), so we set α = 0.5 in TODO. Other hyperparameters are shown in Appendix A.9, where we adopt the settings from previous works (Saeidi et al.; Meng et al., 2024). We ensure the consistency of training hyperparameters among experiments for a fair comparison. [...] Table 10: Training hyperparameter settings of DPO and TODO — Mistral+SFT: learning rate 5e-7, batch size 64, β = 0.01; Llama 3+SFT: learning rate 1e-6, batch size 128, β = 0.01. |
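The setup row above reports the hyperparameters (β = 0.01, α = 0.5) for a tie-aware, DPO-style objective. As a minimal sketch of what such an objective can look like, the snippet below implements an *illustrative* ternary preference loss: non-tie pairs get the standard DPO negative log-sigmoid term shifted by a margin α, while tie pairs are penalized for exhibiting any preference margin. This is an assumption-laden approximation for intuition only; the paper's exact TODO objective is defined by its Equations 18 and 19, and the names `todo_loss`, `logratio_chosen`, and `logratio_rejected` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def todo_loss(logratio_chosen: float,
              logratio_rejected: float,
              is_tie: bool,
              beta: float = 0.01,
              alpha: float = 0.5) -> float:
    """Illustrative ternary (tie-aware) DPO-style loss, NOT the paper's exact form.

    logratio_* : log pi_theta(y|x) - log pi_ref(y|x) for each response.
    is_tie     : whether annotators judged the response pair a tie.
    beta/alpha : defaults follow the settings reported in the paper.
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    if is_tie:
        # Tie pairs: push the preference margin toward zero from both sides,
        # with alpha controlling how much slack a tie is allowed.
        return (-math.log(sigmoid(alpha + margin))
                - math.log(sigmoid(alpha - margin)))
    # Non-tie pairs: standard DPO log-sigmoid term with an alpha margin.
    return -math.log(sigmoid(margin - alpha))


# A clear win should be cheaper than an ambiguous non-tie pair.
win_loss = todo_loss(2.0, -1.0, is_tie=False)
tie_loss = todo_loss(0.0, 0.0, is_tie=True)
```

With β = 0.01 the log-ratio margin is scaled down heavily, matching the small β reported in Table 10; a perfectly balanced tie pair (margin 0) attains the minimum of the tie branch, which is the intended behavior for tie-aware training.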