Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization
Authors: Amir Saeidi, Shivanshu Verma, Kashif Rasul, Aswin RRV, Chitta Baral
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods... Comprehensive experiments demonstrate that TPO and TPO-L achieve simultaneous improvements in instruction-following and reasoning benchmarks... Section 4 Experimental Results |
| Researcher Affiliation | Collaboration | Amir Saeidi, School of Computing and Augmented Intelligence, Arizona State University; Shivanshu Verma, School of Computing and Augmented Intelligence, Arizona State University; Kashif Rasul, Morgan Stanley, New York, NY |
| Pseudocode | No | The paper provides mathematical derivations in Appendix A, notably in sections like 'Deriving the optimal policy under the Preference Objective' and 'Deriving the Gradient of the TPO Objective', but it does not contain any distinct section, figure, or block explicitly labeled 'Pseudocode' or 'Algorithm', nor are there any structured steps for a method formatted like code. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | To evaluate TPO and existing preference optimization methods, we follow the SimPO setup with minor adjustments, focusing on the Mistral Jiang et al. (2023a) and LLaMA Touvron et al. (2023) models... For this setting, we used the UltraFeedback (Cui et al., 2023) dataset... Reasoning benchmarks included MMLU (Hendrycks et al., 2021), MMLU-pro (Wang et al., 2024e), and GSM8K (Cobbe et al., 2021) |
| Dataset Splits | Yes | For this setting, we used the UltraFeedback (Cui et al., 2023) dataset, containing 60,000 data points... This resulted in a final dataset of 40,000 data points. To compare preference optimization methods fairly, we fine-tuned a pre-trained model on the gold responses and used the preferred and rejected responses for the current preference optimization methods in two steps. For TPO, we utilized all data in one optimization step. Moreover, we evaluated preference optimization methods on subsets of 5,000, 10,000, and 20,000 points randomly selected from the processed dataset. |
| Hardware Specification | Yes | Moreover, all the training experiments in this paper were conducted on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions the specific language models used as backbones (e.g., Llama-3-8B, Mistral-7B-v0.3) and links to their Hugging Face repositories, but it does not provide version numbers for the software libraries or frameworks (e.g., Python, PyTorch, CUDA) that constitute ancillary software dependencies. |
| Experiment Setup | Yes | Hyperparameter tuning is essential for optimizing the performance of preference optimization methods. To identify the best hyperparameters, we explored various learning rates [3e-7, 5e-7, 6e-7, 1e-6] and batch sizes [32, 64, 128, 256]. Our observations indicate that preference optimization methods perform best with a batch size of 32 for a training size of 5,000, 32 for 10,000, 64 for 20,000, and 128 for 60,000. However, for large datasets like 60,000, TPO performs best with a batch size of 256. Based on these findings, we fixed these batch sizes for all preference optimization experiments. Additionally, we set the maximum sequence length to 1024 for the Base setting and 2048 for the Instruct setting, and applied a cosine learning rate schedule with a 10% warm-up phase for the preference optimization dataset. We followed Table 7 in SimPO (Meng et al., 2024) for a search over the hyperparameter ranges used for the baseline methods, while Table 6 lists the hyperparameters for TPO and TPO-L under each experimental setting. |
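The cosine learning-rate schedule with a 10% warm-up phase quoted in the Experiment Setup row is a standard construction, but the paper gives no implementation. As a point of reference for reproduction, here is a minimal sketch of one common formulation (a hypothetical helper, not the authors' code; function and parameter names are our own):

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_frac=0.1):
    """One common cosine-decay schedule with linear warm-up.

    Linearly ramps the learning rate to base_lr over the first
    warmup_frac of training (10% here, matching the paper's setup),
    then decays it to ~0 along a half-cosine curve.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up: reaches base_lr at the end of the warm-up phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a learning rate of 5e-7 (one of the values the paper searched) and 1,000 total steps, the rate peaks at 5e-7 at step 99 and falls to half that at the midpoint of the decay phase. Libraries such as Hugging Face `transformers` provide an equivalent built-in (`get_cosine_schedule_with_warmup`), which reproducers may prefer.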