Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization
Authors: Amir Saeidi, Shivanshu Verma, Kashif Rasul, Aswin RRV, Chitta Baral
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods... Comprehensive experiments demonstrate that TPO and TPO-L achieve simultaneous improvements in instruction-following and reasoning benchmarks... Section 4 Experimental Results |
| Researcher Affiliation | Collaboration | Amir Saeidi, School of Computing and Augmented Intelligence, Arizona State University; Shivanshu Verma, School of Computing and Augmented Intelligence, Arizona State University; Kashif Rasul, Morgan Stanley, New York, NY |
| Pseudocode | No | The paper provides mathematical derivations in Appendix A, notably in sections like 'Deriving the optimal policy under the Preference Objective' and 'Deriving the Gradient of the TPO Objective', but it does not contain any distinct section, figure, or block explicitly labeled 'Pseudocode' or 'Algorithm', nor are there any structured steps for a method formatted like code. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | To evaluate TPO and existing preference optimization methods, we follow the SimPO setup with minor adjustments, focusing on the Mistral Jiang et al. (2023a) and LLaMA Touvron et al. (2023) models... For this setting, we used the UltraFeedback (Cui et al., 2023) dataset... Reasoning benchmarks included MMLU (Hendrycks et al., 2021), MMLU-pro (Wang et al., 2024e), and GSM8K (Cobbe et al., 2021) |
| Dataset Splits | Yes | For this setting, we used the UltraFeedback (Cui et al., 2023) dataset, containing 60,000 data points... This resulted in a final dataset of 40,000 data points. To compare preference optimization methods fairly, we fine-tuned a pre-trained model on the gold responses and used the preferred and rejected responses for the current preference optimization methods in two steps. For TPO, we utilized all data in one optimization step. Moreover, we evaluated preference optimization methods on subsets of 5,000, 10,000, and 20,000 points randomly selected from the processed dataset. |
| Hardware Specification | Yes | Moreover, all the training experiments in this paper were conducted on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions the specific language models used as backbones (e.g., Llama-3-8B, Mistral-7B-v0.3) and links to their Hugging Face repositories, but it does not provide version numbers for the software libraries or frameworks (e.g., Python, PyTorch, CUDA) that constitute ancillary software dependencies. |
| Experiment Setup | Yes | Hyperparameter tuning is essential for optimizing the performance of preference optimization methods. To identify the best hyperparameters, we explored various learning rates [3e-7, 5e-7, 6e-7, 1e-6] and batch sizes [32, 64, 128, 256]. Our observations indicate that preference optimization methods perform best with a batch size of 32 for a training size of 5,000, 32 for 10,000, 64 for 20,000, and 128 for 60,000. However, for large datasets like 60,000, TPO performs best with a batch size of 256. Based on these findings, we fixed these batch sizes for all preference optimization experiments. Additionally, we set the maximum sequence length to 1024 for the Base setting and 2048 for the Instruct setting, and applied a cosine learning rate schedule with a 10% warm-up phase for the preference optimization dataset. We followed Table 7 in SimPO (Meng et al., 2024) for a search over the hyperparameter ranges used for the baseline methods, while Table 6 lists the hyperparameters for TPO and TPO-L under each experimental setting. |
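The cosine learning-rate schedule with a 10% warm-up phase quoted in the Experiment Setup row is a standard construction, but the paper gives no implementation. As a point of reference for reproduction, here is a minimal sketch of one common formulation (a hypothetical helper, not the authors' code; function and parameter names are our own):

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_frac=0.1):
    """One common cosine-decay schedule with linear warm-up.

    Linearly ramps the learning rate to base_lr over the first
    warmup_frac of training (10% here, matching the paper's setup),
    then decays it to ~0 along a half-cosine curve.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up: reaches base_lr at the end of the warm-up phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a learning rate of 5e-7 (one of the values the paper searched) and 1,000 total steps, the rate peaks at 5e-7 at step 99 and falls to half that at the midpoint of the decay phase. Libraries such as Hugging Face `transformers` provide an equivalent built-in (`get_cosine_schedule_with_warmup`), which reproducers may prefer.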