TODO: Enhancing LLM Alignment with Ternary Preferences
Authors: Yuxiang Guo, Lu Yin, Bo Jiang, Jiaqi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments using MT Bench and benchmarks such as Piqa, ARC-c, and MMLU further demonstrate TODO's superior alignment performance. [...] We use Mistral-7B and Llama 3-8B to conduct experimental validation. |
| Researcher Affiliation | Collaboration | Yuxiang Guo (Meituan Inc., Beihang University), Lu Yin (Meituan Inc., University of Surrey), Bo Jiang (Beihang University), Jiaqi Zhang (Meituan Inc.) |
| Pseudocode | No | The paper provides mathematical derivations and gradient update equations (e.g., Equations 18 and 19), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The implementation details and datasets can be found in https://github.com/XXares/TODO. |
| Open Datasets | Yes | Table 1 shows a sample from the Ultrafeedback-binarized dataset (Tunstall et al.), available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized. [...] Reward Bench (Lambert et al.). [...] MT Bench (Zheng et al., 2023) and other popular benchmarks such as Piqa (Bisk et al., 2019), ARC-c, ARC-e (Clark et al., 2018), Hellaswag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021) and Winogrande (Sakaguchi et al., 2019). [...] Chatarena (lms, 2024) dataset, which encompasses multiple language pairs. [...] lmsys-chatbot-arena-conversations, 2024. URL https://huggingface.co/datasets/agie-ai/lmsys-chatbot_arena_conversations. |
| Dataset Splits | Yes | For the datasets used in the preference alignment process, we construct 20k-size datasets with different tie data proportions from Ultrafeedback (Cui et al., 2023). [...] Each sampled dataset exhibits a tie data ratio that varies within the set {0, 0.1, 0.2, 0.3}. [...] we curate an in-distribution test set containing 1500 non-tied samples and select the Reward Bench (Lambert et al.) as an out-of-distribution dataset. [...] In our experiments, we used a training set of 20,000 pairs with tie data ratios of 0 and 0.17 (the natural tie ratio of this dataset) and a test set of 1,500 randomly selected samples. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | We use the Adam optimizer and the weight decay is set to 0. We use a cosine learning rate scheduler... The paper mentions software components like the Adam optimizer and a cosine learning rate scheduler but does not provide specific version numbers for these or for any other libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | All comparative results are derived from training each model for 3 epochs on their respective training datasets. As demonstrated in the analysis presented in Appendix A.3, model performance is relatively insensitive to variations in α within the range (0.1, 0.8), so we set α = 0.5 in TODO. Other hyperparameters are shown in Appendix A.9, where we adopt the settings from previous works (Saeidi et al.; Meng et al., 2024). We ensure the consistency of training hyperparameters among experiments for a fair comparison. [...] Table 10: Training hyperparameter settings of DPO and TODO — Mistral+SFT: learning rate 5e-7, batch size 64, β = 0.01; Llama 3+SFT: learning rate 1e-6, batch size 128, β = 0.01. |
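The setup row above reports the hyperparameters (β = 0.01, α = 0.5) for a tie-aware, DPO-style objective. As a minimal sketch of what such an objective can look like, the snippet below implements an *illustrative* ternary preference loss: non-tie pairs get the standard DPO negative log-sigmoid term shifted by a margin α, while tie pairs are penalized for exhibiting any preference margin. This is an assumption-laden approximation for intuition only; the paper's exact TODO objective is defined by its Equations 18 and 19, and the names `todo_loss`, `logratio_chosen`, and `logratio_rejected` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def todo_loss(logratio_chosen: float,
              logratio_rejected: float,
              is_tie: bool,
              beta: float = 0.01,
              alpha: float = 0.5) -> float:
    """Illustrative ternary (tie-aware) DPO-style loss, NOT the paper's exact form.

    logratio_* : log pi_theta(y|x) - log pi_ref(y|x) for each response.
    is_tie     : whether annotators judged the response pair a tie.
    beta/alpha : defaults follow the settings reported in the paper.
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    if is_tie:
        # Tie pairs: push the preference margin toward zero from both sides,
        # with alpha controlling how much slack a tie is allowed.
        return (-math.log(sigmoid(alpha + margin))
                - math.log(sigmoid(alpha - margin)))
    # Non-tie pairs: standard DPO log-sigmoid term with an alpha margin.
    return -math.log(sigmoid(margin - alpha))


# A clear win should be cheaper than an ambiguous non-tie pair.
win_loss = todo_loss(2.0, -1.0, is_tie=False)
tie_loss = todo_loss(0.0, 0.0, is_tie=True)
```

With β = 0.01 the log-ratio margin is scaled down heavily, matching the small β reported in Table 10; a perfectly balanced tie pair (margin 0) attains the minimum of the tie branch, which is the intended behavior for tie-aware training.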