Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

ICML 2025

Reproducibility assessment (variable, result, supporting LLM response):
Research Type: Experimental
  "Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct."
Researcher Affiliation: Academia
  "(1) Shanghai AI Laboratory, (2) University of Washington, (3) The Chinese University of Hong Kong. Correspondence to: Yu Cheng <EMAIL>."
Pseudocode: No
  The paper describes the algorithm steps in paragraph form in Section 4.1, "Components of TPO," and visually in Figure 2, but does not include a dedicated pseudocode or algorithm block.
Open Source Code: Yes
  The code is available at https://github.com/yafuly/TPO.
Open Datasets: Yes
  "We evaluate our models using a comprehensive set of benchmarks that address various aspects, including instruction following (AlpacaEval 2, Li et al., 2023a, and Arena-Hard, Li et al., 2024), general preference alignment (HH-RLHF, Bai et al., 2022), safety (BeaverTails-Evaluation, Ji et al., 2023, and XSTest, Röttger et al., 2024), and mathematical ability (MATH-500, Lightman et al., 2024)."
Dataset Splits: Yes
  "We sample 500 instances from the HH-RLHF test set and use the full test set for the other benchmarks, with data statistics shown in Appendix B. For test-time training evaluation, we report the average reward score, calculated as the mean of rewards generated by the reward model across all outputs from the test prompts. Regarding benchmark performance, we follow the official settings for AlpacaEval 2 and Arena-Hard."
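The "average reward score" metric quoted above is simply the mean reward-model score over all sampled outputs for the test prompts. A minimal sketch, where `score` is a hypothetical stand-in for a real reward model:

```python
# Sketch of the test-time training metric described above: the mean
# reward-model score across all outputs sampled for the test prompts.
# `score` is a hypothetical stand-in for an actual reward model.

def average_reward_score(outputs, score):
    """Mean reward over (prompt, response) pairs from the test set."""
    rewards = [score(prompt, response) for prompt, response in outputs]
    return sum(rewards) / len(rewards)

# Toy usage with a dummy scorer that rewards longer responses.
outputs = [("q1", "short"), ("q1", "a longer answer"), ("q2", "ok")]
avg = average_reward_score(outputs, lambda p, r: len(r))
print(round(avg, 2))  # mean of 5, 15, 2 -> 7.33
```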
Hardware Specification: No
  The paper does not report the specific hardware (e.g., GPU/CPU models) used to run the experiments. It estimates computational cost in FLOPs for the models but does not name the actual experimental hardware.
Software Dependencies: No
  The paper mentions using TextGrad (Yuksekgonul et al., 2024) and vLLM (Kwon et al., 2023) but does not provide version numbers for these software components.
Experiment Setup: Yes
  "By default, we set the number of samples per TPO iteration (N) to 5. We optimize all models at test time with TPO for 5 iterations to analyze the test-time training curve, while limiting the maximum iterations (D) to 2 for benchmark evaluation (Section 6.1). For inference, we utilize vLLM (Kwon et al., 2023) to facilitate LLM generation, with a temperature of 0.7 and a top-p value of 0.95."
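The reported hyperparameters (N = 5 samples per iteration, at most D iterations) imply a sample-score-revise loop. A minimal schematic, assuming `generate`, `reward`, and `revise_with_feedback` are hypothetical stubs for the vLLM sampler (temperature 0.7, top-p 0.95), the reward model, and the paper's textual-feedback revision step:

```python
# Illustrative sketch of a TPO-style test-time loop, not the authors'
# implementation. Each iteration samples N candidates, scores them with
# a reward model, and folds textual feedback contrasting the best and
# worst candidates back into the prompt for the next iteration.

def tpo_loop(prompt, generate, reward, revise_with_feedback, N=5, D=2):
    best = None
    context = prompt
    for _ in range(D):
        candidates = [generate(context) for _ in range(N)]
        scored = sorted(candidates, key=reward, reverse=True)
        chosen, rejected = scored[0], scored[-1]
        if best is None or reward(chosen) > reward(best):
            best = chosen
        # Hypothetical feedback step: rewrite the prompt using the
        # contrast between the chosen and rejected responses.
        context = revise_with_feedback(prompt, chosen, rejected)
    return best

# Toy usage: the stub generator emits numbered strings and the stub
# reward prefers higher numbers, so the last iteration's top sample wins.
import itertools
counter = itertools.count()
gen = lambda ctx: f"resp-{next(counter)}"
rew = lambda r: int(r.split("-")[1])
rev = lambda p, chosen, rejected: p + " | " + chosen
print(tpo_loop("q", gen, rew, rev))  # resp-9 (best of 2 x 5 samples)
```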