Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

ICML 2025

Reproducibility assessment (variable, result, supporting LLM response):
Research Type: Experimental
  "Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct."
Researcher Affiliation: Academia
  "(1) Shanghai AI Laboratory, (2) University of Washington, (3) The Chinese University of Hong Kong. Correspondence to: Yu Cheng <EMAIL>."
Pseudocode: No
  The paper describes the algorithm steps in paragraph form in Section 4.1, "Components of TPO," and visually in Figure 2, but does not include a dedicated pseudocode or algorithm block.
Open Source Code: Yes
  The code is available at https://github.com/yafuly/TPO.
Open Datasets: Yes
  "We evaluate our models using a comprehensive set of benchmarks that address various aspects, including instruction following (AlpacaEval 2, Li et al., 2023a, and Arena-Hard, Li et al., 2024), general preference alignment (HH-RLHF, Bai et al., 2022), safety (BeaverTails-Evaluation, Ji et al., 2023, and XSTest, Röttger et al., 2024), and mathematical ability (MATH-500, Lightman et al., 2024)."
Dataset Splits: Yes
  "We sample 500 instances from the HH-RLHF test set and use the full test set for the other benchmarks, with data statistics shown in Appendix B. For test-time training evaluation, we report the average reward score, calculated as the mean of rewards generated by the reward model across all outputs from the test prompts. Regarding benchmark performance, we follow the official settings for AlpacaEval 2 and Arena-Hard."
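The "average reward score" metric quoted above is simply the mean reward-model score over all sampled outputs for the test prompts. A minimal sketch, where `score` is a hypothetical stand-in for a real reward model:

```python
# Sketch of the test-time training metric described above: the mean
# reward-model score across all outputs sampled for the test prompts.
# `score` is a hypothetical stand-in for an actual reward model.

def average_reward_score(outputs, score):
    """Mean reward over (prompt, response) pairs from the test set."""
    rewards = [score(prompt, response) for prompt, response in outputs]
    return sum(rewards) / len(rewards)

# Toy usage with a dummy scorer that rewards longer responses.
outputs = [("q1", "short"), ("q1", "a longer answer"), ("q2", "ok")]
avg = average_reward_score(outputs, lambda p, r: len(r))
print(round(avg, 2))  # mean of 5, 15, 2 -> 7.33
```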
Hardware Specification: No
  The paper does not report the specific hardware (e.g., GPU/CPU models) used to run the experiments. It estimates computational cost in FLOPs for the models but does not name the actual experimental hardware.
Software Dependencies: No
  The paper mentions using TextGrad (Yuksekgonul et al., 2024) and vLLM (Kwon et al., 2023) but does not provide version numbers for these software components.
Experiment Setup: Yes
  "By default, we set the number of samples per TPO iteration (N) to 5. We optimize all models at test time with TPO for 5 iterations to analyze the test-time training curve, while limiting the maximum iterations (D) to 2 for benchmark evaluation (Section 6.1). For inference, we utilize vLLM (Kwon et al., 2023) to facilitate LLM generation, with a temperature of 0.7 and a top-p value of 0.95."
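The reported hyperparameters (N = 5 samples per iteration, at most D iterations) imply a sample-score-revise loop. A minimal schematic, assuming `generate`, `reward`, and `revise_with_feedback` are hypothetical stubs for the vLLM sampler (temperature 0.7, top-p 0.95), the reward model, and the paper's textual-feedback revision step:

```python
# Illustrative sketch of a TPO-style test-time loop, not the authors'
# implementation. Each iteration samples N candidates, scores them with
# a reward model, and folds textual feedback contrasting the best and
# worst candidates back into the prompt for the next iteration.

def tpo_loop(prompt, generate, reward, revise_with_feedback, N=5, D=2):
    best = None
    context = prompt
    for _ in range(D):
        candidates = [generate(context) for _ in range(N)]
        scored = sorted(candidates, key=reward, reverse=True)
        chosen, rejected = scored[0], scored[-1]
        if best is None or reward(chosen) > reward(best):
            best = chosen
        # Hypothetical feedback step: rewrite the prompt using the
        # contrast between the chosen and rejected responses.
        context = revise_with_feedback(prompt, chosen, rejected)
    return best

# Toy usage: the stub generator emits numbered strings and the stub
# reward prefers higher numbers, so the last iteration's top sample wins.
import itertools
counter = itertools.count()
gen = lambda ctx: f"resp-{next(counter)}"
rew = lambda r: int(r.split("-")[1])
rev = lambda p, chosen, rejected: p + " | " + chosen
print(tpo_loop("q", gen, rew, rev))  # resp-9 (best of 2 x 5 samples)
```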