Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. |
| Researcher Affiliation | Academia | 1Shanghai AI Laboratory 2University of Washington 3The Chinese University of Hong Kong. Correspondence to: Yu Cheng <EMAIL>. |
| Pseudocode | No | The paper describes the algorithm steps in paragraph form in Section 4.1 'Components of TPO' and visually in Figure 2, but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/yafuly/TPO. |
| Open Datasets | Yes | We evaluate our models using a comprehensive set of benchmarks that address various aspects, including instruction following (AlpacaEval 2 (Li et al., 2023a) and Arena-Hard (Li et al., 2024)), general preference alignment (HH-RLHF (Bai et al., 2022)), safety (BeaverTails-Evaluation (Ji et al., 2023) and XSTest (Röttger et al., 2024)), and mathematical ability (MATH-500 (Lightman et al., 2024)). |
| Dataset Splits | Yes | We sample 500 instances from the HH-RLHF test set and use the full test set for the other benchmarks, with data statistics shown in Appendix B. For test-time training evaluation, we report the average reward score, calculated as the mean of rewards generated by the reward model across all outputs from the test prompt. Regarding benchmark performance, we follow the official settings for AlpacaEval 2 and Arena-Hard. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models) used for running the experiments. It mentions calculating computational cost (FLOPs) for models but not the actual experimental hardware. |
| Software Dependencies | No | The paper mentions using 'TextGrad (Yuksekgonul et al., 2024)' and 'vLLM (Kwon et al., 2023)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | By default, we set the number of samples per TPO iteration (N) to 5. We optimize all models at test time with TPO for 5 iterations to analyze the test-time training curve, while limiting the maximum iterations (D) to 2 for benchmark evaluation (Section 6.1). For inference, we utilize vLLM (Kwon et al., 2023) to facilitate LLM generation, with a temperature of 0.7 and a top-p value of 0.95. |
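Since the paper provides no pseudocode block (see the Pseudocode row above), the setup described in the table can be sketched as a minimal, self-contained test-time loop: sample N candidate responses per iteration, score them with a reward model, and feed textual feedback on the best/worst pair into the next iteration's prompt. All function bodies below (`generate`, `reward`, `revise_prompt`) are hypothetical toy stand-ins, not the paper's actual models; the real implementation samples via vLLM with temperature 0.7 and top-p 0.95.

```python
# Hypothetical sketch of a test-time preference-optimization (TPO) loop,
# assuming the defaults reported in the table: N = 5 samples per
# iteration and a maximum of D = 2 iterations for benchmark evaluation.
# The three helpers are deterministic toys standing in for the LLM
# sampler, the reward model, and the textual-feedback step.

N_SAMPLES = 5  # samples per TPO iteration (paper's default N)
MAX_ITERS = 2  # maximum iterations D used for benchmark evaluation


def generate(prompt: str, n: int) -> list[str]:
    """Toy stand-in for an LLM sampler (e.g. vLLM, temperature=0.7, top_p=0.95)."""
    return [f"{prompt} :: draft-{i}" for i in range(n)]


def reward(response: str) -> float:
    """Toy stand-in for a reward model; here, longer responses score higher."""
    return float(len(response))


def revise_prompt(prompt: str, best: str, worst: str) -> str:
    """Toy stand-in for converting the chosen/rejected pair into textual feedback."""
    return f"{prompt} [prefer style of: {best!r}; avoid: {worst!r}]"


def tpo_loop(prompt: str) -> str:
    """Run MAX_ITERS rounds of sample -> score -> textual feedback."""
    current = prompt
    best_overall = None
    for _ in range(MAX_ITERS):
        samples = generate(current, N_SAMPLES)
        scored = sorted(samples, key=reward)  # ascending reward
        worst, best = scored[0], scored[-1]
        if best_overall is None or reward(best) > reward(best_overall):
            best_overall = best
        current = revise_prompt(current, best, worst)
    return best_overall
```

The loop keeps the highest-reward response seen so far, so extra iterations can only improve (or maintain) the returned score, which mirrors the progressive-alignment behavior the table's Research Type row reports.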