PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model
Authors: Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate PARM through experiments on safety alignment and helpful assistant tasks, demonstrating its effectiveness and efficiency in multi-objective test-time alignment. 5.1. Safety Alignment. Experimental Setups. Safety alignment aims to balance helpfulness and harmlessness in language models when responding to red-teaming prompts. We use the PKU-SafeRLHF-10K dataset (Ji et al., 2023; 2024), which provides harmlessness and helpfulness annotations for each question-answering (QA) pair. Following Zhou et al. (2024), we randomly split the dataset into three parts: 8K samples for training, 0.5K for validation, and the remaining 1.5K for testing. Evaluation. We evaluate all methods on the test dataset using a range of preference vectors evenly sampled from the simplex with an interval of 0.1, i.e., α ∈ {(0.0, 1.0), (0.1, 0.9), …, (1.0, 0.0)}. Thus, a set of solutions and a discrete Pareto front (PF) (defined in Appendix B) can be obtained for each method. We employ two widely-used multi-objective metrics (Zhang et al., 2024c) for quantitative evaluation: (i) Hypervolume (HV) (Zitzler & Thiele, 1998)... |
| Researcher Affiliation | Academia | 1 The Hong Kong University of Science and Technology (Guangzhou); 2 The Chinese University of Hong Kong; 3 University of Maryland, College Park; 4 The Hong Kong University of Science and Technology. Correspondence to: Baijiong Lin <EMAIL>, Ying-Cong Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Training of PARM. Require: initial model π_{θ0}, ranks r1 and r2 for PBLoRA, number of preference dimensions k, datasets for each preference dimension {D_i}_{i=1}^{k}. 1: Initialize the parameters Θ of PBLoRA; 2: while not converged do 3: Sample a preference vector α from the simplex Δ_{k−1}; 4: Compute θ(α) via Equations (8) and (9); 5: for i = 1, …, k do 6: Sample a data batch B_i from D_i; 7: Compute loss ℓ(π_{θ(α)}, B_i) via Equation (3); 8: end for; 9: Compute total loss Σ_{i=1}^{k} α_i ℓ(π_{θ(α)}, B_i); 10: Update Θ via gradient descent; 11: end while; 12: return (π_{θ0}, Θ). |
| Open Source Code | Yes | The code is available at https://github.com/Baijiong-Lin/PARM. |
| Open Datasets | Yes | We use the PKU-SafeRLHF-10K dataset (Ji et al., 2023; 2024), which provides harmlessness and helpfulness annotations for each question-answering (QA) pair. We use the HH-RLHF dataset (Bai et al., 2022), which contains 160K prompts and the corresponding responses in the form of multi-turn dialogue. F. Sources of Datasets and Models. In Table 6, we provide the sources of datasets and models used in our experiments. |
| Dataset Splits | Yes | Following (Zhou et al., 2024), we randomly split the dataset into three parts: 8K samples for training, 0.5K for validation, and the remaining 1.5K for testing. We randomly sample 10K, 1K, and 1K data samples from the HH-RLHF dataset for training, validation, and testing. |
| Hardware Specification | Yes | Table 4: Performance of MOD-w2s (Shi et al., 2024), GenARM (Xu et al., 2025), and PARM on the helpful assistant tasks. All methods are first fine-tuned on TinyLLaMA-1.1B-Chat and then guide the frozen LLaMA-2-7B-Chat's generation. Time (seconds) denotes the inference time of generating 512 tokens on a single NVIDIA A40 GPU. |
| Software Dependencies | No | Our implementation is based on the open-source trl library (von Werra et al., 2020). Our implementation is based on the peft library (Mangrulkar et al., 2022). |
| Experiment Setup | Yes | The proposed PARM is fine-tuned from the Alpaca-7B model using PBLoRA for 2 epochs with βr = 0.01, a learning rate of 5 × 10−4, and a total batch size of 32. During generation, we set β = 1 and use a maximum generation length of 1024 tokens for all methods. |
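The structure of Algorithm 1 (sample a preference vector from the simplex, build preference-conditioned parameters, minimize the preference-weighted sum of per-objective losses) can be sketched on a toy problem. Everything below is illustrative: the quadratic losses, the targets, and the linear conditioning `theta_of_alpha` are stand-ins, not the paper's PBLoRA parameterization (Equations (8) and (9) are not reproduced in this report).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: two objectives with quadratic losses
# pulling the parameters toward different targets.
theta0 = np.zeros(2)                 # frozen "initial model" parameters
targets = [np.array([1.0, 0.0]),     # optimum of objective 1
           np.array([0.0, 1.0])]     # optimum of objective 2
deltas = [np.zeros(2), np.zeros(2)]  # trainable preference-conditioned adapters

def theta_of_alpha(alpha):
    # Illustrative linear conditioning on the preference vector;
    # PARM's actual PBLoRA conditioning is defined by its Eqs. (8)-(9).
    return theta0 + sum(a * d for a, d in zip(alpha, deltas))

lr, steps = 0.1, 3000
for _ in range(steps):
    alpha = rng.dirichlet([1.0, 1.0])          # sample alpha from the simplex
    theta = theta_of_alpha(alpha)
    # gradient of sum_i alpha_i * ||theta - t_i||^2 w.r.t. theta
    g = 2.0 * sum(a * (theta - t) for a, t in zip(alpha, targets))
    for j in range(2):                         # chain rule: d theta / d delta_j = alpha_j
        deltas[j] -= lr * alpha[j] * g

# After training, one set of adapters serves any preference at test time:
print(theta_of_alpha([1.0, 0.0]))  # ~ targets[0]
print(theta_of_alpha([0.0, 1.0]))  # ~ targets[1]
print(theta_of_alpha([0.5, 0.5]))  # ~ the midpoint trade-off
```

Training with a freshly sampled preference vector at every step is what lets a single model cover the whole Pareto front at inference without retraining.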
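The evaluation quotes rely on the hypervolume (HV) indicator, which for a two-objective maximization problem is the area dominated by the Pareto front above a reference point. A minimal sketch of the textbook 2-D sweep computation (the function name, sample front, and reference point are illustrative, not the paper's evaluation code):

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` (maximization) above reference point `ref`.

    Sort by the first objective in descending order and accumulate the
    rectangle each point adds beyond the best second objective seen so far.
    """
    area = 0.0
    best_y = ref[1]
    for x, y in sorted(points, reverse=True):
        if y > best_y:                      # dominated points add nothing
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(0.0, 0.0)))                  # 6.0
print(hypervolume_2d(front + [(1.0, 1.0)], ref=(0.0, 0.0)))   # dominated point: still 6.0
```

A larger HV means the method's set of solutions (one per preference vector) pushes the front further out, which is why it pairs naturally with the evenly sampled preference grid described above.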