PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model

Authors: Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, Ying-Cong Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate PARM through experiments on safety alignment and helpful assistant tasks, demonstrating its effectiveness and efficiency in multi-objective test-time alignment. 5.1. Safety Alignment. Experimental Setups. Safety alignment aims to balance helpfulness and harmlessness in language models when responding to red-teaming prompts. We use the PKU-SafeRLHF-10K dataset (Ji et al., 2023; 2024), which provides harmlessness and helpfulness annotations for each question-answering (QA) pair. Following Zhou et al. (2024), we randomly split the dataset into three parts: 8K samples for training, 0.5K for validation, and the remaining 1.5K for testing. Evaluation. We evaluate all methods on the test dataset using a range of preference vectors evenly sampled from the simplex with an interval of 0.1, i.e., α ∈ {(0.0, 1.0), (0.1, 0.9), …, (1.0, 0.0)}. Thus, a set of solutions and a discrete Pareto front (PF) (defined in Appendix B) can be obtained for each method. We employ two widely-used multi-objective metrics (Zhang et al., 2024c) for quantitative evaluation: (i) Hypervolume (HV) (Zitzler & Thiele, 1998)...
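As a concrete illustration of the evaluation protocol quoted above, here is a minimal Python sketch of the 0.1-interval preference grid and a two-objective hypervolume computation. The function names and the rectangle-union implementation are our own illustrative choices, not taken from the PARM codebase.

```python
import numpy as np

def preference_grid(step=0.1):
    """2-D preference vectors evenly sampled from the simplex with the given interval,
    i.e. (0.0, 1.0), (0.1, 0.9), ..., (1.0, 0.0)."""
    weights = np.round(np.arange(0.0, 1.0 + 1e-9, step), 10)
    return [(float(w), float(round(1.0 - w, 10))) for w in weights]

def hypervolume_2d(points, ref):
    """Hypervolume (for maximization) of 2-D objective points w.r.t. a reference
    point `ref`: the area of the union of rectangles spanned by each point and `ref`."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    hv, cur_y = 0.0, ref[1]
    for x, y in pts:
        if y > cur_y:  # this point extends the dominated region upward
            hv += (x - ref[0]) * (y - cur_y)
            cur_y = y
    return hv
```

A solution set evaluated at each vector in `preference_grid()` yields the discrete Pareto front, whose quality can then be summarized by `hypervolume_2d`.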
Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology (Guangzhou), 2The Chinese University of Hong Kong, 3University of Maryland, College Park, 4The Hong Kong University of Science and Technology. Correspondence to: Baijiong Lin <EMAIL>, Ying-Cong Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training of PARM.
Require: initial model π_{θ0}, ranks r1 and r2 for PBLoRA, number of preference dimensions k, datasets for each preference dimension {D_i}_{i=1}^k.
1: Initialize the parameters of PBLoRA, Θ;
2: while not converged do
3:   Sample a preference vector α from the simplex ∆^{k−1};
4:   Compute θ(α) via Equations (8) and (9);
5:   for i in 1, …, k do
6:     Sample a data batch B_i from D_i;
7:     Compute loss ℓ(π_{θ(α)}, B_i) via Equation (3);
8:   end for
9:   Compute total loss Σ_{i=1}^k α_i ℓ(π_{θ(α)}, B_i);
10:  Update Θ via gradient descent;
11: end while
12: return (π_{θ0}, Θ).
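The training loop in Algorithm 1 can be sketched in a few lines of NumPy. This is a toy surrogate only: a linear preference-weighted parameter offset stands in for the actual PBLoRA parameterization of Equations (8)-(9), and a squared-error loss stands in for the loss of Equation (3); the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2                                        # number of preference dimensions
theta0 = rng.normal(size=4)                  # frozen base parameters (toy stand-in)
Theta = [np.zeros(4) for _ in range(k)]      # trainable parameters Θ (toy PBLoRA surrogate)
targets = [np.ones(4), -np.ones(4)]          # one toy "dataset" per preference dimension

def theta_of_alpha(alpha):
    # Toy surrogate for Eqs. (8)-(9): preference-weighted parameter offset.
    return theta0 + sum(a * t for a, t in zip(alpha, Theta))

def loss_and_grad(theta, target):
    # Squared-error stand-in for the per-dimension loss of Eq. (3).
    diff = theta - target
    return float(diff @ diff), 2.0 * diff

lr = 0.05
for step in range(500):
    alpha = rng.dirichlet(np.ones(k))        # sample α from the simplex ∆^{k-1}
    theta = theta_of_alpha(alpha)
    grads = [np.zeros(4) for _ in range(k)]
    for i in range(k):                       # loop over preference dimensions
        _, gi = loss_and_grad(theta, targets[i])
        for j in range(k):
            # total loss is Σ_i α_i ℓ_i; chain rule gives ∂θ(α)/∂Θ_j = α_j
            grads[j] += alpha[i] * gi * alpha[j]
    for j in range(k):
        Theta[j] -= lr * grads[j]            # gradient descent on Θ only; θ0 stays frozen
```

After training, evaluating `theta_of_alpha` at a simplex corner recovers the single-objective optimum for that dimension, mirroring how PARM is conditioned on a preference vector at test time.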
Open Source Code | Yes | The code is available at https://github.com/Baijiong-Lin/PARM.
Open Datasets | Yes | We use the PKU-SafeRLHF-10K dataset (Ji et al., 2023; 2024), which provides harmlessness and helpfulness annotations for each question-answering (QA) pair. We use the HH-RLHF dataset (Bai et al., 2022), which contains 160K prompts and the corresponding responses, in the form of multi-turn dialogue. F. Sources of Datasets and Models. In Table 6, we provide the sources of datasets and models used in our experiments.
Dataset Splits | Yes | Following Zhou et al. (2024), we randomly split the dataset into three parts: 8K samples for training, 0.5K for validation, and the remaining 1.5K for testing. We randomly sample 10K, 1K, and 1K data samples from the HH-RLHF dataset for training, validation, and testing.
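The 8K/0.5K/1.5K split described above can be reproduced with a simple index shuffle; a minimal sketch, assuming a seeded random permutation (the quoted text does not specify the paper's exact split seed or procedure):

```python
import random

def split_indices(n_total, n_train, n_val, seed=0):
    """Randomly partition dataset indices into disjoint train/validation/test splits."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# PKU-SafeRLHF-10K: 8K train / 0.5K validation / 1.5K test
train, val, test = split_indices(10_000, 8_000, 500)
```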
Hardware Specification | Yes | Table 4: Performance of MOD-w2s (Shi et al., 2024), GenARM (Xu et al., 2025) and PARM on the helpful assistant tasks. All methods are first fine-tuned on TinyLlama-1.1B-Chat, then guide the frozen LLaMA-2-7B-Chat's generation. Time (second) denotes the inference time of generating 512 tokens on a single NVIDIA A40 GPU.
Software Dependencies | No | Our implementation is based on the open-source trl library (von Werra et al., 2020). Our implementation is based on the peft library (Mangrulkar et al., 2022).
Experiment Setup | Yes | The proposed PARM is fine-tuned from the Alpaca-7B model using PBLoRA for 2 epochs with βr = 0.01, a learning rate of 5 × 10−4, and a total batch size of 32. During generation, we set β = 1 and use a maximum generation length of 1024 tokens for all methods.
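The reported hyperparameters can be collected into a single configuration for reproduction. The key names below are our own, not taken from the released code; only the values come from the quoted setup.

```python
# Hypothetical config mirroring the reported PARM fine-tuning and generation settings.
parm_config = {
    "base_model": "Alpaca-7B",
    "adapter": "PBLoRA",
    "epochs": 2,
    "beta_r": 0.01,             # regularization coefficient β_r
    "learning_rate": 5e-4,
    "total_batch_size": 32,
    "generation": {
        "beta": 1.0,            # guidance strength β at decoding time
        "max_new_tokens": 1024, # maximum generation length
    },
}
```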