A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations
Authors: Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the empirical study, we start by illustrating the learned distributional reward model in discrete Gridworld environments (Section 6.1). Next, we construct three Risky Point Maze environments and empirically evaluate the effectiveness of the proposed UA-PbRL algorithm with trajectory visualization (Section 6.2). To assess performance in more challenging settings, we also examine two complex robot navigation tasks (Section 6.3). Lastly, we extend the experiments to explore the application in Large Language Model alignment (Section 6.4). Experimental results demonstrate that UA-PbRL effectively identifies and avoids states with high uncertainty, facilitating risk-averse behaviors across various tasks, including robot control and language model alignment. |
| Researcher Affiliation | Academia | Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu — School of Data Science, The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1: Uncertainty-Aware Preference-based Reinforcement Learning (UA-PbRL) |
| Open Source Code | Yes | Experimental results demonstrate that UA-PbRL effectively identifies and avoids states with high uncertainty, facilitating risk-averse behaviors across various tasks, including robot control and language model alignment. The code is available at https://github.com/Jasonxu1225/UA-PbRL. |
| Open Datasets | Yes | Furthermore, we extended our experiments to the context of Large Language Model (LLM) alignment, demonstrating the effectiveness of UA-PbRL for LLM fine-tuning. ... Specifically, we finetune two publicly pre-trained LLMs, TinyLlama-1.1B (Zhang et al., 2024a) and Llama-3-8B (Dubey et al., 2024), on the PKU-SafeRLHF-10K dataset (Ji et al., 2023; Dai et al., 2024), which contains human-labeled preference data on the helpfulness and harmlessness of prompt-response pairs, where we prioritize the safer samples. ... The experiments are mainly based on the Uni-RLHF benchmark (Yuan et al., 2024), which provides human preference labels for the corresponding offline D4RL dataset. |
| Dataset Splits | No | The paper describes how the offline preference dataset is constructed, for example, by generating trajectories from different policies and sampling pairs from them. It also states: 'We evaluate each approach using 100 test episodes by reporting both the mean and CVaR0.1'. However, it does not provide explicit training/validation/test splits of a fixed dataset in terms of percentages or sample counts for training the models described in the paper. |
| Hardware Specification | Yes | In this paper, we utilized a total of 8 NVIDIA GeForce RTX 4090 GPUs, each equipped with 24 GB of memory. ... fine-tuning the PPO or Distributional PPO actor per epoch takes around 6 hours on 8 A800 GPUs. |
| Software Dependencies | No | The paper implicitly relies on Python and deep-learning frameworks such as PyTorch (through its references to SAC, PPO, etc.), but it does not specify exact version numbers for Python or for any other key software libraries, frameworks, or solvers. |
| Experiment Setup | Yes | Experiment Settings. Our experiments primarily utilize the public platform Uni-RLHF (Yuan et al., 2024), which is tailored for offline PbRL. ... The random seeds in the continuous environments are 0, 123, 321, and 666. We trained the agents offline and chose the final epoch for evaluation over 100 episodes. ... Table 4: List of hyperparameters in the proposed UA-PbRL. To ensure equitable comparisons, we maintain consistency in the parameters of the same neural networks across different models. ... For the choice of the prior α0, β0 for Beta distribution, we utilize an uninformed prior such that α0 = β0 = 1. |
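Two details quoted in the table — reporting the CVaR0.1 of episode returns alongside the mean, and placing an uninformed Beta(1, 1) prior on preference probabilities — can be illustrated with a minimal sketch. This is not code from the paper; the function names and example numbers below are hypothetical, and the Beta update is shown only for the standard conjugate pairwise-preference count case.

```python
import math

def cvar(returns, alpha=0.1):
    """CVaR_alpha: the mean of the worst alpha-fraction of episode
    returns (the lower tail), a standard risk-sensitive metric."""
    xs = sorted(returns)
    k = max(1, math.ceil(alpha * len(xs)))  # size of the worst tail
    return sum(xs[:k]) / k

def beta_posterior(prefer_a, prefer_b, alpha0=1.0, beta0=1.0):
    """Conjugate Beta update for P(a preferred over b) given preference
    counts, starting from the uninformed prior Beta(1, 1)."""
    return alpha0 + prefer_a, beta0 + prefer_b

# 100 episode returns 0..99: the worst 10% are 0..9, so CVaR0.1 = 4.5
print(cvar(range(100), alpha=0.1))  # → 4.5
print(beta_posterior(7, 3))         # → (8.0, 4.0)
```

With the uninformed prior, the posterior mean after 7-vs-3 preference counts is 8 / (8 + 4) ≈ 0.67, and the spread of the Beta posterior is one simple way to quantify preference uncertainty.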