A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations
Authors: Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the empirical study, we start by illustrating the learned distributional reward model in discrete Gridworld environments (Section 6.1). Next, we construct three Risky Point Maze environments and empirically evaluate the effectiveness of the proposed UA-PbRL algorithm with trajectory visualization (Section 6.2). To assess performance in more challenging settings, we also examine two complex robot navigation tasks (Section 6.3). Lastly, we extend the experiments to explore the application in Large Language Model alignment (Section 6.4). Experimental results demonstrate that UA-PbRL effectively identifies and avoids states with high uncertainty, facilitating risk-averse behaviors across various tasks, including robot control and language model alignment. |
| Researcher Affiliation | Academia | Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu — School of Data Science, The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1: Uncertainty-Aware Preference-based Reinforcement Learning (UA-PbRL) |
| Open Source Code | Yes | Experimental results demonstrate that UA-PbRL effectively identifies and avoids states with high uncertainty, facilitating risk-averse behaviors across various tasks, including robot control and language model alignment. The code is available at https://github.com/Jasonxu1225/UA-PbRL. |
| Open Datasets | Yes | Furthermore, we extended our experiments to the context of Large Language Model (LLM) alignment, demonstrating the effectiveness of UA-PbRL for LLM fine-tuning. ... Specifically, we finetune two publicly pre-trained LLMs, TinyLlama-1.1B (Zhang et al., 2024a) and Llama-3-8B (Dubey et al., 2024), on the PKU-SafeRLHF-10K dataset (Ji et al., 2023; Dai et al., 2024), which contains human-labeled preference data on the helpfulness and harmlessness of prompt-response pairs, where we prioritize the safer samples. ... The experiments are mainly based on the Uni-RLHF benchmark (Yuan et al., 2024), which provides human preference labels for the corresponding offline D4RL dataset. |
| Dataset Splits | No | The paper describes how the offline preference dataset is constructed, for example, by generating trajectories from different policies and sampling pairs from them. It also states: 'We evaluate each approach using 100 test episodes by reporting both the mean and CVaR0.1'. However, it does not provide explicit training/validation/test splits of a fixed dataset in terms of percentages or sample counts for training the models described in the paper. |
| Hardware Specification | Yes | In this paper, we utilized a total of 8 NVIDIA GeForce RTX 4090 GPUs, each equipped with 24 GB of memory. ... fine-tuning the PPO or Distributional PPO actor per epoch takes around 6 hours on 8 A800 GPUs. |
| Software Dependencies | No | The paper implicitly relies on Python and deep-learning frameworks such as PyTorch (through its references to SAC, PPO, etc.), but it does not specify exact version numbers for Python or for any other key software libraries, frameworks, or solvers. |
| Experiment Setup | Yes | Experiment Settings. Our experiments primarily utilize the public platform Uni-RLHF (Yuan et al., 2024), which is tailored for offline PbRL. ... The random seeds in the continuous environments are 0, 123, 321, and 666. We trained the agents offline and chose the final epoch for evaluation over 100 episodes. ... Table 4: List of hyperparameters in the proposed UA-PbRL. To ensure equitable comparisons, we maintain consistency in the parameters of the same neural networks across different models. ... For the choice of the prior α0, β0 for Beta distribution, we utilize an uninformed prior such that α0 = β0 = 1. |
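Two details quoted in the table — reporting the CVaR0.1 of episode returns alongside the mean, and placing an uninformed Beta(1, 1) prior on preference probabilities — can be illustrated with a minimal sketch. This is not code from the paper; the function names and example numbers below are hypothetical, and the Beta update is shown only for the standard conjugate pairwise-preference count case.

```python
import math

def cvar(returns, alpha=0.1):
    """CVaR_alpha: the mean of the worst alpha-fraction of episode
    returns (the lower tail), a standard risk-sensitive metric."""
    xs = sorted(returns)
    k = max(1, math.ceil(alpha * len(xs)))  # size of the worst tail
    return sum(xs[:k]) / k

def beta_posterior(prefer_a, prefer_b, alpha0=1.0, beta0=1.0):
    """Conjugate Beta update for P(a preferred over b) given preference
    counts, starting from the uninformed prior Beta(1, 1)."""
    return alpha0 + prefer_a, beta0 + prefer_b

# 100 episode returns 0..99: the worst 10% are 0..9, so CVaR0.1 = 4.5
print(cvar(range(100), alpha=0.1))  # → 4.5
print(beta_posterior(7, 3))         # → (8.0, 4.0)
```

With the uninformed prior, the posterior mean after 7-vs-3 preference counts is 8 / (8 + 4) ≈ 0.67, and the spread of the Beta posterior is one simple way to quantify preference uncertainty.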