Behaviour Preference Regression for Offline Reinforcement Learning

Authors: Padmanaba Srinivasan, William Knottenbelt

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. BPR demonstrates state-of-the-art performance over all domains. Our on-policy experiments suggest that BPR takes advantage of the stability of on-policy value functions with minimal perceptible performance degradation on Locomotion datasets. Evaluation on D4RL (Fu et al. 2020) demonstrates that BPR achieves SOTA performance on Locomotion and Antmaze datasets. Additional tests on the image-based V-D4RL (Lu et al. 2022) tasks reveal that BPR is able to transition across modalities to achieve high performance in non-proprioceptive domains. We examine sensitivity to λ for off-policy BPR in a series of ablation experiments in the D4RL Locomotion tasks.
Researcher Affiliation | Academia | Department of Computing, Imperial College London EMAIL
Pseudocode | Yes | Algorithm 1 (policy improvement step); steps marked # NG do not require gradient computation:
Require: offline dataset D, pretrained EBM E(·, ·), training steps N
Output: trained policy π
for t = 1 to N do
    Sample (s, a, r, s′) ~ D
    Sample a1, a2 ~ π                      # NG
    Compute log π(a1|s), log π(a2|s)
    Compute E(s, a1), E(s, a2)             # NG
    Compute Q(s, a1), Q(s, a2)             # NG
    Update π using Equation 7
    Update critics
end for
return π
Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include a link to a code repository. It mentions that "Our actor critic implementation follows a standard implementation of SAC (Haarnoja et al. 2018)", but this refers to a third-party implementation, not their own.
Open Datasets | Yes | We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. Evaluation on D4RL (Fu et al. 2020) demonstrates that BPR achieves SOTA performance on Locomotion and Antmaze datasets. Additional tests on the image-based V-D4RL (Lu et al. 2022) tasks reveal that BPR is able to transition across modalities to achieve high performance in non-proprioceptive domains.
Dataset Splits | No | The paper refers to the existing D4RL and V-D4RL datasets and their inherent categorizations (e.g., 'medium', 'expert', 'replay' for the Locomotion datasets), but it does not specify explicit training/validation/test splits (e.g., percentages, sample counts, or references to predefined partitions within these benchmarks) that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., CPU, GPU models, or memory specifications).
Software Dependencies | No | The paper mentions following a "standard implementation of SAC" and using algorithms like CQL, IQL, and TD3+BC, but it does not list any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | In addition to the standard hyperparameters of SAC (clipped double-Q learning (Fujimoto, Hoof, and Meger 2018), entropy-regularized off-policy Q functions), our algorithm introduces the hyperparameter λ, which controls the tradeoff between the KL constraint and maximizing behavioral consistency. In general, we find that simply using λ = 1.0 works well across all tasks; our primary results use this hyperparameter value and we perform ablations to evaluate sensitivity in our experiments.
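As a runnable sketch of the data flow in Algorithm 1 — not the paper's actual method: Equation 7 is not reproduced here, and the EBM, critic, policy, and the preference-weighted surrogate update below are all hypothetical NumPy stand-ins chosen only to mirror the order of operations in the pseudocode:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the pretrained EBM E(s, a) and critic Q(s, a);
# the paper's actual networks are not reproduced here.
def energy(s, a):
    # Toy EBM: low energy near the (assumed) dataset mode a ≈ 0.
    return float(np.sum(a ** 2) + 0.01 * np.sum(s ** 2))

def q_value(s, a):
    # Toy critic score.
    return float(np.sum(a) - 0.01 * np.sum(s ** 2))

def gaussian_logpdf(a, mean, std):
    # Diagonal-Gaussian log-density log π(a|s) for a state-independent toy policy.
    return float(np.sum(-0.5 * ((a - mean) / std) ** 2
                        - np.log(std) - 0.5 * np.log(2.0 * np.pi)))

def policy_improvement_step(transition, mean, std, lam=1.0, lr=0.1):
    """One iteration of the Algorithm 1 loop, schematically.

    The true update uses Equation 7 from the paper; here a softmax preference
    over the two sampled actions (critic score minus lam-weighted energy)
    drives a simple surrogate step on the policy mean."""
    s, a, r, s_next = transition                        # Sample (s, a, r, s') ~ D
    a1 = mean + std * rng.standard_normal(mean.shape)   # a1 ~ π   (# NG)
    a2 = mean + std * rng.standard_normal(mean.shape)   # a2 ~ π   (# NG)
    logp1 = gaussian_logpdf(a1, mean, std)              # log π(a1|s)
    logp2 = gaussian_logpdf(a2, mean, std)              # log π(a2|s)
    e1, e2 = energy(s, a1), energy(s, a2)               # E(s, a)  (# NG)
    q1, q2 = q_value(s, a1), q_value(s, a2)             # Q(s, a)  (# NG)
    # Preference weight for a1: numerically stable two-way softmax.
    s1, s2 = q1 - lam * e1, q2 - lam * e2
    m = max(s1, s2)
    w1 = np.exp(s1 - m) / (np.exp(s1 - m) + np.exp(s2 - m))
    loss = -(w1 * logp1 + (1.0 - w1) * logp2)           # weighted log-likelihood
    # Surrogate step in place of "Update π using Equation 7":
    # pull the policy mean toward the preferred action.
    new_mean = mean + lr * (w1 * (a1 - mean) + (1.0 - w1) * (a2 - mean))
    return new_mean, loss
```

Repeated calls on toy transitions drift the policy mean toward actions that score well under the stand-in critic without incurring high energy; λ = 1.0 matches the paper's default.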
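To give intuition for the kind of tradeoff λ controls — a loose toy illustration, not the paper's Equation 7, where λ's exact role is defined — the snippet below scores two candidate actions by critic value minus λ-weighted energy: as λ grows, the behavioral-consistency (energy) term increasingly dominates the critic, flipping which action is preferred.

```python
import math

def preference_weight(q1, e1, q2, e2, lam):
    """Toy two-action preference: softmax over critic score minus
    lam-weighted energy. Illustrative only; not the paper's objective."""
    s1, s2 = q1 - lam * e1, q2 - lam * e2
    m = max(s1, s2)                      # stabilise the softmax
    z1, z2 = math.exp(s1 - m), math.exp(s2 - m)
    return z1 / (z1 + z2)

# Action 1 has the higher Q but also the higher energy (less dataset-like);
# its preference weight shrinks monotonically as lam increases.
for lam in (0.0, 1.0, 10.0):
    print(lam, preference_weight(q1=2.0, e1=3.0, q2=1.0, e2=1.0, lam=lam))
```

At λ = 0 the critic alone decides and action 1 is preferred; at the paper's default λ = 1.0 the preference in this toy flips toward the lower-energy action.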