Behaviour Preference Regression for Offline Reinforcement Learning

Authors: Padmanaba Srinivasan, William Knottenbelt

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. BPR demonstrates state-of-the-art performance over all domains. Our on-policy experiments suggest that BPR takes advantage of the stability of on-policy value functions with minimal perceptible performance degradation on Locomotion datasets. Evaluation on D4RL (Fu et al. 2020) demonstrates that BPR achieves SOTA performance on Locomotion and Antmaze datasets. Additional tests on the image-based V-D4RL (Lu et al. 2022) tasks reveal that BPR is able to transition across modalities to achieve high performance in non-proprioceptive domains. We examine sensitivity to λ for off-policy BPR in a series of ablation experiments in the D4RL Locomotion tasks.
Researcher Affiliation | Academia | Department of Computing, Imperial College London EMAIL
Pseudocode | Yes | Algorithm 1 (policy improvement step); steps marked # NG do not require gradient computation:
Require: offline dataset D, pretrained EBM E(·, ·), training steps N
Output: trained policy π
for t = 1 to N do
    Sample (s, a, r, s′) ~ D
    Sample a1, a2 ~ π                      # NG
    Compute log π(a1|s), log π(a2|s)
    Compute E(s, a1), E(s, a2)             # NG
    Compute Q(s, a1), Q(s, a2)             # NG
    Update π using Equation 7
    Update critics
end for
return π
Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include a link to a code repository. It mentions that "Our actor critic implementation follows a standard implementation of SAC (Haarnoja et al. 2018)", but this refers to a third-party implementation, not their own.
Open Datasets | Yes | We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. Evaluation on D4RL (Fu et al. 2020) demonstrates that BPR achieves SOTA performance on Locomotion and Antmaze datasets. Additional tests on the image-based V-D4RL (Lu et al. 2022) tasks reveal that BPR is able to transition across modalities to achieve high performance in non-proprioceptive domains.
Dataset Splits | No | The paper refers to the existing D4RL and V-D4RL datasets and their inherent categorizations (e.g., 'medium', 'expert', 'replay' for the Locomotion datasets), but it does not specify explicit training/validation/test splits (e.g., percentages, sample counts, or references to predefined partitions within these benchmarks) that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., CPU, GPU models, or memory specifications).
Software Dependencies | No | The paper mentions following a "standard implementation of SAC" and using algorithms like CQL, IQL, and TD3+BC, but it does not list any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | In addition to the standard hyperparameters of SAC (clipped double-Q learning (Fujimoto, Hoof, and Meger 2018), entropy-regularized off-policy Q functions), our algorithm introduces the hyperparameter λ, which controls the tradeoff between the KL constraint and maximizing behavioral consistency. In general, we find that simply using λ = 1.0 works well across all tasks; our primary results use this hyperparameter value and we perform ablations to evaluate sensitivity in our experiments.
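As a runnable sketch of the data flow in Algorithm 1 — not the paper's actual method: Equation 7 is not reproduced here, and the EBM, critic, policy, and the preference-weighted surrogate update below are all hypothetical NumPy stand-ins chosen only to mirror the order of operations in the pseudocode:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the pretrained EBM E(s, a) and critic Q(s, a);
# the paper's actual networks are not reproduced here.
def energy(s, a):
    # Toy EBM: low energy near the (assumed) dataset mode a ≈ 0.
    return float(np.sum(a ** 2) + 0.01 * np.sum(s ** 2))

def q_value(s, a):
    # Toy critic score.
    return float(np.sum(a) - 0.01 * np.sum(s ** 2))

def gaussian_logpdf(a, mean, std):
    # Diagonal-Gaussian log-density log π(a|s) for a state-independent toy policy.
    return float(np.sum(-0.5 * ((a - mean) / std) ** 2
                        - np.log(std) - 0.5 * np.log(2.0 * np.pi)))

def policy_improvement_step(transition, mean, std, lam=1.0, lr=0.1):
    """One iteration of the Algorithm 1 loop, schematically.

    The true update uses Equation 7 from the paper; here a softmax preference
    over the two sampled actions (critic score minus lam-weighted energy)
    drives a simple surrogate step on the policy mean."""
    s, a, r, s_next = transition                        # Sample (s, a, r, s') ~ D
    a1 = mean + std * rng.standard_normal(mean.shape)   # a1 ~ π   (# NG)
    a2 = mean + std * rng.standard_normal(mean.shape)   # a2 ~ π   (# NG)
    logp1 = gaussian_logpdf(a1, mean, std)              # log π(a1|s)
    logp2 = gaussian_logpdf(a2, mean, std)              # log π(a2|s)
    e1, e2 = energy(s, a1), energy(s, a2)               # E(s, a)  (# NG)
    q1, q2 = q_value(s, a1), q_value(s, a2)             # Q(s, a)  (# NG)
    # Preference weight for a1: numerically stable two-way softmax.
    s1, s2 = q1 - lam * e1, q2 - lam * e2
    m = max(s1, s2)
    w1 = np.exp(s1 - m) / (np.exp(s1 - m) + np.exp(s2 - m))
    loss = -(w1 * logp1 + (1.0 - w1) * logp2)           # weighted log-likelihood
    # Surrogate step in place of "Update π using Equation 7":
    # pull the policy mean toward the preferred action.
    new_mean = mean + lr * (w1 * (a1 - mean) + (1.0 - w1) * (a2 - mean))
    return new_mean, loss
```

Repeated calls on toy transitions drift the policy mean toward actions that score well under the stand-in critic without incurring high energy; λ = 1.0 matches the paper's default.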
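To give intuition for the kind of tradeoff λ controls — a loose toy illustration, not the paper's Equation 7, where λ's exact role is defined — the snippet below scores two candidate actions by critic value minus λ-weighted energy: as λ grows, the behavioral-consistency (energy) term increasingly dominates the critic, flipping which action is preferred.

```python
import math

def preference_weight(q1, e1, q2, e2, lam):
    """Toy two-action preference: softmax over critic score minus
    lam-weighted energy. Illustrative only; not the paper's objective."""
    s1, s2 = q1 - lam * e1, q2 - lam * e2
    m = max(s1, s2)                      # stabilise the softmax
    z1, z2 = math.exp(s1 - m), math.exp(s2 - m)
    return z1 / (z1 + z2)

# Action 1 has the higher Q but also the higher energy (less dataset-like);
# its preference weight shrinks monotonically as lam increases.
for lam in (0.0, 1.0, 10.0):
    print(lam, preference_weight(q1=2.0, e1=3.0, q2=1.0, e2=1.0, lam=lam))
```

At λ = 0 the critic alone decides and action 1 is preferred; at the paper's default λ = 1.0 the preference in this toy flips toward the lower-energy action.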