Online-to-Offline RL for Agent Alignment

Authors: Xu Liu, Haobo Fu, Stefano V. Albrecht, Qiang Fu, Shuai Li

ICLR 2025

Reproducibility Report (Variable: Result — LLM Response)
Research Type: Experimental — Experiments across diverse environments and preference types demonstrate the performance of ALIGN-GAP, achieving effective alignment with human preferences.
Researcher Affiliation: Collaboration — Xu Liu (Shanghai Jiao Tong University, Tencent), Haobo Fu (Tencent), Stefano V. Albrecht (University of Edinburgh), Qiang Fu (Tencent), Shuai Li (Shanghai Jiao Tong University).
Pseudocode: No — The paper describes its methodology through textual explanations and mathematical formulations, such as Equation 4 for the reward-model loss and Equation 6 for the curriculum reward, but contains no explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code: No — The paper contains no explicit statement or direct link to a public code repository for ALIGN-GAP or its experiments.
Open Datasets: Yes — "Specifically, we conduct experiments on the D4RL locomotion tasks, including Half Cheetah, Walker2D, and Hopper environments (Fu et al., 2020). Additionally, we extend our experiments to Atari Pac-Man and Space Invaders games (Brockman, 2016). To assess the performance of our proposed methods against various constructions of human preferences, we utilize Google's Atari Replay Datasets (Agarwal et al., 2020)."
Dataset Splits: No — The paper describes only data collection, not explicit train/validation/test splits: "After training the human proxies, we collect a small amount of human data (only 10 episodes) using the proxies for subsequent agent alignment. In each environment, we obtain three distinct preferences of human data, retaining only 10 episodes for each preference."
Hardware Specification: No — The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies: No — The paper mentions Soft Actor-Critic (SAC) (Haarnoja et al., 2018), Deep Q-Networks (DQN) (Roderick et al., 2017), UMAP (McInnes et al., 2018), and PPO (Schulman et al., 2017), but provides no version numbers for these components or any other libraries.
Experiment Setup: Yes — The hyper-parameters and values are summarized in Table 5 ("Hyper-parameters for agent alignment with ALIGN-GAP and baselines"):

Online Pre-train Steps: 5e6
Offline Alignment Steps: 1e6
SAC Actor Learning Rate: 3e-4
SAC Critic Learning Rate: 3e-4
SAC Batch Size: 256
Offline Trajectory Number: 10
DQN Learning Rate: 5e-5
DQN Target Update Interval: 10000
DQN Batch Size: 32
Reward Model Sequence Length: 64
Reward Model Latent Dim: 256
Reward Model Training Batch Size: 64
Reward Model Learning Rate: 1e-4
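For anyone attempting to reproduce the setup, the Table 5 values can be collected into a plain configuration dictionary. This is only a convenience sketch: the key names below are illustrative, since the paper releases no code or official config file.

```python
# Hyper-parameters reported in Table 5 of the ALIGN-GAP paper.
# Key names are illustrative (no official config exists); values
# are transcribed directly from the table.
ALIGN_GAP_HPARAMS = {
    "online_pretrain_steps": int(5e6),
    "offline_alignment_steps": int(1e6),
    "sac_actor_lr": 3e-4,
    "sac_critic_lr": 3e-4,
    "sac_batch_size": 256,
    "offline_trajectory_number": 10,
    "dqn_lr": 5e-5,
    "dqn_target_update_interval": 10000,
    "dqn_batch_size": 32,
    "reward_model_sequence_length": 64,
    "reward_model_latent_dim": 256,
    "reward_model_train_batch_size": 64,
    "reward_model_lr": 1e-4,
}
```

Keeping the values in one dictionary makes it straightforward to log the full configuration alongside any reproduction run.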