Online-to-Offline RL for Agent Alignment

Authors: Xu Liu, Haobo Fu, Stefano V. Albrecht, Qiang Fu, Shuai Li

ICLR 2025

Reproducibility Report (Variable: Result — LLM Response)
Research Type: Experimental — Experiments across diverse environments and preference types demonstrate the performance of ALIGN-GAP, achieving effective alignment with human preferences.
Researcher Affiliation: Collaboration — Xu Liu (Shanghai Jiao Tong University, Tencent), Haobo Fu (Tencent), Stefano V. Albrecht (University of Edinburgh), Qiang Fu (Tencent), Shuai Li (Shanghai Jiao Tong University).
Pseudocode: No — The paper describes its methodology through textual explanations and mathematical formulations, such as Equation 4 for the reward-model loss and Equation 6 for the curriculum reward, but contains no explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code: No — The paper contains no explicit statement or direct link to a public code repository for ALIGN-GAP or its experiments.
Open Datasets: Yes — "Specifically, we conduct experiments on the D4RL locomotion tasks, including Half Cheetah, Walker2D, and Hopper environments (Fu et al., 2020). Additionally, we extend our experiments to Atari Pac-Man and Space Invaders games (Brockman, 2016). To assess the performance of our proposed methods against various constructions of human preferences, we utilize Google's Atari Replay Datasets (Agarwal et al., 2020)."
Dataset Splits: No — The paper describes only data collection, not explicit train/validation/test splits: "After training the human proxies, we collect a small amount of human data (only 10 episodes) using the proxies for subsequent agent alignment. In each environment, we obtain three distinct preferences of human data, retaining only 10 episodes for each preference."
Hardware Specification: No — The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies: No — The paper mentions Soft Actor-Critic (SAC) (Haarnoja et al., 2018), Deep Q-Networks (DQN) (Roderick et al., 2017), UMAP (McInnes et al., 2018), and PPO (Schulman et al., 2017), but provides no version numbers for these components or any other libraries.
Experiment Setup: Yes — The hyper-parameters and values are summarized in Table 5 ("Hyper-parameters for agent alignment with ALIGN-GAP and baselines"):

Online Pre-train Steps: 5e6
Offline Alignment Steps: 1e6
SAC Actor Learning Rate: 3e-4
SAC Critic Learning Rate: 3e-4
SAC Batch Size: 256
Offline Trajectory Number: 10
DQN Learning Rate: 5e-5
DQN Target Update Interval: 10000
DQN Batch Size: 32
Reward Model Sequence Length: 64
Reward Model Latent Dim: 256
Reward Model Training Batch Size: 64
Reward Model Learning Rate: 1e-4
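For anyone attempting to reproduce the setup, the Table 5 values can be collected into a plain configuration dictionary. This is only a convenience sketch: the key names below are illustrative, since the paper releases no code or official config file.

```python
# Hyper-parameters reported in Table 5 of the ALIGN-GAP paper.
# Key names are illustrative (no official config exists); values
# are transcribed directly from the table.
ALIGN_GAP_HPARAMS = {
    "online_pretrain_steps": int(5e6),
    "offline_alignment_steps": int(1e6),
    "sac_actor_lr": 3e-4,
    "sac_critic_lr": 3e-4,
    "sac_batch_size": 256,
    "offline_trajectory_number": 10,
    "dqn_lr": 5e-5,
    "dqn_target_update_interval": 10000,
    "dqn_batch_size": 32,
    "reward_model_sequence_length": 64,
    "reward_model_latent_dim": 256,
    "reward_model_train_batch_size": 64,
    "reward_model_lr": 1e-4,
}
```

Keeping the values in one dictionary makes it straightforward to log the full configuration alongside any reproduction run.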