Online-to-Offline RL for Agent Alignment
Authors: Xu Liu, Haobo Fu, Stefano V. Albrecht, Qiang Fu, Shuai Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across diverse environments and preference types demonstrate the performance of ALIGN-GAP, achieving effective alignment with human preferences. |
| Researcher Affiliation | Collaboration | Xu Liu (1,2), Haobo Fu (2), Stefano V. Albrecht (3), Qiang Fu (2), Shuai Li (1) — (1) Shanghai Jiao Tong University, (2) Tencent, (3) University of Edinburgh |
| Pseudocode | No | The paper describes its methodology using textual explanations and mathematical formulations, such as Equation 4 for the reward model loss and Equation 6 for the curriculum reward. However, there are no explicitly labeled pseudocode blocks or algorithm listings in the paper. |
| Open Source Code | No | The paper does not contain an explicit statement or a direct link to a code repository indicating that the source code for ALIGN-GAP or its experiments is publicly available. |
| Open Datasets | Yes | Specifically, we conduct experiments on the D4RL locomotion tasks, including Half Cheetah, Walker2D, and Hopper environments (Fu et al., 2020). Additionally, we extend our experiments to Atari Pac-Man and Space Invaders games (Brockman, 2016). To assess the performance of our proposed methods against various constructions of human preferences, we utilize Google's Atari Replay Datasets (Agarwal et al., 2020) |
| Dataset Splits | No | After training the human proxies, we collect a small amount of human data (only 10 episodes) using the proxies for subsequent agent alignment. In each environment, we obtain three distinct preferences of human data, retaining only 10 episodes for each preference. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) (Haarnoja et al., 2018), Deep Q-Networks (DQN) (Roderick et al., 2017), UMAP (McInnes et al., 2018), and PPO (Schulman et al., 2017). However, specific version numbers for these software components or any other libraries are not provided. |
| Experiment Setup | Yes | The hyper-parameters and values are summarized in Table 5 ("Hyper-parameters for agent alignment with ALIGN-GAP and baselines"): Online Pre-train Steps: 5e6; Offline Alignment Steps: 1e6; SAC Actor Learning Rate: 3e-4; SAC Critic Learning Rate: 3e-4; SAC Batch Size: 256; Offline Trajectory Number: 10; DQN Learning Rate: 5e-5; DQN Target Update Interval: 10000; DQN Batch Size: 32; Reward Model Sequence Length: 64; Reward Model Latent Dim: 256; Reward Model Training Batch Size: 64; Reward Model Learning Rate: 1e-4 |
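For anyone attempting a reproduction, the hyper-parameters reported in Table 5 can be collected into a plain configuration object. This is a minimal sketch: the dictionary keys are descriptive names chosen here for illustration, since the paper lists only the parameter labels and values, not any code-level naming.

```python
# Hyper-parameters from Table 5 of the ALIGN-GAP paper, gathered into a
# single dict. Key names are hypothetical; values are as reported.
ALIGN_GAP_HPARAMS = {
    "online_pretrain_steps": int(5e6),
    "offline_alignment_steps": int(1e6),
    "sac_actor_lr": 3e-4,
    "sac_critic_lr": 3e-4,
    "sac_batch_size": 256,
    "offline_trajectory_number": 10,
    "dqn_lr": 5e-5,
    "dqn_target_update_interval": 10_000,
    "dqn_batch_size": 32,
    "reward_model_sequence_length": 64,
    "reward_model_latent_dim": 256,
    "reward_model_batch_size": 64,
    "reward_model_lr": 1e-4,
}

if __name__ == "__main__":
    # Print the configuration for a quick sanity check before a run.
    for name, value in ALIGN_GAP_HPARAMS.items():
        print(f"{name}: {value}")
```

Keeping the settings in one dict makes it easy to log the exact configuration alongside results, which matters here given that the paper reports no dataset splits or hardware details.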