Improving Reward Model Generalization from Adversarial Process Enhanced Preferences
Authors: Zhilong Zhang, Tian Xu, Xinghao Du, Xingchen Cao, Yihao Sun, Yang Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance. Our code is released at https://github.com/Zzl35/APEC. |
| Researcher Affiliation | Collaboration | ¹National Key Laboratory for Novel Software Technology, Nanjing University; ²School of Artificial Intelligence, Nanjing University, China; ³Polixir.ai. Correspondence to: Yang Yu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adversarial Imitation Learning. Require: Initialized reward r₁, initialized policy π̂₁, reward step size η_r = √(|S||A| / (4H²K)), policy step size η_π = √(2 ln(|A|) / (H²K)). for k = 1, 2, . . . , K do |
| Open Source Code | Yes | Our code is released at https://github.com/Zzl35/APEC. |
| Open Datasets | Yes | Benchmark. We evaluate our method on five tasks from the feature-based Mujoco benchmark (Todorov et al., 2012), and three tasks from the pixel-based DMControl benchmark (Tassa et al., 2018), which are leading benchmarks in reinforcement learning and imitation learning that provide a diverse set of continuous control tasks. |
| Dataset Splits | No | The paper describes how suboptimal demonstrations are selected (performance ranging from 50% to 80% of the optimal) and how test datasets are created (uniformly sampling 1,000 trajectories from replay buffers). It also specifies the number of demonstrations used (one for Mujoco, ten for DMControl). However, it does not provide conventional train/validation/test splits with explicit percentages, sample counts, or citations to predefined splits for a fixed dataset; instead, it describes a data-generation and usage strategy. |
| Hardware Specification | Yes | All experiments were performed on an RTX 4090 GPU platform. |
| Software Dependencies | No | The paper mentions using SAC (Haarnoja et al., 2018), DrQ-v2 (Yarats et al.), DAC (Kostrikov et al., 2019), PAIL (Cao et al., 2024), and ROT (Haldar et al., 2023) as algorithms or codebases. However, it does not provide version numbers for any underlying software dependencies such as Python, PyTorch, CUDA, or other libraries. |
| Experiment Setup | Yes | Appendix B.1 provides several tables listing hyperparameters for different stages of the method: Table 5 (Hyperparameters of AIL in Mujoco tasks), Table 6 (Hyperparameters of AIL in DMControl tasks), Table 7 (Hyperparameters of preference generation), Table 8 (Hyperparameters of reward model in Mujoco tasks), and Table 9 (Hyperparameters of reward model in DMControl tasks). These tables include details such as hidden layers, hidden dimensions, activation functions, batch sizes, learning rates, and optimizer settings. |
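The Algorithm 1 fragment quoted in the Pseudocode row can be illustrated with a minimal tabular sketch. Only the two step-size formulas (η_r and η_π) come from the quoted Require line; everything else here — the function name `ail_sketch`, the occupancy simplification that ignores dynamics, the ±1 reward clipping, and the multiplicative-weights policy update — is a hypothetical reconstruction, not the paper's implementation.

```python
import numpy as np

def ail_sketch(expert_occ, n_states, n_actions, H, K):
    """Hypothetical tabular sketch of an adversarial imitation loop.

    Simplifying assumption: the learner's occupancy is approximated by its
    per-state action distribution under a uniform state distribution, and
    environment dynamics are ignored entirely.

    expert_occ: (n_states, n_actions) expert state-action occupancy.
    H: horizon, K: number of iterations.
    """
    # Step sizes as stated in the quoted Require line of Algorithm 1.
    eta_r = np.sqrt(n_states * n_actions / (4 * H**2 * K))
    eta_pi = np.sqrt(2 * np.log(n_actions) / (H**2 * K))

    r = np.zeros((n_states, n_actions))       # initialized reward r_1
    logits = np.zeros((n_states, n_actions))  # initialized policy pi_1 (uniform)

    for _ in range(K):
        # Current softmax policy.
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        # Reward ascent: raise reward where the expert visits more than the
        # learner (learner occupancy approximated as pi / n_states).
        r = np.clip(r + eta_r * (expert_occ - pi / n_states), -1.0, 1.0)
        # Policy update via multiplicative weights on the current reward.
        logits += eta_pi * r

    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return r, pi
```

Under this simplification, the learned policy concentrates on the actions the expert occupancy favors, which is the qualitative behavior the reward-policy adversarial loop is meant to produce.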