Improving Reward Model Generalization from Adversarial Process Enhanced Preferences
Authors: Zhilong Zhang, Tian Xu, Xinghao Du, Xingchen Cao, Yihao Sun, Yang Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance. Our code is released at https://github.com/Zzl35/APEC. |
| Researcher Affiliation | Collaboration | ¹National Key Laboratory for Novel Software Technology, Nanjing University; ²School of Artificial Intelligence, Nanjing University, China; ³Polixir.ai. Correspondence to: Yang Yu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adversarial Imitation Learning. Require: Initialized reward r₁, initialized policy π̂₁, reward step size η_r = √(|S||A| / (4H²K)), policy step size η_π = √(2 ln(|A|) / (H²K)). for k = 1, 2, . . . , K do |
| Open Source Code | Yes | Our code is released at https://github.com/Zzl35/APEC. |
| Open Datasets | Yes | Benchmark. We evaluate our method on five tasks from the feature-based Mujoco benchmark (Todorov et al., 2012), and three tasks from the pixel-based DMControl benchmark (Tassa et al., 2018), which are leading benchmarks in reinforcement learning and imitation learning that provide a diverse set of continuous control tasks. |
| Dataset Splits | No | The paper describes how suboptimal demonstrations are selected (performance ranging from 50% to 80% of the optimal) and how test datasets are created (uniformly sampling 1,000 trajectories from replay buffers). It also specifies the number of demonstrations used (one for Mujoco, ten for DMControl). However, it does not provide conventional train/validation/test splits with explicit percentages, sample counts, or citations to predefined splits for a fixed dataset; instead, it describes a data-generation and usage strategy. |
| Hardware Specification | Yes | All experiments were performed on an RTX 4090 GPU platform. |
| Software Dependencies | No | The paper mentions using SAC (Haarnoja et al., 2018), DrQ-v2 (Yarats et al.), DAC (Kostrikov et al., 2019), PAIL (Cao et al., 2024), and ROT (Haldar et al., 2023) as algorithms or codebases. However, it does not provide version numbers for any underlying software dependencies such as Python, PyTorch, CUDA, or other libraries. |
| Experiment Setup | Yes | Appendix B.1 provides several tables listing hyperparameters for different stages of the method: Table 5 (Hyperparameters of AIL in Mujoco tasks), Table 6 (Hyperparameters of AIL in DMControl tasks), Table 7 (Hyperparameters of preference generation), Table 8 (Hyperparameters of reward model in Mujoco tasks), and Table 9 (Hyperparameters of reward model in DMControl tasks). These tables include details such as hidden layers, hidden dimensions, activation functions, batch sizes, learning rates, and optimizer settings. |
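The Algorithm 1 fragment quoted in the Pseudocode row can be illustrated with a minimal tabular sketch. Only the two step-size formulas (η_r and η_π) come from the quoted Require line; everything else here — the function name `ail_sketch`, the occupancy simplification that ignores dynamics, the ±1 reward clipping, and the multiplicative-weights policy update — is a hypothetical reconstruction, not the paper's implementation.

```python
import numpy as np

def ail_sketch(expert_occ, n_states, n_actions, H, K):
    """Hypothetical tabular sketch of an adversarial imitation loop.

    Simplifying assumption: the learner's occupancy is approximated by its
    per-state action distribution under a uniform state distribution, and
    environment dynamics are ignored entirely.

    expert_occ: (n_states, n_actions) expert state-action occupancy.
    H: horizon, K: number of iterations.
    """
    # Step sizes as stated in the quoted Require line of Algorithm 1.
    eta_r = np.sqrt(n_states * n_actions / (4 * H**2 * K))
    eta_pi = np.sqrt(2 * np.log(n_actions) / (H**2 * K))

    r = np.zeros((n_states, n_actions))       # initialized reward r_1
    logits = np.zeros((n_states, n_actions))  # initialized policy pi_1 (uniform)

    for _ in range(K):
        # Current softmax policy.
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        # Reward ascent: raise reward where the expert visits more than the
        # learner (learner occupancy approximated as pi / n_states).
        r = np.clip(r + eta_r * (expert_occ - pi / n_states), -1.0, 1.0)
        # Policy update via multiplicative weights on the current reward.
        logits += eta_pi * r

    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return r, pi
```

Under this simplification, the learned policy concentrates on the actions the expert occupancy favors, which is the qualitative behavior the reward-policy adversarial loop is meant to produce.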