Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
Authors: Hyungkyu Kang, Min-hwan Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, performing comparably to existing state-of-the-art methods. |
| Researcher Affiliation | Academia | Hyungkyu Kang, Seoul National University, Seoul, South Korea; Min-hwan Oh, Seoul National University, Seoul, South Korea |
| Pseudocode | Yes | Algorithm 1 Adversarial Preference-based Policy Optimization with Rollout (APPO) ... Algorithm 2 Adversarial Preference-based Policy Optimization (APPO) ... Algorithm 3 PE: Monte Carlo Policy Evaluation ... Algorithm 4 APPO (Practical version) |
| Open Source Code | Yes | Our code is available at https://github.com/oh-lab/APPO.git. |
| Open Datasets | Yes | We evaluate our proposed algorithm on the Meta-World (Yu et al., 2020) medium-replay and medium-expert datasets from Choi et al. (2024). |
| Dataset Splits | No | The paper uses Meta-World medium-replay and medium-expert datasets and describes how preference labels are generated or segment lengths are set, but it does not specify explicit training/validation/test splits of the trajectory data or environment data for reproducing experiments. |
| Hardware Specification | Yes | Experiments were conducted on an Intel Xeon Gold 6226R CPU and an Nvidia GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions 'Adam (Kingma & Ba, 2015)' as an optimizer and 'neural networks' for implementation, but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 4: Implementation details and hyperparameters. For the reward model and MR algorithm, we follow the official implementation of Choi et al. (2024). ... Reward model: 3-layer neural network, hidden dimension 128; optimizer Adam, learning rate 1e-3; batch size 512; epochs 300. ... Neural networks (Q, V, π): 3 layers, hidden dimension 256; Q, V optimizer Adam with learning rate 3e-4; π optimizer Adam with learning rate 3e-5; batch size 256; target network soft update 0.005; discount factor 0.99. |
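The hyperparameters quoted from the paper's Table 4 can be collected into a single configuration sketch. This is illustrative only: the key names and structure below are assumptions, not taken from the authors' implementation (the official code at https://github.com/oh-lab/APPO.git may organize these differently).

```python
# Hedged sketch: hyperparameters transcribed from the paper's Table 4.
# All dictionary key names are illustrative assumptions, not the
# authors' actual configuration schema.
APPO_CONFIG = {
    "reward_model": {
        "num_layers": 3,
        "hidden_dim": 128,
        "optimizer": "Adam",
        "learning_rate": 1e-3,
        "batch_size": 512,
        "epochs": 300,
    },
    "policy_optimization": {  # Q, V, and pi networks
        "num_layers": 3,
        "hidden_dim": 256,
        "optimizer": "Adam",
        "qv_learning_rate": 3e-4,   # Q and V networks
        "pi_learning_rate": 3e-5,   # policy network
        "batch_size": 256,
        "target_soft_update": 0.005,
        "discount_factor": 0.99,
    },
}

if __name__ == "__main__":
    # Print each section so the settings can be eyeballed against Table 4.
    for section, params in APPO_CONFIG.items():
        print(section, params)
```

A dictionary like this makes it easy to diff a local reproduction attempt against the paper's reported settings before running training.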