Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning

Authors: Hyungkyu Kang, Min-hwan Oh

ICLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental
  LLM Response: "Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, showing comparable performance with existing state-of-the-art methods."

Researcher Affiliation: Academia
  LLM Response: "Hyungkyu Kang, Seoul National University, Seoul, South Korea, EMAIL; Min-hwan Oh, Seoul National University, Seoul, South Korea, EMAIL"

Pseudocode: Yes
  LLM Response: "Algorithm 1: Adversarial Preference-based Policy Optimization with Rollout (APPO) ... Algorithm 2: Adversarial Preference-based Policy Optimization (APPO) ... Algorithm 3: PE: Monte Carlo Policy Evaluation ... Algorithm 4: APPO (Practical version)"

Open Source Code: Yes
  LLM Response: "Our code is available at https://github.com/oh-lab/APPO.git."

Open Datasets: Yes
  LLM Response: "We evaluate our proposed algorithm on the Meta-World (Yu et al., 2020) medium-replay and medium-expert datasets from Choi et al. (2024)."

Dataset Splits: No
  LLM Response: "The paper uses the Meta-World medium-replay and medium-expert datasets and describes how preference labels are generated and how segment lengths are set, but it does not specify explicit training/validation/test splits of the trajectory or environment data for reproducing experiments."

Hardware Specification: Yes
  LLM Response: "Experiments were conducted on an Intel Xeon Gold 6226R CPU and an Nvidia GeForce RTX 3090 GPU."

Software Dependencies: No
  LLM Response: "The paper mentions 'Adam (Kingma & Ba, 2015)' as an optimizer and 'neural networks' for implementation, but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions)."

Experiment Setup: Yes
  LLM Response: "Table 4: Implementation details and hyperparameters. For the reward model and MR algorithm, we follow the official implementation of Choi et al. (2024). ... Reward model: neural networks, 3 layers, hidden dimension 128 ... Optimizer: Adam, learning rate 1e-3; batch size 512; epochs 300 ... Neural networks (Q, V, π): 3 layers, hidden dimension 256 ... Q, V, π optimizer: Adam with learning rate 3e-4; batch size 256; target network soft update 0.005 ... discount factor 0.99 ... π optimizer: Adam with learning rate 3e-5"
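The hyperparameters quoted above can be collected into a minimal configuration sketch. This is an illustrative reconstruction, not the authors' code: the class name, field names, and the `soft_update` helper (Polyak averaging with the reported coefficient 0.005) are all assumptions layered on the values reported in Table 4 of the paper.

```python
from dataclasses import dataclass

@dataclass
class APPOConfig:
    """Hyperparameters as reported in Table 4 (names are illustrative)."""
    # Reward model: 3-layer MLP
    reward_hidden_dim: int = 128
    reward_lr: float = 1e-3
    reward_batch_size: int = 512
    reward_epochs: int = 300
    # Q, V, pi networks: 3-layer MLPs
    critic_hidden_dim: int = 256
    qv_lr: float = 3e-4
    pi_lr: float = 3e-5
    batch_size: int = 256
    tau: float = 0.005    # target network soft-update coefficient
    gamma: float = 0.99   # discount factor

def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

Note the table reports two learning rates involving π (3e-4 in the joint "Q, V, π" entry and 3e-5 in the separate "π optimizer" entry); the sketch keeps both as distinct fields rather than guessing which applies where.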