Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
Authors: Hyungkyu Kang, Min-hwan Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, performing comparably to existing state-of-the-art methods. |
| Researcher Affiliation | Academia | Hyungkyu Kang, Seoul National University, Seoul, South Korea; Min-hwan Oh, Seoul National University, Seoul, South Korea |
| Pseudocode | Yes | Algorithm 1 Adversarial Preference-based Policy Optimization with Rollout (APPO) ... Algorithm 2 Adversarial Preference-based Policy Optimization (APPO) ... Algorithm 3 PE: Monte Carlo Policy Evaluation ... Algorithm 4 APPO (Practical version) |
| Open Source Code | Yes | Our code is available at https://github.com/oh-lab/APPO.git. |
| Open Datasets | Yes | We evaluate our proposed algorithm on the Meta-World (Yu et al., 2020) medium-replay and medium-expert datasets from Choi et al. (2024). |
| Dataset Splits | No | The paper uses Meta-World medium-replay and medium-expert datasets and describes how preference labels are generated or segment lengths are set, but it does not specify explicit training/validation/test splits of the trajectory data or environment data for reproducing experiments. |
| Hardware Specification | Yes | Experiments were conducted on an Intel Xeon Gold 6226R CPU and an Nvidia GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions 'Adam (Kingma & Ba, 2015)' as an optimizer and 'neural networks' for implementation, but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 4: Implementation details and hyperparameters. For the reward model and MR algorithm, we follow the official implementation of Choi et al. (2024). ... Reward model: 3-layer neural network, hidden dimension 128; optimizer Adam, learning rate 1e-3; batch size 512; epochs 300. ... Neural networks (Q, V, π): 3 layers, hidden dimension 256; Q, V optimizer Adam with learning rate 3e-4; π optimizer Adam with learning rate 3e-5; batch size 256; target network soft update 0.005; discount factor 0.99. |
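The hyperparameters quoted from the paper's Table 4 can be collected into a single configuration sketch. This is illustrative only: the key names and structure below are assumptions, not taken from the authors' implementation (the official code at https://github.com/oh-lab/APPO.git may organize these differently).

```python
# Hedged sketch: hyperparameters transcribed from the paper's Table 4.
# All dictionary key names are illustrative assumptions, not the
# authors' actual configuration schema.
APPO_CONFIG = {
    "reward_model": {
        "num_layers": 3,
        "hidden_dim": 128,
        "optimizer": "Adam",
        "learning_rate": 1e-3,
        "batch_size": 512,
        "epochs": 300,
    },
    "policy_optimization": {  # Q, V, and pi networks
        "num_layers": 3,
        "hidden_dim": 256,
        "optimizer": "Adam",
        "qv_learning_rate": 3e-4,   # Q and V networks
        "pi_learning_rate": 3e-5,   # policy network
        "batch_size": 256,
        "target_soft_update": 0.005,
        "discount_factor": 0.99,
    },
}

if __name__ == "__main__":
    # Print each section so the settings can be eyeballed against Table 4.
    for section, params in APPO_CONFIG.items():
        print(section, params)
```

A dictionary like this makes it easy to diff a local reproduction attempt against the paper's reported settings before running training.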