S-EPOA: Overcoming the Indistinguishability of Segments with Skill-Driven Preference-Based Reinforcement Learning
Authors: Ni Mu, Yao Luan, Yiqin Yang, Bo Xu, Qing-Shan Jia
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on a range of tasks, including robotic manipulation and locomotion, demonstrate that S-EPOA significantly outperforms conventional PbRL methods in terms of both robustness and learning efficiency. The results highlight the effectiveness of skill-driven learning in overcoming the challenges posed by segment indistinguishability. |
| Researcher Affiliation | Academia | 1. Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University; 2. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 S-EPOA Framework |
| Open Source Code | No | The paper does not provide an explicit statement or a direct link to the source code for the methodology described in this paper (S-EPOA). It mentions using APS [Liu and Abbeel, 2021a] for unsupervised skill discovery, but this refers to a third-party tool rather than the authors' own implementation. |
| Open Datasets | Yes | We evaluate S-EPOA on several complex robotic manipulation and locomotion tasks from DMControl [Tassa et al., 2018] and Metaworld [Yu et al., 2020]. Specifically, we choose 4 complex tasks in DMControl: Cheetah_run, Walker_run, Quadruped_walk, Quadruped_run, and 3 complex tasks in Metaworld: Door_open, Button_press, Window_open. |
| Dataset Splits | No | The paper describes evaluations on simulated environments (DMControl, Metaworld) and reports performance across 5 random seeds. It does not provide specific train/test/validation dataset splits in the conventional sense for pre-existing datasets, as data is generated through interaction with the environments. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware specifications (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several algorithms and frameworks like SAC, APS, DIAYN, and CIC, but does not specify the version numbers of any software dependencies (e.g., Python, PyTorch, TensorFlow, or specific libraries) used in the implementation. |
| Experiment Setup | Yes | The paper specifies experimental parameters such as the frequency of feedback K, the number of queries M per feedback session, the total feedback number Ntotal (in Algorithm 1), and error rates ϵ ∈ {0.1, 0.2, 0.3} for the noisy teacher. Appendix B.2 further details the architecture of the reward model (two hidden layers of 256 units with ReLU activations, 4 reward models in an ensemble), segment length H=50, and batch size 256, stating that default hyperparameters from PEBBLE [Lee et al., 2021b] were adopted for baselines. |
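The reward-model configuration reported in the Experiment Setup row can be sketched as follows. This is a hypothetical, dependency-free re-implementation for illustration only (not the authors' code): an ensemble of 4 MLPs, each with two 256-unit ReLU hidden layers and a scalar output, averaged over the ensemble; the observation/action dimensions and weight initialization are assumptions.

```python
import random

def init_mlp(in_dim, hidden=256, seed=0):
    """Random weights for a 2-hidden-layer MLP with a scalar output."""
    rng = random.Random(seed)
    def layer(n_in, n_out):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
                for _ in range(n_out)]
    return [layer(in_dim, hidden), layer(hidden, hidden), layer(hidden, 1)]

def forward(mlp, x):
    """Forward pass: ReLU after each hidden layer, linear scalar output."""
    h = x
    for i, w in enumerate(mlp):
        h = [sum(wi * xi for wi, xi in zip(row, h)) for row in w]
        if i < len(mlp) - 1:            # ReLU on hidden layers only
            h = [max(0.0, v) for v in h]
    return h[0]

# Ensemble of 4 reward models over (state, action) inputs, as in Appendix B.2.
OBS_DIM, ACT_DIM = 17, 6               # assumed dimensions for illustration
ensemble = [init_mlp(OBS_DIM + ACT_DIM, seed=k) for k in range(4)]

sa = [0.1] * (OBS_DIM + ACT_DIM)       # one (state, action) pair
reward = sum(forward(m, sa) for m in ensemble) / len(ensemble)
```

In practice such an ensemble would be trained on preference labels (e.g., with a Bradley-Terry loss over segment pairs of length H=50, batch size 256), with the ensemble mean used as the learned reward and the spread used as an uncertainty signal.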