S-EPOA: Overcoming the Indistinguishability of Segments with Skill-Driven Preference-Based Reinforcement Learning

Authors: Ni Mu, Yao Luan, Yiqin Yang, Bo Xu, Qing-Shan Jia

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on a range of tasks, including robotic manipulation and locomotion, demonstrate that S-EPOA significantly outperforms conventional PbRL methods in both robustness and learning efficiency. The results highlight the effectiveness of skill-driven learning in overcoming the challenges posed by segment indistinguishability.
Researcher Affiliation | Academia | ¹Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University; ²The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: S-EPOA Framework
Open Source Code | No | The paper provides no explicit statement of, or direct link to, source code for the methodology it describes (S-EPOA). It mentions using APS [Liu and Abbeel, 2021a] for unsupervised skill discovery, but this is a third-party method rather than the authors' own implementation.
Open Datasets | Yes | We evaluate S-EPOA on several complex robotic manipulation and locomotion tasks from DMControl [Tassa et al., 2018] and Metaworld [Yu et al., 2020]. Specifically, we choose 4 complex tasks in DMControl (Cheetah_run, Walker_run, Quadruped_walk, Quadruped_run) and 3 complex tasks in Metaworld (Door_open, Button_press, Window_open).
Dataset Splits | No | The paper evaluates on simulated environments (DMControl, Metaworld) and reports performance across 5 random seeds. It provides no conventional train/validation/test splits, since data are generated through interaction with the environments rather than drawn from a pre-existing dataset.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions several algorithms and frameworks, such as SAC, APS, DIAYN, and CIC, but gives no version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or specific libraries) used in the implementation.
Experiment Setup | Yes | The paper specifies experimental parameters such as the feedback frequency K, the number of queries M per feedback session, the total feedback budget N_total (Algorithm 1), and error rates ϵ ∈ {0.1, 0.2, 0.3} for the noisy teacher. Appendix B.2 further details the reward model architecture (two hidden layers of 256 units with ReLU activations; an ensemble of 4 reward models), segment length H = 50, and batch size 256, and states that default hyperparameters from PEBBLE [Lee et al., 2021b] were adopted for the baselines.
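The reward-model details reported above (an ensemble of 4 MLPs, each with two 256-unit ReLU hidden layers, scoring segments of length H = 50) can be sketched as follows. This is a minimal NumPy illustration using the standard Bradley-Terry preference model common in PbRL (e.g., PEBBLE), not the authors' code; the input dimension `OBS_ACT_DIM` and all function names are hypothetical.

```python
import numpy as np

H = 50             # segment length, per Appendix B.2
OBS_ACT_DIM = 24   # hypothetical concatenated state-action dimension
N_MODELS = 4       # ensemble size, per Appendix B.2

def init_mlp(in_dim, hidden=256, seed=0):
    """Random weights for an MLP: in_dim -> 256 -> 256 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim, hidden, hidden, 1]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def reward(params, x):
    """Per-step predicted reward; x has shape (..., in_dim)."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on the two hidden layers
    return x[..., 0]

ensemble = [init_mlp(OBS_ACT_DIM, seed=s) for s in range(N_MODELS)]

def preference_prob(segment_a, segment_b):
    """Ensemble-averaged Bradley-Terry probability that segment_a is
    preferred: P(a > b) = sigmoid(sum_t r(a_t) - sum_t r(b_t))."""
    probs = []
    for params in ensemble:
        diff = reward(params, segment_a).sum() - reward(params, segment_b).sum()
        diff = np.clip(diff, -50.0, 50.0)  # avoid exp overflow
        probs.append(1.0 / (1.0 + np.exp(-diff)))
    return float(np.mean(probs))

# Example: identical segments are preferred with probability exactly 0.5.
seg = np.zeros((H, OBS_ACT_DIM))
print(preference_prob(seg, seg))  # → 0.5
```

Training would minimize the cross-entropy between these probabilities and the (possibly noisy) teacher labels; the ensemble's disagreement is what methods like S-EPOA can exploit when selecting informative queries.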