S-EPOA: Overcoming the Indistinguishability of Segments with Skill-Driven Preference-Based Reinforcement Learning
Authors: Ni Mu, Yao Luan, Yiqin Yang, Bo Xu, Qing-Shan Jia
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on a range of tasks, including robotic manipulation and locomotion, demonstrate that S-EPOA significantly outperforms conventional PbRL methods in terms of both robustness and learning efficiency. The results highlight the effectiveness of skill-driven learning in overcoming the challenges posed by segment indistinguishability. |
| Researcher Affiliation | Academia | 1. Beijing Key Laboratory of Embodied Intelligence Systems, Department of Automation, Tsinghua University; 2. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 S-EPOA Framework |
| Open Source Code | No | The paper does not provide an explicit statement or a direct link to the source code for the methodology described in this paper (S-EPOA). It mentions using APS [Liu and Abbeel, 2021a] for unsupervised skill discovery, but this refers to a third-party tool rather than the authors' own implementation. |
| Open Datasets | Yes | We evaluate S-EPOA on several complex robotic manipulation and locomotion tasks from DMControl [Tassa et al., 2018] and Metaworld [Yu et al., 2020]. Specifically, we choose 4 complex tasks in DMControl: Cheetah_run, Walker_run, Quadruped_walk, Quadruped_run, and 3 complex tasks in Metaworld: Door_open, Button_press, Window_open. |
| Dataset Splits | No | The paper describes evaluations on simulated environments (DMControl, Metaworld) and reports performance across 5 random seeds. It does not provide specific train/test/validation dataset splits in the conventional sense for pre-existing datasets, as data is generated through interaction with the environments. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware specifications (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several algorithms and frameworks like SAC, APS, DIAYN, and CIC, but does not specify the version numbers of any software dependencies (e.g., Python, PyTorch, TensorFlow, or specific libraries) used in the implementation. |
| Experiment Setup | Yes | The paper specifies experimental parameters such as the frequency of feedback K, the number of queries M per feedback session, the total feedback number Ntotal (in Algorithm 1), and error rates ϵ ∈ {0.1, 0.2, 0.3} for the noisy teacher. Appendix B.2 further details the architecture of the reward model (two hidden layers of 256 units with ReLU activations, 4 reward models in an ensemble), segment length H=50, and batch size 256, stating that default hyperparameters from PEBBLE [Lee et al., 2021b] were adopted for baselines. |
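The reward-model configuration reported in the Experiment Setup row can be sketched as follows. This is a hypothetical, dependency-free re-implementation for illustration only (not the authors' code): an ensemble of 4 MLPs, each with two 256-unit ReLU hidden layers and a scalar output, averaged over the ensemble; the observation/action dimensions and weight initialization are assumptions.

```python
import random

def init_mlp(in_dim, hidden=256, seed=0):
    """Random weights for a 2-hidden-layer MLP with a scalar output."""
    rng = random.Random(seed)
    def layer(n_in, n_out):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
                for _ in range(n_out)]
    return [layer(in_dim, hidden), layer(hidden, hidden), layer(hidden, 1)]

def forward(mlp, x):
    """Forward pass: ReLU after each hidden layer, linear scalar output."""
    h = x
    for i, w in enumerate(mlp):
        h = [sum(wi * xi for wi, xi in zip(row, h)) for row in w]
        if i < len(mlp) - 1:            # ReLU on hidden layers only
            h = [max(0.0, v) for v in h]
    return h[0]

# Ensemble of 4 reward models over (state, action) inputs, as in Appendix B.2.
OBS_DIM, ACT_DIM = 17, 6               # assumed dimensions for illustration
ensemble = [init_mlp(OBS_DIM + ACT_DIM, seed=k) for k in range(4)]

sa = [0.1] * (OBS_DIM + ACT_DIM)       # one (state, action) pair
reward = sum(forward(m, sa) for m in ensemble) / len(ensemble)
```

In practice such an ensemble would be trained on preference labels (e.g., with a Bradley-Terry loss over segment pairs of length H=50, batch size 256), with the ensemble mean used as the learned reward and the spread used as an uncertainty signal.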