Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning
Authors: Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, Yusung Kim
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GAS on OGBench (Park et al., 2025a) and D4RL (Fu et al., 2020), spanning diverse dataset types. We compare its performance against offline goal-conditioned and hierarchical baselines. For each dataset, we report the average normalized return across five test-time goals, except for kitchen, which uses a single fixed goal. Each goal is evaluated with 50 rollouts, and results are averaged over 4 random seeds. Bold numbers indicate results that are at least 95% of the best-performing method in each row. Details of the datasets and baselines are provided in Appendices C and D. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Republic of Korea 2Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Republic of Korea. Correspondence to: Yusung Kim <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Task Planning and Execution Algorithm 2 TD-Aware Graph Construction |
| Open Source Code | Yes | Our source code is available at: https://github.com/qortmdgh4141/GAS. |
| Open Datasets | Yes | We evaluate GAS on OGBench (Park et al., 2025a) and D4RL (Fu et al., 2020), spanning diverse dataset types. |
| Dataset Splits | No | For each dataset, we report the average normalized return across five test-time goals, except for kitchen, which uses a single fixed goal. Each goal is evaluated with 50 rollouts, and results are averaged over 4 random seeds. We follow the goal specification protocol of OGBench (Park et al., 2025a), where each task provides five predefined state-goal pairs. |
| Hardware Specification | Yes | We run our experiments on an internal cluster consisting of RTX 3090 GPUs. |
| Software Dependencies | No | Our implementations of GAS and seven baselines are based on JAX (Bradbury et al., 2018). We apply layer normalization (Ba et al., 2016) to all MLP layers. For pixel-based environments, we adopt the Impala CNN (Espeholt et al., 2018) to process image inputs. The nonlinearity is GELU (Hendrycks & Gimpel, 2016) and the optimizer is Adam (Kingma & Ba, 2015). |
| Experiment Setup | Yes | We provide a common list of hyperparameters in Table 7 (shared across all datasets) and task-specific hyperparameters in Table 8. We apply layer normalization (Ba et al., 2016) to all MLP layers. For pixel-based environments, we adopt the Impala CNN (Espeholt et al., 2018) to process image inputs. While most components use 512-dimensional output features, we reduce the output dimension to 32 for the Temporal Distance Representation (TDR) to balance representational capacity and stability, as discussed in Appendix B. Following prior work (Park et al., 2023; 2024c; 2025a), we do not share encoders across components. As a result, in pixel-based environments, we use four separate CNN encoders for TDR, the Q-function, the value function, and the low-level policy. We also apply random crop augmentation (Kostrikov et al., 2021) with a probability of 0.5 to mitigate overfitting (Zheng et al., 2024). |
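The architectural details quoted in the setup cell (layer normalization on every MLP layer, GELU activations, and random-crop augmentation applied with probability 0.5) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the paper's implementation uses JAX, while NumPy stands in here, and the pad width, layer sizes, and function names are assumptions for the example only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean / unit variance (Ba et al., 2016).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GELU (Hendrycks & Gimpel, 2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_layer(x, w, b):
    # One MLP layer with layer normalization applied, as the paper describes:
    # Dense -> LayerNorm -> GELU.
    return gelu(layer_norm(x @ w + b))

def maybe_random_crop(img, rng, pad=4, p=0.5):
    # Random-crop augmentation applied with probability p, in the style of
    # Kostrikov et al. (2021): pad the image, then crop back to the original
    # size at a random offset. The pad width of 4 is an assumption.
    if rng.random() >= p:
        return img
    h, w, c = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]
```

Applying `maybe_random_crop` per sampled image batch with `p=0.5` matches the stated augmentation probability; the shift-by-padding crop preserves the input resolution expected by the CNN encoders.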