OGBench: Benchmarking Offline Goal-Conditioned RL

Authors: Seohong Park, Kevin Frans, Benjamin Eysenbach, Sergey Levine

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and stochasticity. While representative algorithms may rank similarly on prior benchmarks, our experiments reveal stark strengths and weaknesses in these different capabilities, providing a strong foundation for building new algorithms. Project page: https://seohong.me/projects/ogbench
Researcher Affiliation | Academia | Seohong Park (1), Kevin Frans (1), Benjamin Eysenbach (2), Sergey Levine (1); (1) University of California, Berkeley; (2) Princeton University
Pseudocode | No | The paper describes the objectives and methodologies for GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL using mathematical equations (e.g., J_GCBC(π) = E_{(s,a)∼p^D(s,a), g∼p^D_traj(g|s)}[log π(a | s, g)]), but it does not provide any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide the code as well as the exact command-line flags to reproduce the entire benchmark table, datasets, and expert policies at https://github.com/seohongpark/ogbench.
Open Datasets | Yes | In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. ... We provide the code as well as the exact command-line flags to reproduce the entire benchmark table, datasets, and expert policies at https://github.com/seohongpark/ogbench.
Dataset Splits | No | The paper states: 'Each task in OGBench accompanies five pre-defined state-goal pairs for evaluation (Appendix H)' and 'We provide a separate validation dataset for each dataset'. While it mentions evaluation goals and validation datasets, it does not specify percentages or absolute sample counts for splitting the primary datasets into training, validation, and test sets. The evaluation goals serve as test cases, but no reproducible proportions are given for splitting the core trajectory datasets.
Hardware Specification | Yes | Each run typically takes 2-5 hours (state-based tasks) or 5-12 hours (pixel-based tasks) on an A5000 GPU.
Software Dependencies | No | The paper mentions software like JAX, MuJoCo, and Adam, but does not provide specific version numbers for these components. For example, it states: 'Our implementations of six offline GCRL algorithms (...) are based on JAX (Bradbury et al., 2018)' and 'Our benchmark environments only depend on MuJoCo (Todorov et al., 2012) and do not require any other dependencies (Table 1).'
Experiment Setup | Yes | We train the agents for 1M gradient steps (500K for pixel-based tasks), and average the results over 8 seeds (4 seeds for pixel-based tasks). We provide the full list of common hyperparameters in Table 10. We find that methods are more sensitive to policy extraction hyperparameters (e.g., the BC coefficient in DDPG+BC) (Park et al., 2024a), and report these in a separate table (Table 11). Specifically, for each method, we use the same value learning hyperparameters across the benchmark except for the discount factor γ (Table 10), but individually tune the policy extraction hyperparameters (e.g., AWR α and DDPG+BC α) for each dataset category (Tables 10 and 11).
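The Experiment Setup row above states that benchmark numbers are averaged over 8 seeds (4 for pixel-based tasks). A minimal NumPy sketch of that aggregation step — the per-seed success rates here are made-up illustrative values, not results from the paper:

```python
import numpy as np

# Hypothetical per-seed evaluation success rates (%) for one state-based
# task; OGBench reports the mean over 8 seeds for state-based tasks.
success_rates = np.array([52.0, 48.0, 55.0, 50.0, 47.0, 53.0, 49.0, 51.0])

mean = success_rates.mean()
# Standard error of the mean, using the unbiased sample std (ddof=1).
stderr = success_rates.std(ddof=1) / np.sqrt(len(success_rates))

print(f"{mean:.1f} +/- {stderr:.1f}")
```

This mirrors the common "mean ± standard error over seeds" reporting convention; the paper itself only specifies the seed counts, so the error-bar choice here is an assumption.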