Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data
Authors: Ahmet H. Güzel, Ilija Bogunovic, Jack Parker-Holder
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the V-D4RL benchmark (continuous control) and Procgen benchmark (discrete control) demonstrate that our approach consistently reduces the generalization gap and improves performance in unseen environments. |
| Researcher Affiliation | Academia | Ahmet H. Güzel, Ilija Bogunovic, and Jack Parker-Holder are all affiliated with the University College London AI Centre. |
| Pseudocode | No | The paper describes the methodology using text and equations, and includes architectural diagrams (e.g., Figure 2), but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct link to a code repository, an explicit statement of code release, or mention code in supplementary materials for the methodology described. |
| Open Datasets | Yes | We evaluated our method on two challenging offline RL benchmarks that test generalization capabilities in different domains. Visual D4RL (V-D4RL) (Lu et al., 2023a): a visual-input version of the D4RL benchmark (Fu et al., 2021) focused on continuous control tasks. Offline Procgen (Mediratta et al., 2024): an offline version of the Procgen benchmark (Cobbe et al., 2020), a suite of procedurally generated games targeting discrete control tasks; it tests zero-shot generalization to entirely unseen levels. |
| Dataset Splits | No | The paper refers to datasets from the V-D4RL and Procgen benchmarks and mentions 'training and testing environments', but the main text does not give specific percentages, sample counts, or a methodology for splitting the data into training, validation, or test sets. It defers to the original benchmarks and to the supplementary material for these experimental setup details. |
| Hardware Specification | No | The paper does not contain any specific details regarding the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions algorithms used (Dr Q+BC, CQL) and refers to standard settings from original benchmark papers, but it does not provide specific version numbers for any ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | No | The hyperparameters, network architectures, and other implementation details follow the standard settings provided in the original benchmark papers. For completeness, we provide all hyperparameters and network architecture details in supplementary material. |