reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Zero-Shot Reinforcement Learning from Low Quality Data

Authors: Scott Jeen, Tom Bewley, Jonathan Cullen

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets.
Researcher Affiliation	Academia	Scott Jeen (University of Cambridge), Tom Bewley (University of Bristol), Jonathan M. Cullen (University of Cambridge)
Pseudocode	Yes	Algorithm 1 Pre-training value-conservative forward-backward representations
Open Source Code	Yes	Our code is available via the project page https://enjeeneer.io/projects/zero-shot-rl/.
Open Datasets	Yes	We respond to Q1-Q3 using the ExORL benchmark [95]. We respond to Q4 using the D4RL benchmark [21].
Dataset Splits	No	The paper does not explicitly provide training/test/validation dataset splits in terms of percentages or sample counts. It trains on a static offline dataset and evaluates performance via rollouts and task inference from D_labelled, but does not define a separate 'validation' split for the main dataset.
Hardware Specification	Yes	We train our models on NVIDIA A100 GPUs.
Software Dependencies	No	This work was enabled by: NumPy [30], PyTorch [61], Pandas [56] and Matplotlib [31]. (No version numbers provided for these software packages).
Experiment Setup	Yes	Hyperparameters are reported in Table 4. Latent dimension d: 50 (100 for maze); F / ψ dimensions: (1024, 1024); B / φ dimensions: (256, 256, 256); preprocessor dimensions: (1024, 1024); std. deviation for policy smoothing σ: 0.2; truncation level for policy smoothing: 0.3; learning steps: 1,000,000; batch size: 512; optimiser: Adam [38]; learning rate: 0.0001; discount γ: 0.98 (0.99 for maze); activations (unless otherwise stated): ReLU; target network Polyak smoothing coefficient: 0.01; z-inference labels: 10,000; z mixing ratio: 0.5; conservative budget τ: 50 (45 for D4RL); OOD action samples per policy N: 3.
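
For readers who want to reuse the Experiment Setup values, the sketch below collects them in a small Python configuration object. It is not taken from the authors' code: the class name `VCFBConfig`, the field names, the helper `config_for`, and the domain keys `"maze"` and `"d4rl"` are all hypothetical. The two helper functions illustrate common conventions that the table does not spell out and that are assumed here: a Polyak update of the form target = (1 - c) * target + c * online, and TD3-style policy smoothing with Gaussian noise clipped to the truncation level.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VCFBConfig:
    """Hypothetical container for the Table 4 values quoted above."""
    latent_dim: int = 50                      # 100 for maze
    forward_hidden: tuple = (1024, 1024)      # F / psi dimensions
    backward_hidden: tuple = (256, 256, 256)  # B / phi dimensions
    preprocessor_hidden: tuple = (1024, 1024)
    smoothing_std: float = 0.2                # policy smoothing sigma
    smoothing_clip: float = 0.3               # policy smoothing truncation level
    learning_steps: int = 1_000_000
    batch_size: int = 512
    learning_rate: float = 1e-4               # Adam
    discount: float = 0.98                    # 0.99 for maze
    polyak_coeff: float = 0.01                # target network smoothing
    z_inference_labels: int = 10_000
    z_mixing_ratio: float = 0.5
    conservative_budget: float = 50.0         # tau; 45 for D4RL
    ood_action_samples: int = 3               # N, per policy


def config_for(domain: str) -> VCFBConfig:
    """Apply the per-domain overrides noted in the table (domain keys are assumed)."""
    if domain == "maze":
        return VCFBConfig(latent_dim=100, discount=0.99)
    if domain == "d4rl":
        return VCFBConfig(conservative_budget=45.0)
    return VCFBConfig()


def polyak_update(target, online, coeff):
    """Assumed convention: move each target weight a fraction `coeff` toward the online weight."""
    return [(1.0 - coeff) * t + coeff * o for t, o in zip(target, online)]


def smoothing_noise(std, clip, rng):
    """Assumed TD3-style smoothing: Gaussian noise truncated to [-clip, clip]."""
    return max(-clip, min(clip, rng.gauss(0.0, std)))
```

For example, `config_for("maze")` returns a configuration with `latent_dim=100` and `discount=0.99`, while every other field keeps its default. Whether the maze overrides also apply to any D4RL task is not stated in the table, so the sketch treats the two overrides independently.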