Zero-Shot Reinforcement Learning from Low Quality Data

Authors: Scott Jeen, Tom Bewley, Jonathan Cullen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets.
Researcher Affiliation | Academia | Scott Jeen, University of Cambridge, EMAIL; Tom Bewley, University of Bristol, EMAIL; Jonathan M. Cullen, University of Cambridge, EMAIL
Pseudocode | Yes | Algorithm 1: Pre-training value-conservative forward-backward representations
Open Source Code | Yes | Our code is available via the project page https://enjeeneer.io/projects/zero-shot-rl/.
Open Datasets | Yes | We respond to Q1-Q3 using the ExORL benchmark [95]. We respond to Q4 using the D4RL benchmark [21].
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits in terms of percentages or sample counts. It trains on a static offline dataset and evaluates performance via rollouts and task inference from D_labelled, but does not define a separate 'validation' split for the main dataset.
Hardware Specification | Yes | We train our models on NVIDIA A100 GPUs.
Software Dependencies | No | This work was enabled by: NumPy [30], PyTorch [61], Pandas [56] and Matplotlib [31]. (No version numbers provided for these software packages.)
Experiment Setup | Yes | Hyperparameters are reported in Table 4. Latent dimension d: 50 (100 for maze); F / ψ dimensions: (1024, 1024); B / φ dimensions: (256, 256, 256); Preprocessor dimensions: (1024, 1024); Std. deviation for policy smoothing σ: 0.2; Truncation level for policy smoothing: 0.3; Learning steps: 1,000,000; Batch size: 512; Optimiser: Adam [38]; Learning rate: 0.0001; Discount γ: 0.98 (0.99 for maze); Activations (unless otherwise stated): ReLU; Target network Polyak smoothing coefficient: 0.01; z-inference labels: 10,000; z mixing ratio: 0.5; Conservative budget τ: 50 (45 for D4RL); OOD action samples per policy N: 3
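
As a rough illustration of how the Table 4 values quoted above might be gathered in code, the Python sketch below collects them in a single configuration object and shows two of the settings in use. The class name FBConfig, its field names, and both helper functions are hypothetical and are not taken from the authors' code. The sketch also assumes the common Polyak convention target = (1 - c) * target + c * online, and that "truncation level" means clipping the smoothing noise to [-0.3, 0.3]; the summary above states only the numeric values.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FBConfig:
    """Hypothetical container for the Table 4 values quoted above."""
    latent_dim: int = 50                      # 100 for maze
    forward_hidden: tuple = (1024, 1024)      # F / psi dimensions
    backward_hidden: tuple = (256, 256, 256)  # B / phi dimensions
    preprocessor_hidden: tuple = (1024, 1024)
    smoothing_std: float = 0.2                # policy smoothing sigma
    smoothing_clip: float = 0.3               # truncation level
    learning_steps: int = 1_000_000
    batch_size: int = 512
    learning_rate: float = 1e-4               # Adam
    discount: float = 0.98                    # 0.99 for maze
    polyak_coeff: float = 0.01
    z_inference_labels: int = 10_000
    z_mixing_ratio: float = 0.5
    conservative_budget: float = 50.0         # 45 for D4RL
    ood_action_samples: int = 3


def polyak_update(target, online, coeff):
    """Move each target parameter a fraction `coeff` towards its online value.

    Assumed convention: target <- (1 - coeff) * target + coeff * online.
    """
    return [(1.0 - coeff) * t + coeff * o for t, o in zip(target, online)]


def smoothing_noise(std, clip, rng):
    """Gaussian noise with std `std`, clipped to [-clip, clip] (assumed meaning of truncation)."""
    return max(-clip, min(clip, rng.gauss(0.0, std)))


config = FBConfig()
maze_config = FBConfig(latent_dim=100, discount=0.99)  # maze overrides from Table 4

# Toy parameters stand in for network weights.
target = polyak_update([0.0, 1.0], [1.0, 1.0], config.polyak_coeff)
noise = smoothing_noise(config.smoothing_std, config.smoothing_clip, random.Random(0))
```

Keeping the domain-specific overrides (maze, D4RL) as explicit constructor arguments, rather than separate constants, makes it easy to see which values differ from the defaults.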