Distributional Successor Features Enable Zero-Shot Policy Optimization

Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta

NeurIPS 2024

Reproducibility Assessment

Each reproducibility variable below is listed with its assessed result, followed by the LLM-extracted evidence or justification from the paper.

Research Type: Experimental
"We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/."

Researcher Affiliation: Academia
Chuning Zhu, University of Washington (EMAIL); Xinqi Wang, University of Washington (EMAIL); Tyler Han, University of Washington (EMAIL); Simon Shaolei Du, University of Washington (EMAIL); Abhishek Gupta, University of Washington (EMAIL)

Pseudocode: Yes
"Appendix F: Algorithm Pseudocode"

Open Source Code: Yes
"Videos and code are available at https://weirdlabuw.github.io/dispo/."

Open Datasets: Yes
"We use the D4RL dataset for pretraining and dense rewards described in Appendix D for adaptation. ... We use the offline dataset from [9] for pretraining and shaped rewards for adaptation. ... D4RL: Datasets for deep data-driven reinforcement learning. https://arxiv.org/abs/2004.07219, 2020."

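For context, D4RL datasets of the kind used for pretraining are typically loaded as flat transition buffers. A minimal loading sketch, assuming the standard d4rl package (the task name is illustrative, not taken from the paper):

```python
# Minimal sketch of loading a D4RL offline dataset (illustrative, not the paper's code).
# Requires: pip install gym d4rl
import gym
import d4rl  # registers D4RL environments with gym

env = gym.make("antmaze-large-diverse-v2")  # hypothetical task choice
dataset = d4rl.qlearning_dataset(env)       # dict of flat transition arrays

# All arrays share the first (transition) dimension.
observations = dataset["observations"]            # (N, obs_dim)
actions = dataset["actions"]                      # (N, act_dim)
rewards = dataset["rewards"]                      # (N,)
next_observations = dataset["next_observations"]  # (N, obs_dim)
terminals = dataset["terminals"]                  # (N,)
```
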
Dataset Splits: No
The paper explicitly mentions using an "offline dataset for pretraining" and adapting to "test-time rewards", but it does not specify a distinct validation set, split ratios, or example counts for hyperparameter tuning or model selection.

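To make the missing detail concrete, an explicit held-out validation split over transitions could look like the following. This is purely hypothetical (the paper reports no such split, and the 95/5 ratio is illustrative); it continues from the D4RL loading sketch above.

```python
# Hypothetical 95/5 train/validation split over transitions (NOT reported in the paper).
import numpy as np

rng = np.random.default_rng(seed=0)
n = len(dataset["observations"])
perm = rng.permutation(n)          # shuffle transition indices
val_size = int(0.05 * n)           # illustrative ratio

val_idx, train_idx = perm[:val_size], perm[val_size:]
train_set = {k: v[train_idx] for k, v in dataset.items()}
val_set = {k: v[val_idx] for k, v in dataset.items()}
```
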
Hardware Specification: Yes
"Each experiment (pretraining + adaptation) takes 3 hours on a single Nvidia L40 GPU."

Software Dependencies: No
The paper mentions using the "AdamW optimizer [35]" and "conditional DDIMs [47]" but does not provide version numbers for programming languages, libraries, or other software dependencies.

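For reference, conditional DDIM sampling of the kind cited is commonly built on an off-the-shelf scheduler. A minimal sketch using HuggingFace diffusers' DDIMScheduler, with a placeholder noise-prediction network standing in for the paper's conditional 1-D UNet (the paper does not state which library it uses):

```python
# Minimal conditional DDIM sampling loop (illustrative; uses HuggingFace diffusers).
import torch
from diffusers import DDIMScheduler

d = 128  # outcome feature size, matching the paper's d = 128

class EpsModel(torch.nn.Module):
    """Placeholder noise predictor; the paper uses a conditional 1-D UNet."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(d, d)
    def forward(self, x, t, cond):
        return self.net(x)  # stand-in; a real model would use t and cond

eps_model = EpsModel()
cond = torch.zeros(16, d)  # placeholder conditioning (e.g., the current state)

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)  # fewer inference steps than training timesteps

sample = torch.randn(16, d)  # start from Gaussian noise
for t in scheduler.timesteps:
    with torch.no_grad():
        eps = eps_model(sample, t, cond)               # predicted noise
    sample = scheduler.step(eps, t, sample).prev_sample  # DDIM update
```
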
Experiment Setup: Yes
"We set d = 128 for all of our experiments. ... The noise prediction network is implemented as a 1-D UNet with down dimensions [256, 512, 1024]. ... We train our models on the offline dataset for 100,000 gradient steps using the AdamW optimizer [35] with batch size 2048. The learning rates for the outcome model and the policy are set to 3e-4 and adjusted according to a cosine learning rate schedule with 500 warmup steps."

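The reported optimization setup maps directly onto standard PyTorch components. A sketch wiring together the stated hyperparameters (AdamW, learning rate 3e-4, batch size 2048, 100,000 gradient steps, cosine schedule with 500 warmup steps); the model and data are placeholders, not the authors' code:

```python
# Sketch of the reported optimization setup (hyperparameters from the paper;
# the model and batches are placeholders).
import math
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the 1-D UNet outcome model

total_steps, warmup_steps = 100_000, 500
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def lr_lambda(step):
    # Linear warmup for 500 steps, then cosine decay over the remaining steps.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    batch = torch.randn(2048, 128)  # placeholder for a batch of size 2048
    loss = ((model(batch) - batch) ** 2).mean()  # placeholder objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```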