Distributional Successor Features Enable Zero-Shot Policy Optimization
Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/. |
| Researcher Affiliation | Academia | Chuning Zhu, Xinqi Wang, Tyler Han, Simon Shaolei Du, and Abhishek Gupta, all University of Washington. |
| Pseudocode | Yes | Appendix F: Algorithm Pseudocode |
| Open Source Code | Yes | Videos and code are available at https://weirdlabuw.github.io/dispo/. |
| Open Datasets | Yes | We use the D4RL dataset for pretraining and dense rewards described in Appendix D for adaptation. ... We use the offline dataset from [9] for pretraining and shaped rewards for adaptation. ... D4RL: Datasets for deep data-driven reinforcement learning. https://arxiv.org/abs/2004.07219, 2020. |
| Dataset Splits | No | The paper explicitly mentions using 'offline dataset for pretraining' and adapting to 'test-time rewards', but it does not specify a distinct validation set or its split ratios/counts for hyperparameter tuning or model selection during its experimental setup. |
| Hardware Specification | Yes | Each experiment (pretraining + adaptation) takes 3 hours on a single Nvidia L40 GPU. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer [35]' and 'conditional DDIMs [47]' but does not provide specific version numbers for programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We set d = 128 for all of our experiments. ... The noise prediction network is implemented as a 1-D UNet with down dimensions [256, 512, 1024]. ... We train our models on the offline dataset for 100,000 gradient steps using the AdamW optimizer [35] with batch size 2048. The learning rates for the outcome model and the policy are set to 3e-4 and adjusted according to a cosine learning rate schedule with 500 warmup steps. |
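The reported schedule (peak learning rate 3e-4, cosine decay over 100,000 gradient steps, 500 warmup steps) can be sketched as a plain function. This is a minimal illustration assuming linear warmup and cosine decay to zero; the function name and exact decay shape are assumptions, not taken from the paper or its released code.

```python
import math

def cosine_lr_with_warmup(step, total_steps=100_000, warmup_steps=500, base_lr=3e-4):
    """Illustrative cosine LR schedule with linear warmup.

    Assumptions (not confirmed by the paper): warmup is linear from 0
    to base_lr, and the cosine decay ends at 0 after total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0
```

For example, the rate reaches its 3e-4 peak at step 500 and decays smoothly to zero by step 100,000.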