Vision-Language Models Provide Promptable Representations for Reinforcement Learning
Authors: William Chen, Oier Mees, Aviral Kumar, Sergey Levine
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. |
| Researcher Affiliation | Academia | William Chen U.C. Berkeley Oier Mees U.C. Berkeley Aviral Kumar Carnegie Mellon University Sergey Levine U.C. Berkeley |
| Pseudocode | Yes | Listing 1: Example policy for PR2L. Listing 2: Example code for extracting promptable representations from a VLM. Listing 3: Example usage of the above function and policy. |
| Open Source Code | No | The paper provides code snippets in Appendix I as examples (Listings 1, 2, 3), but does not explicitly state that the full source code for the methodology is being released, nor does it provide a specific repository link or an explicit code release statement. |
| Open Datasets | Yes | We first conduct experiments in Minecraft...We follow the MineDojo definitions of observation/action spaces and reward function structures for these tasks...We run offline BC and RL experiments in the Habitat household simulator. In contrast to Minecraft, tasks in this domain require connecting naturalistic images with real-world common sense about the structure and contents of typical home environments. Our experiments evaluate (1) whether PR2L confers the generalization properties of VLMs to our policies, (2) whether PR2L-based policies can leverage the semantic reasoning capabilities of the underlying VLM (e.g., via chain-of-thought; Wei et al., 2023), and (3) whether PR2L can learn entirely from stale, offline data sources. We use a Llama2-7B Prismatic VLM for the Habitat experiments (Karamcheti et al., 2024). Habitat provides a standardized train-validation split, consisting of 80 household scenes for training (from which one can run online RL or collect data for offline RL or BC) and 20 novel scenes for validation, thereby testing policies' generalization capabilities. These scenes come from the Habitat-Matterport 3D v1 dataset (Ramakrishnan et al., 2021). We train our policies with behavior cloning (BC) on the Habitat-Web human demonstration dataset of 77k trajectories (12M steps) (Ramrakhya et al., 2022). |
| Dataset Splits | Yes | Habitat provides a standardized train-validation split, consisting of 80 household scenes for training (from which one can run online RL or collect data for offline RL or BC) and 20 novel scenes for validation, thereby testing policies' generalization capabilities. These scenes come from the Habitat-Matterport 3D v1 dataset (Ramakrishnan et al., 2021). [...] In total, our subsampled dataset contains approximately 1.1M steps over 7550 trajectories. [...] We collected expert policy data by training a policy on MineCLIP embeddings to completion on all of our original tasks and saving all transitions to create an offline dataset. |
| Hardware Specification | Yes | Minecraft training runs were run on 16 A5000 GPUs (to accommodate the 16 seeds). All Habitat training was done on an A100 GPU server. Generation of data and evaluations were done on 16 A5000 GPUs for parallelization. |
| Software Dependencies | Yes | For our actual RL algorithm, we use the Stable-Baselines3 (version 2.0.0) implementation of clipping-based PPO (Raffin et al., 2021), with hyperparameters presented in Table 6. |
| Experiment Setup | Yes | For our actual RL algorithm, we use the Stable-Baselines3 (version 2.0.0) implementation of clipping-based PPO (Raffin et al., 2021), with hyperparameters presented in Table 6. Many of these parameters are the same as the ones presented by Fan et al. (2022). [...] We also present the policy and VLM hyperparameters in Table 7. [...] We adopt the same optimizer, scheduler, and associated hyperparameters as Majumdar et al. (2023), but find a learning rate of 1e-4 to be more effective than their 1e-3. [...] For our offline RL experiments in Habitat, we use Conservative Q-Learning (CQL) on top of the Stable-Baselines3 Contrib codebase's implementation of Quantile Regression DQN (QR-DQN) (Kumar et al., 2020; Dabney et al., 2017). We choose to multiply the QR-DQN component of the CQL loss by 0.2. Using the notation proposed by Kumar et al. (2020), this is equivalent to α = 5, which said work also uses. Other hyperparameters are τ = 1, γ = 0.99, fixed learning rate of 1e-4, 100 epochs, and 50 quantiles (no exploration hyperparameters are specified, since we do not generate any new online data). |
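The Pseudocode row notes that Appendix I gives example listings for a PR2L policy and for extracting promptable representations from a VLM, but the full code is not released. As a rough illustration of the idea, the sketch below mocks the pipeline: embed an observation alongside a task-relevant prompt, take the VLM's per-token hidden states (mocked here with random vectors) as the "promptable representation," and feed a pooled version to a policy head. Every name here (`mock_vlm_hidden_states`, `HIDDEN_DIM`, the prompt text) is illustrative, not the paper's actual API.

```python
import numpy as np

HIDDEN_DIM = 64      # mocked VLM hidden size (illustrative, not from the paper)
N_ACTIONS = 8        # e.g., a small discrete action space

rng = np.random.default_rng(0)

def mock_vlm_hidden_states(image: np.ndarray, prompt_tokens: list) -> np.ndarray:
    """Stand-in for a VLM forward pass: one hidden vector per prompt token,
    conditioned (here, only trivially) on the image observation."""
    image_summary = image.mean()  # crude conditioning on the observation
    return rng.standard_normal((len(prompt_tokens), HIDDEN_DIM)) + image_summary

def promptable_representation(image: np.ndarray, prompt_tokens: list) -> np.ndarray:
    """Mean-pool the per-token hidden states into a fixed-size embedding."""
    h = mock_vlm_hidden_states(image, prompt_tokens)
    return h.mean(axis=0)  # shape: (HIDDEN_DIM,)

# A linear policy head over the promptable representation.
W = rng.standard_normal((N_ACTIONS, HIDDEN_DIM)) * 0.01

def policy_logits(image: np.ndarray, prompt_tokens: list) -> np.ndarray:
    return W @ promptable_representation(image, prompt_tokens)

obs = rng.random((3, 224, 224))                       # fake RGB observation
prompt = ["is", "there", "a", "spider", "nearby", "?"]
logits = policy_logits(obs, prompt)
print(logits.shape)  # one logit per action
```

In the actual method the hidden states come from a real VLM conditioned on both the image and a task prompt; this mock only shows the data flow from (observation, prompt) to policy logits.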
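The Experiment Setup row states that the QR-DQN component of the CQL loss is multiplied by 0.2, which in Kumar et al.'s notation is equivalent to α = 5, since penalty + 0.2·td = (5·penalty + td) / 5, i.e., the two parameterizations differ only by an overall constant scale. A minimal numeric sketch of that equivalence, with toy Q-values and a toy TD-loss value standing in for the actual QR-DQN quantile loss:

```python
import numpy as np

def cql_penalty(q_values: np.ndarray, taken_action: int) -> float:
    """Conservative penalty: log-sum-exp over all actions minus the
    Q-value of the action actually taken in the dataset."""
    return float(np.log(np.exp(q_values).sum()) - q_values[taken_action])

q_values = np.array([1.0, 0.5, -0.2, 0.3])  # toy Q(s, .) over 4 actions
taken_action = 0
td_loss = 0.8                                # toy stand-in for the QR-DQN loss

# Reported parameterization: TD component scaled by 0.2.
loss = cql_penalty(q_values, taken_action) + 0.2 * td_loss
# Kumar et al. parameterization with alpha = 5, rescaled by 1/5.
alpha_form = (5.0 * cql_penalty(q_values, taken_action) + td_loss) / 5.0
print(abs(loss - alpha_form) < 1e-12)  # the two forms coincide
```

This only checks the arithmetic identity; it is not an implementation of the paper's CQL-on-QR-DQN training loop.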