Vision-Language Models Provide Promptable Representations for Reinforcement Learning
Authors: William Chen, Oier Mees, Aviral Kumar, Sergey Levine
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. |
| Researcher Affiliation | Academia | William Chen U.C. Berkeley Oier Mees U.C. Berkeley Aviral Kumar Carnegie Mellon University Sergey Levine U.C. Berkeley |
| Pseudocode | Yes | Listing 1: Example policy for PR2L. Listing 2: Example code for extracting promptable representations from a VLM. Listing 3: Example usage of the above function and policy. |
| Open Source Code | No | The paper provides code snippets in Appendix I as examples (Listings 1, 2, 3), but does not explicitly state that the full source code for the methodology is being released, nor does it provide a specific repository link or an explicit code release statement. |
| Open Datasets | Yes | We first conduct experiments in Minecraft...We follow the MineDojo definitions of observation/action spaces and reward function structures for these tasks...We run offline BC and RL experiments in the Habitat household simulator. In contrast to Minecraft, tasks in this domain require connecting naturalistic images with real-world common sense about the structure and contents of typical home environments. Our experiments evaluate (1) whether PR2L confers the generalization properties of VLMs to our policies, (2) whether PR2L-based policies can leverage the semantic reasoning capabilities of the underlying VLM (e.g., via chain-of-thought; Wei et al., 2023), and (3) whether PR2L can learn entirely from stale, offline data sources. We use a Llama2-7B Prismatic VLM for the Habitat experiments (Karamcheti et al., 2024). Habitat provides a standardized train-validation split, consisting of 80 household scenes for training (from which one can run online RL or collect data for offline RL or BC) and 20 novel scenes for validation, thereby testing policies' generalization capabilities. These scenes come from the Habitat-Matterport 3D v1 dataset (Ramakrishnan et al., 2021). We train our policies with behavior cloning (BC) on the Habitat-Web human demonstration dataset of 77k trajectories (12M steps) (Ramrakhya et al., 2022). |
| Dataset Splits | Yes | Habitat provides a standardized train-validation split, consisting of 80 household scenes for training (from which one can run online RL or collect data for offline RL or BC) and 20 novel scenes for validation, thereby testing policies' generalization capabilities. These scenes come from the Habitat-Matterport 3D v1 dataset (Ramakrishnan et al., 2021). [...] In total, our subsampled dataset contains approximately 1.1M steps over 7550 trajectories. [...] We collected expert policy data by training a policy on MineCLIP embeddings to completion on all of our original tasks and saving all transitions to create an offline dataset. |
| Hardware Specification | Yes | Minecraft training runs were run on 16 A5000 GPUs (to accommodate the 16 seeds). All Habitat training was done on an A100 GPU server. Generation of data and evaluations were done on 16 A5000 GPUs for parallelization. |
| Software Dependencies | Yes | For our actual RL algorithm, we use the Stable-Baselines3 (version 2.0.0) implementation of clipping-based PPO (Raffin et al., 2021), with hyperparameters presented in Table 6. |
| Experiment Setup | Yes | For our actual RL algorithm, we use the Stable-Baselines3 (version 2.0.0) implementation of clipping-based PPO (Raffin et al., 2021), with hyperparameters presented in Table 6. Many of these parameters are the same as the ones presented by Fan et al. (2022). [...] We also present the policy and VLM hyperparameters in Table 7. [...] We adopt the same optimizer, scheduler, and associated hyperparameters as Majumdar et al. (2023), but find a learning rate of 1e-4 to be more effective than their 1e-3. [...] For our offline RL experiments in Habitat, we use Conservative Q-Learning (CQL) on top of the Stable-Baselines3 Contrib codebase's implementation of Quantile Regression DQN (QR-DQN) (Kumar et al., 2020; Dabney et al., 2017). We choose to multiply the QR-DQN component of the CQL loss by 0.2. Using the notation proposed by Kumar et al. (2020), this is equivalent to α = 5, which said work also uses. Other hyperparameters are τ = 1, γ = 0.99, fixed learning rate of 1e-4, 100 epochs, and 50 quantiles (no exploration hyperparameters are specified, since we do not generate any new online data). |
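The Pseudocode row notes that Appendix I gives example listings for a PR2L policy and for extracting promptable representations from a VLM, but the full code is not released. As a rough illustration of the idea, the sketch below mocks the pipeline: embed an observation alongside a task-relevant prompt, take the VLM's per-token hidden states (mocked here with random vectors) as the "promptable representation," and feed a pooled version to a policy head. Every name here (`mock_vlm_hidden_states`, `HIDDEN_DIM`, the prompt text) is illustrative, not the paper's actual API.

```python
import numpy as np

HIDDEN_DIM = 64      # mocked VLM hidden size (illustrative, not from the paper)
N_ACTIONS = 8        # e.g., a small discrete action space

rng = np.random.default_rng(0)

def mock_vlm_hidden_states(image: np.ndarray, prompt_tokens: list) -> np.ndarray:
    """Stand-in for a VLM forward pass: one hidden vector per prompt token,
    conditioned (here, only trivially) on the image observation."""
    image_summary = image.mean()  # crude conditioning on the observation
    return rng.standard_normal((len(prompt_tokens), HIDDEN_DIM)) + image_summary

def promptable_representation(image: np.ndarray, prompt_tokens: list) -> np.ndarray:
    """Mean-pool the per-token hidden states into a fixed-size embedding."""
    h = mock_vlm_hidden_states(image, prompt_tokens)
    return h.mean(axis=0)  # shape: (HIDDEN_DIM,)

# A linear policy head over the promptable representation.
W = rng.standard_normal((N_ACTIONS, HIDDEN_DIM)) * 0.01

def policy_logits(image: np.ndarray, prompt_tokens: list) -> np.ndarray:
    return W @ promptable_representation(image, prompt_tokens)

obs = rng.random((3, 224, 224))                       # fake RGB observation
prompt = ["is", "there", "a", "spider", "nearby", "?"]
logits = policy_logits(obs, prompt)
print(logits.shape)  # one logit per action
```

In the actual method the hidden states come from a real VLM conditioned on both the image and a task prompt; this mock only shows the data flow from (observation, prompt) to policy logits.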
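The Experiment Setup row states that the QR-DQN component of the CQL loss is multiplied by 0.2, which in Kumar et al.'s notation is equivalent to α = 5, since penalty + 0.2·td = (5·penalty + td) / 5, i.e., the two parameterizations differ only by an overall constant scale. A minimal numeric sketch of that equivalence, with toy Q-values and a toy TD-loss value standing in for the actual QR-DQN quantile loss:

```python
import numpy as np

def cql_penalty(q_values: np.ndarray, taken_action: int) -> float:
    """Conservative penalty: log-sum-exp over all actions minus the
    Q-value of the action actually taken in the dataset."""
    return float(np.log(np.exp(q_values).sum()) - q_values[taken_action])

q_values = np.array([1.0, 0.5, -0.2, 0.3])  # toy Q(s, .) over 4 actions
taken_action = 0
td_loss = 0.8                                # toy stand-in for the QR-DQN loss

# Reported parameterization: TD component scaled by 0.2.
loss = cql_penalty(q_values, taken_action) + 0.2 * td_loss
# Kumar et al. parameterization with alpha = 5, rescaled by 1/5.
alpha_form = (5.0 * cql_penalty(q_values, taken_action) + td_loss) / 5.0
print(abs(loss - alpha_form) < 1e-12)  # the two forms coincide
```

This only checks the arithmetic identity; it is not an implementation of the paper's CQL-on-QR-DQN training loop.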