Vision Language Models are In-Context Value Learners

Authors: Yecheng Jason Ma, Joey Hejna, Ayzaan Wahid, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Jonathan Tompson, Osbert Bastani, Dinesh Jayaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, Fei Xia

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct large-scale experiments assessing GVL's value prediction generalization and in-context learning capabilities. Specifically, we study the following questions: 1. Can GVL produce zero-shot value predictions for a broad range of tasks and embodiments? 2. Can GVL improve from in-context learning? 3. Can GVL be used for other downstream robot learning applications? In all our experiments, we use Gemini-1.5-Pro (Gemini Team et al., 2024) as the backbone VLM for GVL; we ablate this model choice and find GVL effective with other VLMs as well."
Researcher Affiliation | Collaboration | Yecheng Jason Ma1,2, Joey Hejna1,3, Ayzaan Wahid1, Chuyuan Fu1, Dhruv Shah1, Jacky Liang1, Zhuo Xu1, Sean Kirmani1, Peng Xu1, Danny Driess1, Ted Xiao1, Jonathan Tompson1, Osbert Bastani2, Dinesh Jayaraman2, Wenhao Yu1, Tingnan Zhang1, Dorsa Sadigh1, Fei Xia1. 1Google DeepMind, 2University of Pennsylvania, 3Stanford University
Pseudocode | No | The paper describes the Generative Value Learning (GVL) method in Section 3 and its components (autoregressive value prediction, input observation shuffling, in-context value learning) using textual descriptions and mathematical equations (Eq. 1, 2, 3, 4, 5, 6), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Website and Interactive Demo: http://generative-value-learning.github.io (this link leads to a project demonstration page, not an explicit source code repository or an affirmative statement about code release for the methodology described in the paper).
Open Datasets | Yes | "First, we consider the Open X-Embodiment (OXE) dataset (Padalkar et al., 2023), an aggregation of trajectory data from 50 standalone academic robot datasets... To further stress test GVL, we evaluate on a new diverse dataset of 250 distinct household tabletop tasks on the bi-manual ALOHA systems (Zhao et al., 2023; Team et al., 2024a)."
Dataset Splits | Yes | "For each of the 50 datasets, we randomly sample 20 trajectories and evaluate GVL zero-shot on each of the sampled trajectories" and "for each task, we construct a mixed quality dataset by rolling out a pre-trained policy of roughly 50% success rate for 1000 episodes".
Hardware Specification | No | The paper mentions evaluating on "bi-manual ALOHA systems" and using "Gemini-1.5-Pro" as the VLM backbone, but it does not provide specific details about the computational hardware (e.g., GPU/CPU models, memory, or TPU versions) used to run the experiments or train the models.
Software Dependencies | No | The paper mentions using Gemini-1.5-Pro and GPT-4o as backbone VLMs, and methods like Action Chunking Transformer (ACT) and Diffusion Policy (DP), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or other libraries).
Experiment Setup | No | The paper states that "ACT hyperparameters are tuned for ACT on the success-only subset and are fixed for all methods" and mentions using a "VOC threshold of 0.5" for GVL-SD, as well as a data mixture comprised of 60% demonstrations from DROID... and 40% in-domain demonstrations (15 for each task) for policy learning. However, it does not provide specific hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings for the models trained or evaluated.
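Although the paper provides no pseudocode, its description of GVL (per-frame value prediction with input observation shuffling) can be sketched in a few lines. The sketch below is an illustration, not the authors' implementation: `query_vlm` is a hypothetical callable standing in for a Gemini-1.5-Pro API call, and the 0-100 completion scale and prompt wording are assumptions.

```python
import random

def gvl_predict_values(frames, task_description, query_vlm, seed=0):
    """Sketch of GVL-style value prediction: shuffle input frames to break
    their temporal correlation, query a VLM for a per-frame task-completion
    estimate (0-100), then restore chronological order.

    `query_vlm` is a hypothetical callable: (prompt, images) -> list[float],
    returning one value per image in the order given.
    """
    rng = random.Random(seed)
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]

    prompt = (
        f"Task: {task_description}\n"
        "For each image, estimate task completion from 0 to 100. "
        "The images are shuffled; judge each one on its own."
    )
    shuffled_values = query_vlm(prompt, shuffled)

    # Map each prediction back to its frame's chronological position.
    values = [0.0] * len(frames)
    for pred, original_idx in zip(shuffled_values, order):
        values[original_idx] = pred
    return values
```

In-context value learning, as described in the paper, would amount to prepending example (frame, value) pairs to the prompt before querying.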
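The "VOC threshold of 0.5" used for GVL-SD refers to Value-Order Correlation: a rank correlation between the predicted per-frame values and chronological frame order, used to filter episodes for policy training. A minimal pure-Python sketch, assuming VOC is computed as Spearman's rank correlation against the frame indices (the exact correlation variant is an assumption, not confirmed by the excerpt above):

```python
def _ranks(xs):
    """Average ranks (ties share their mean rank), 1-indexed."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of positions i..j, 1-indexed
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def voc(predicted_values):
    """Spearman rank correlation between predicted values and time order."""
    n = len(predicted_values)
    time_ranks = [float(t + 1) for t in range(n)]
    value_ranks = _ranks(predicted_values)
    mean = (n + 1) / 2.0  # mean rank is the same for both sequences
    num = sum((a - mean) * (b - mean) for a, b in zip(value_ranks, time_ranks))
    den_a = sum((a - mean) ** 2 for a in value_ranks) ** 0.5
    den_b = sum((b - mean) ** 2 for b in time_ranks) ** 0.5
    return num / (den_a * den_b) if den_a and den_b else 0.0

def filter_episodes(episode_values, threshold=0.5):
    """Keep episodes whose VOC exceeds the threshold (GVL-SD-style filtering)."""
    return [ep for ep in episode_values if voc(ep) > threshold]
```

Monotonically increasing predictions yield VOC = 1.0, decreasing ones yield -1.0, so a 0.5 threshold keeps episodes whose predicted values largely agree with the true temporal progression.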