Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Thoughts and Lessons on Using Visual Foundation Models for Manipulation

Authors: Ryan Chen, Ziteng Pang, Bradly C. Stadie

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type | Experimental | To address this gap, we systematically evaluate vision foundation models to understand what makes them effective for offline robotic learning. We find that across eleven diverse vision encoders, a representation's ability to reconstruct edges and predict keypoints strongly correlates with its performance on manipulation tasks. Extensive correlation analysis across 21 manipulation tasks consistently shows that representations preserving edge and keypoint information achieve the highest environment success rates.
Researcher Affiliation | Academia | Ryan Chen (EMAIL), Department of Statistics and Data Science, Northwestern University; Ziteng Pang (EMAIL), Department of Statistics and Data Science, Northwestern University; Bradly C. Stadie (EMAIL), Department of Statistics and Data Science, Northwestern University
Pseudocode | Yes | Algorithm 1: Determining modality with continuous expert state-actions
Open Source Code | No | The paper does not explicitly state that the source code for the methodology is available, nor does it provide a direct link to a code repository. The OpenReview link is for peer review, not code.
Open Datasets | Yes | With these criteria in mind, we arrive at 21 robotic manipulation tasks from the Fetch Suite (Plappert et al., 2018), Adroit Hand Suite (Rajeswaran et al., 2017), and Metaworld Suite (Yu et al., 2020).
Dataset Splits | No | The paper mentions training policies using "2000 expert trajectories per environment" and that "For each scene, the linear probes learned the positions and rotations of the arm joints..." but does not specify how these trajectories or scenes are split into training, validation, and test sets with explicit percentages, counts, or references to predefined splits.
Hardware Specification | Yes | All policies can be trained on single A10 machines.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for its own implementation (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | For each policy, we pass both the current scene and goal state images through the encoder. The resulting embeddings are concatenated and processed through a three-layer MLP with hidden layers [256, 128, 64] to predict a four-dimensional action vector. We trained policies using 2000 expert trajectories per environment. Additional training details can be found in Appendix B.

Table 4: IQL and Behavior Cloning make the same number of gradient steps on the same size of minibatch.

Hyperparameter | IQL | BC
Batch Size | 256 | 256
Training Size | 2000 | 2000
Epochs | 100 | 100
Minibatches | 100 | 100
Gradient Steps | 10000 | 10000
Actor LR | 0.0008 | 0.00015
Q LR | 0.0003 | n/a
V LR | 0.0003 | n/a
Expectile τ | 0.7 | n/a
β | 3 | n/a
Polyak τ | 0.05 | n/a
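The policy head described in the Experiment Setup row (concatenated scene/goal embeddings, a three-layer MLP with hidden sizes [256, 128, 64], and a four-dimensional action output) can be sketched as a plain forward pass. This is a minimal NumPy illustration, not the authors' implementation: the 512-dimensional encoder output and the ReLU activation are assumptions, as the paper excerpt does not state them.

```python
import numpy as np

def mlp_policy(scene_emb, goal_emb, weights):
    """Sketch of the policy head: concat(scene, goal) embeddings ->
    hidden layers [256, 128, 64] -> 4-dim action vector.
    ReLU hidden activations are an assumption, not from the paper."""
    x = np.concatenate([scene_emb, goal_emb])
    for W, b in weights[:-1]:
        x = np.maximum(0.0, x @ W + b)  # hidden layer with assumed ReLU
    W, b = weights[-1]
    return x @ W + b  # linear output: four-dimensional action

# Assumed 512-dim encoder embeddings -> 1024-dim concatenated input.
rng = np.random.default_rng(0)
dims = [1024, 256, 128, 64, 4]
weights = [(rng.standard_normal((d_in, d_out)) * 0.01, np.zeros(d_out))
           for d_in, d_out in zip(dims[:-1], dims[1:])]

action = mlp_policy(rng.standard_normal(512), rng.standard_normal(512), weights)
# action has shape (4,), matching the four-dimensional action vector
```

In training, these weights would be fit by behavior cloning or IQL with the hyperparameters in Table 4; the sketch only shows the data flow through the architecture.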