Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Authors: Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments in both simulation and on a real robot validate the superiority of the proposed representation. Empirical results across four simulation domains with 20 robotic manipulation tasks demonstrate that MCR outperforms the strongest baseline by 14.8%. Additionally, MCR significantly boosts the success rate in three real-world manipulation tasks by 76.9%.
Researcher Affiliation | Academia | University of California, San Diego; Tongji University; Shanghai Jiao Tong University; University of Maryland, College Park; Tsinghua University
Pseudocode | No | The paper describes the proposed method MCR, including objective functions (Equations 1-4), illustrations (Figure 5), and PyTorch-like implementation details in Appendix C.4, but does not include a clearly labeled pseudocode or algorithm block for the overall method.
Open Source Code | Yes | Project website: robots-pretrain-robots.github.io.
Open Datasets | Yes | Specifically, we pre-train a visual encoder on the DROID (Khazatsky et al., 2024) robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. ... We select a total of 20 tasks across 4 simulation environments: Robomimic (Mandlekar et al., 2021), RoboCasa (Nasiriany et al., 2024), Meta-World (Yu et al., 2019), and DexArt (Bao et al., 2023).
Dataset Splits | No | For the DROID dataset used for pre-training, the paper states "After processing, we retain 36k trajectories for pre-training," without specifying any train/validation/test splits. For downstream tasks, it specifies the number of demonstrations used for training (e.g., 200 demonstrations for Robomimic, 50 for RoboCasa) and mentions evaluating success rate over "at least 20 episodes," but does not provide explicit numerical splits of these demonstration datasets into training, validation, and test sets.
Hardware Specification | Yes | The whole training process takes 50 hours on a single NVIDIA 3090.
Software Dependencies | No | The paper mentions "Our codebase is built upon the implementation of R3M" and "We utilize the PyTorch-Grad-CAM library to generate Grad-CAM figures," but does not provide specific version numbers for PyTorch, PyTorch-Grad-CAM, or other software libraries used.
Experiment Setup | Yes | Pre-training hyperparameters are listed in Table 6: encoder ResNet50, batch size 32, learning rate 1e-4, 500,000 training steps, Adam optimizer. Downstream policy learning settings are introduced in Section C.5: "The downstream BC policy is a three-layer MLP with ReLU activations and hidden sizes of 256. ... trained with a mean squared error (MSE) loss, a learning rate of 0.001, and a batch size of 256."
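The downstream BC policy described in the Experiment Setup row (a three-layer MLP with 256-unit hidden layers, ReLU activations, and an MSE loss) can be sketched in plain Python. This is a minimal illustrative sketch, not the authors' implementation: the 2048-dimensional input (a typical ResNet50 feature size) and the 7-dimensional action output are assumptions, and only the forward pass and loss are shown, not the training loop.

```python
import random

def linear(x, W, b):
    # y = W @ x + b, with W as a list of rows
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    # He initialization, standard for ReLU networks
    scale = (2.0 / n_in) ** 0.5
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

class BCPolicy:
    """Three-layer MLP: obs_dim -> 256 -> 256 -> action_dim, ReLU between layers.

    Interprets "three-layer MLP with hidden sizes of 256" as two 256-unit
    hidden layers plus a linear output layer (an assumption, not stated
    explicitly in the report).
    """

    def __init__(self, obs_dim, action_dim, hidden=256, seed=0):
        rng = random.Random(seed)
        self.layers = [
            init_layer(obs_dim, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, action_dim, rng),
        ]

    def __call__(self, obs):
        x = obs
        for i, (W, b) in enumerate(self.layers):
            x = linear(x, W, b)
            if i < len(self.layers) - 1:  # no activation on the output layer
                x = relu(x)
        return x

def mse_loss(pred, target):
    # Behavior cloning regresses predicted actions onto demonstrated actions
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Illustrative dimensions: 2048-dim visual features in, 7-dim action out
policy = BCPolicy(obs_dim=2048, action_dim=7)
action = policy([0.1] * 2048)
```

In the paper's setting this network would sit on top of the frozen (or fine-tuned) MCR visual encoder and be trained with Adam-style updates at a learning rate of 0.001 and batch size 256, as quoted above.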