Learning View-invariant World Models for Visual Robotic Manipulation

Authors: Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, Yang Yu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of ReViWo in various viewpoint-disturbance scenarios, including control under novel camera positions and frequent camera shaking, using the Meta-world and Panda-gym environments. We also conduct experiments on a real-world ALOHA robot. The results demonstrate that ReViWo maintains robust performance under viewpoint disturbance, while baseline methods suffer significant performance degradation. Furthermore, we show that the VIR captures task-relevant state information and remains stable for observations from novel viewpoints, validating the efficacy of the ReViWo approach.
Researcher Affiliation | Collaboration | 1 National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; 2 RIKEN Center for Advanced Intelligence Project, Japan; 3 Polixir.ai, China; 4 The University of Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Representation learning for View-invariant World model (ReViWo)
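Algorithm 1 itself is not reproduced in this report. As a minimal sketch of the cross-view reconstruction objective that view-invariant representation learning of this kind typically optimizes, the toy code below factors an observation into a view-invariant latent and a view-dependent latent, then reconstructs one view from another view's invariant latent. All dimensions and the linear "encoder"/"decoder" stand-ins are illustrative assumptions, not the paper's architecture:

```python
import random

random.seed(0)
OBS_DIM, Z_INV, Z_VIEW = 12, 8, 4  # illustrative sizes, not from the paper

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Toy linear stand-ins for the encoder/decoder networks.
E_inv = rand_matrix(Z_INV, OBS_DIM)       # obs -> view-invariant latent
E_view = rand_matrix(Z_VIEW, OBS_DIM)     # obs -> view-dependent latent
D = rand_matrix(OBS_DIM, Z_INV + Z_VIEW)  # latents -> reconstructed obs

def encode(obs):
    return matvec(E_inv, obs), matvec(E_view, obs)

def decode(z_inv, z_view):
    return matvec(D, z_inv + z_view)  # list concatenation of the two latents

def cross_view_loss(obs_a, obs_b):
    """Reconstruct view B from A's invariant latent and B's view latent."""
    z_inv_a, _ = encode(obs_a)
    _, z_view_b = encode(obs_b)
    recon_b = decode(z_inv_a, z_view_b)
    return sum((r - x) ** 2 for r, x in zip(recon_b, obs_b)) / OBS_DIM

# Two synthetic "views" of the same underlying state.
obs_a = [random.random() for _ in range(OBS_DIM)]
obs_b = [random.random() for _ in range(OBS_DIM)]
loss = cross_view_loss(obs_a, obs_b)
```

Minimizing this loss over many view pairs pressures the invariant latent to carry only view-independent (task-relevant) state, since it must support reconstruction across camera poses.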
Open Source Code | No | The paper mentions using "OfflineRL-Kit (Sun, 2023)" but does not provide a direct link or an explicit statement that the authors' own code is open-source or available.
Open Datasets | Yes | Meanwhile, ReViWo is simultaneously trained on the Open X-Embodiment dataset without view labels. We conduct experiments on two robotic manipulation environments: Meta-world (Yu et al., 2019) and Panda-gym (Gallouédec et al., 2021). Integration of Open X-Embodiment data without view labels: in addition to the data with view labels, we also involve multi-view data without view labels from the Open X-Embodiment dataset (O'Neill et al., 2024), which are readily available on the internet.
Dataset Splits | No | The paper describes the data collection process for training the autoencoder and the offline control data, as well as the evaluation scenarios (e.g., various azimuth offsets, camera shaking). However, it does not provide explicit training/validation/test splits (e.g., percentages or absolute counts needed for reproduction); instead, it describes training on the collected data and evaluating under different disturbance conditions.
Hardware Specification | Yes | We use 64 CPU cores (AMD EPYC 9654 @ 2.4 GHz) and 4 GPUs (NVIDIA GeForce RTX 4090) for our experiments.
Software Dependencies | Yes | The software stack employed for our experiments includes Python 3.11 and PyTorch 2.1.0.
Experiment Setup | Yes | The hyper-parameters for implementing ReViWo are presented in Table 4. For all methods, the model is trained with an offline RL algorithm for 25,000 gradient steps and evaluated over 40 episodes.
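The stated protocol (25,000 gradient steps of offline RL, then 40 evaluation episodes) can be sketched as the skeleton below; the policy, dataset, and rollout function are dummy placeholders standing in for the paper's actual components, not its implementation:

```python
GRAD_STEPS = 25_000   # offline RL gradient steps (from the paper's setup)
EVAL_EPISODES = 40    # evaluation episodes per method (from the paper's setup)

def train_offline(policy, dataset, steps):
    """Placeholder offline-RL loop; a real run would apply one gradient
    update per step (e.g., via an offline RL library such as OfflineRL-Kit)."""
    for _ in range(steps):
        pass  # policy.update(sample(dataset)) in a real implementation
    return policy

def evaluate(policy, run_episode, episodes):
    """Report the success rate over a fixed number of rollouts."""
    successes = sum(run_episode(policy) for _ in range(episodes))
    return successes / episodes

# Dummy stand-ins so the protocol's shape is runnable end to end.
policy = train_offline(policy=object(), dataset=[], steps=GRAD_STEPS)
success_rate = evaluate(policy, run_episode=lambda p: 1, episodes=EVAL_EPISODES)
```

Fixing both the gradient-step budget and the episode count across all methods keeps the comparison fair: every baseline receives the same amount of training and the same evaluation sample size.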