SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

Authors: Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios.
Researcher Affiliation | Collaboration | 1USTC, 2Shanghai AI Lab, 3ZJU, 4Tongji, 5NJU; Corresponding Author
Pseudocode | No | The paper describes methods textually and provides a pipeline overview figure (Figure 2), but it does not include explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | Our strongest model takes more than 6000 GPU hours to train and we are committed to open-sourcing all code and model weights to foster future research in embodied representation learning. Project Page: https://haoyizhu.github.io/spa/.
Open Datasets | Yes | We collect several multi-view datasets. ... The datasets investigated are listed in the first column of Tab. 1. ... The datasets used for the final version include ScanNet, ScanNet++, ADT, S3DIS, Hypersim, and Droid.
Dataset Splits | Yes | Each task includes 100 training demonstrations and 25 testing rollouts. For each group, we train a language-conditioned multi-task agent. We employ RVT-2 (Goyal et al., 2024), the state-of-the-art (SOTA) method on this benchmark, as our policy. ... We train a language-conditioned multi-task policy for each suite, adopting the transformer policy provided by LIBERO. The image encoders are modified from default CNNs to frozen pre-trained ViTs, utilizing the [CLS] token for feature extraction. To expedite policy training, we use only 20 demonstrations per task and forgo augmentations, allowing for pre-extraction of all image features during training. After training for 25 epochs, the checkpoints from the 20th and 25th epochs are evaluated with 20 rollouts per task, and the best checkpoint's performance is taken. Finally, the results are averaged over 3 random seeds.
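The quoted LIBERO evaluation protocol can be sketched in plain Python. This is a minimal illustration only: the numbers (checkpoints at epochs 20 and 25, 20 rollouts per task, 3 seeds) come from the quoted setup, while `run_rollout` and the task list are hypothetical placeholders, not the paper's code.

```python
# Sketch of the evaluation protocol: per seed, evaluate the epoch-20 and
# epoch-25 checkpoints with 20 rollouts per task, keep the better one,
# then average across 3 random seeds.

def success_rate(rollouts):
    """Fraction of successful rollouts (each rollout is True/False)."""
    return sum(rollouts) / len(rollouts)

def evaluate_seed(run_rollout, tasks, checkpoints=(20, 25), n_rollouts=20):
    """For one seed: score each checkpoint over all tasks and keep the
    best checkpoint's mean success rate. `run_rollout` is hypothetical."""
    scores = []
    for epoch in checkpoints:
        per_task = [
            success_rate([run_rollout(epoch, task, i) for i in range(n_rollouts)])
            for task in tasks
        ]
        scores.append(sum(per_task) / len(per_task))
    return max(scores)

def final_score(per_seed_scores):
    """Average the best-checkpoint scores across random seeds."""
    return sum(per_seed_scores) / len(per_seed_scores)
```

Taking the better of the two late checkpoints before averaging over seeds is what makes the reported number a "best checkpoint" result rather than a last-epoch result.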
Hardware Specification | Yes | We utilize 80 NVIDIA A100-SXM4-80GB GPUs, each with a batch size of 2, and accumulate gradients over 8 batches, resulting in a total effective batch size of 2 × 8 × 80 = 1280. Training is conducted over 2000 epochs, sampling each dataset to match the size of ADT per epoch.
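The effective-batch-size arithmetic quoted above follows the standard pattern for distributed training with gradient accumulation; a one-line helper makes the relationship explicit (illustrative only, not the paper's code):

```python
def effective_batch_size(per_gpu_batch, accum_steps, n_gpus):
    """Samples contributing to each optimizer step:
    per-GPU batch size x gradient-accumulation steps x number of GPUs."""
    return per_gpu_batch * accum_steps * n_gpus

# 2 per GPU x 8 accumulation steps x 80 GPUs = 1280, as quoted.
```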
Software Dependencies | No | The paper mentions using the AdamW optimizer and One Cycle learning rate scheduler, but does not specify version numbers for general software libraries or frameworks such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use a mask ratio of 0.5 and enable all three rendering losses. Following Ponder (Huang et al., 2023), we set the weight for the RGB loss to 10, the weights for the depth and semantic losses to 1, and use λ_eikonal = 0.01, λ_sdf = 10, and λ_free = 1. The volume size is 128 × 128 × 32. For stable training, we apply the Exponential Moving Average (EMA) technique with a decay of 0.999. We use AdamW (Loshchilov et al., 2017) as the optimizer with a weight decay of 0.04 and a learning rate of 8e−4. The One Cycle (Smith & Topin, 2019) learning rate scheduler is adopted.
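The loss weighting and EMA update described in the quoted setup can be sketched in plain Python. The weights match the quoted values; the loss-term names and scalar stand-ins are illustrative assumptions, since the actual rendering losses operate on tensors and are not reproduced here.

```python
# Weighted total loss with the quoted coefficients, plus the EMA parameter
# update with decay 0.999. Scalars stand in for the paper's loss tensors.

LOSS_WEIGHTS = {
    "rgb": 10.0,      # rendering RGB loss weight
    "depth": 1.0,     # rendering depth loss weight
    "semantic": 1.0,  # rendering semantic loss weight
    "eikonal": 0.01,  # lambda_eikonal
    "sdf": 10.0,      # lambda_sdf
    "free": 1.0,      # lambda_free
}

def total_loss(losses):
    """Weighted sum of the individual loss terms (name -> scalar value)."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

def ema_update(ema_params, params, decay=0.999):
    """One Exponential Moving Average step over a dict of parameters:
    ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in ema_params}
```

With all six terms equal to 1, the total is simply the sum of the weights (10 + 1 + 1 + 0.01 + 10 + 1 = 23.01), which makes the relative emphasis on the RGB and SDF terms easy to read off.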