SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

Authors: Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios.
Researcher Affiliation | Collaboration | 1USTC, 2Shanghai AI Lab, 3ZJU, 4Tongji, 5NJU; Corresponding Author
Pseudocode | No | The paper describes methods textually and provides a pipeline overview figure (Figure 2), but it does not include explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | Our strongest model takes more than 6000 GPU hours to train and we are committed to open-sourcing all code and model weights to foster future research in embodied representation learning. Project Page: https://haoyizhu.github.io/spa/.
Open Datasets | Yes | We collect several multi-view datasets. ... The datasets investigated are listed in the first column of Tab. 1. ... The datasets used for the final version include ScanNet, ScanNet++, ADT, S3DIS, Hypersim, and Droid.
Dataset Splits | Yes | Each task includes 100 training demonstrations and 25 testing rollouts. For each group, we train a language-conditioned multi-task agent. We employ RVT-2 (Goyal et al., 2024), the state-of-the-art (SOTA) method on this benchmark, as our policy. ... We train a language-conditioned multi-task policy for each suite, adopting the transformer policy provided by LIBERO. The image encoders are modified from default CNNs to frozen pre-trained ViTs, utilizing the [CLS] token for feature extraction. To expedite policy training, we use only 20 demonstrations per task and forgo augmentations, allowing for pre-extraction of all image features during training. After training for 25 epochs, the checkpoints from the 20th and 25th epochs are evaluated with 20 rollouts per task, and the best checkpoint's performance is taken. Finally, the results are averaged over 3 random seeds.
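The quoted LIBERO evaluation protocol can be sketched in plain Python. This is a minimal illustration only: the numbers (checkpoints at epochs 20 and 25, 20 rollouts per task, 3 seeds) come from the quoted setup, while `run_rollout` and the task list are hypothetical placeholders, not the paper's code.

```python
# Sketch of the evaluation protocol: per seed, evaluate the epoch-20 and
# epoch-25 checkpoints with 20 rollouts per task, keep the better one,
# then average across 3 random seeds.

def success_rate(rollouts):
    """Fraction of successful rollouts (each rollout is True/False)."""
    return sum(rollouts) / len(rollouts)

def evaluate_seed(run_rollout, tasks, checkpoints=(20, 25), n_rollouts=20):
    """For one seed: score each checkpoint over all tasks and keep the
    best checkpoint's mean success rate. `run_rollout` is hypothetical."""
    scores = []
    for epoch in checkpoints:
        per_task = [
            success_rate([run_rollout(epoch, task, i) for i in range(n_rollouts)])
            for task in tasks
        ]
        scores.append(sum(per_task) / len(per_task))
    return max(scores)

def final_score(per_seed_scores):
    """Average the best-checkpoint scores across random seeds."""
    return sum(per_seed_scores) / len(per_seed_scores)
```

Taking the better of the two late checkpoints before averaging over seeds is what makes the reported number a "best checkpoint" result rather than a last-epoch result.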
Hardware Specification | Yes | We utilize 80 NVIDIA A100-SXM4-80GB GPUs, each with a batch size of 2, and accumulate gradients over 8 batches, resulting in a total effective batch size of 2 × 8 × 80 = 1280. Training is conducted over 2000 epochs, sampling each dataset to match the size of ADT per epoch.
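The effective-batch-size arithmetic quoted above follows the standard pattern for distributed training with gradient accumulation; a one-line helper makes the relationship explicit (illustrative only, not the paper's code):

```python
def effective_batch_size(per_gpu_batch, accum_steps, n_gpus):
    """Samples contributing to each optimizer step:
    per-GPU batch size x gradient-accumulation steps x number of GPUs."""
    return per_gpu_batch * accum_steps * n_gpus

# 2 per GPU x 8 accumulation steps x 80 GPUs = 1280, as quoted.
```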
Software Dependencies | No | The paper mentions using the AdamW optimizer and One Cycle learning rate scheduler, but does not specify version numbers for general software libraries or frameworks such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use a mask ratio of 0.5 and enable all three rendering losses. Following Ponder (Huang et al., 2023), we set the weight for the RGB loss to 10, the weights for the depth and semantic losses to 1, and use λ_eikonal = 0.01, λ_sdf = 10, and λ_free = 1. The volume size is 128 × 128 × 32. For stable training, we apply the Exponential Moving Average (EMA) technique with a decay of 0.999. We use AdamW (Loshchilov et al., 2017) as the optimizer with a weight decay of 0.04 and a learning rate of 8e−4. The One Cycle (Smith & Topin, 2019) learning rate scheduler is adopted.
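The loss weighting and EMA update described in the quoted setup can be sketched in plain Python. The weights match the quoted values; the loss-term names and scalar stand-ins are illustrative assumptions, since the actual rendering losses operate on tensors and are not reproduced here.

```python
# Weighted total loss with the quoted coefficients, plus the EMA parameter
# update with decay 0.999. Scalars stand in for the paper's loss tensors.

LOSS_WEIGHTS = {
    "rgb": 10.0,      # rendering RGB loss weight
    "depth": 1.0,     # rendering depth loss weight
    "semantic": 1.0,  # rendering semantic loss weight
    "eikonal": 0.01,  # lambda_eikonal
    "sdf": 10.0,      # lambda_sdf
    "free": 1.0,      # lambda_free
}

def total_loss(losses):
    """Weighted sum of the individual loss terms (name -> scalar value)."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

def ema_update(ema_params, params, decay=0.999):
    """One Exponential Moving Average step over a dict of parameters:
    ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in ema_params}
```

With all six terms equal to 1, the total is simply the sum of the weights (10 + 1 + 1 + 0.01 + 10 + 1 = 23.01), which makes the relative emphasis on the RGB and SDF terms easy to read off.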