3D-SPATIAL MULTIMODAL MEMORY
Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. We report the main quantitative results in Tab. 1, where the average training time and the auxiliary low-level metrics are reported. The downstream evaluation results of grounding and retrieval are shown in Tab. 2. Tab. 3 shows the ablation of the number of foundation models involved in M3. |
| Researcher Affiliation | Collaboration | Xueyan Zou1, Yuchen Song1, Ri-Zhao Qiu1, Xuanbin Peng1, Jianglong Ye1, Sifei Liu2, Xiaolong Wang1,2 (1UC San Diego, 2NVIDIA) |
| Pseudocode | Yes | Algorithm 1 Raw Feature (R) Similarity Reduction Algorithm |
| Open Source Code | No | The paper provides a project website link (https://m3-spatial-memory.github.io) which is a high-level overview page, not a direct link to a code repository. There is no explicit statement confirming the release of the code for the methodology described. |
| Open Datasets | Yes | To support extensive quantitative and qualitative evaluation, we perform experiments using several existing scene datasets [3; 18; 10] and collected a custom robot dataset (M3-Robot) using a quadruped robot and a drone. Specifically, we use Garden (an outdoor scene) from Mip-NeRF360 [3], Train from the Tanks & Temples dataset [18], and Playroom as well as Dr Johnson from the Deep Blending dataset [10]. |
| Dataset Splits | No | The paper states: "We evaluate all the images in the validation sets of the three datasets." However, it does not provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or references to predefined splits) for reproducibility. |
| Hardware Specification | No | No specific hardware details (such as exact GPU/CPU models, memory specifications, or detailed computer configurations) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library versions like PyTorch 1.9, specific compilers, or operating system versions) are mentioned in the paper. |
| Experiment Setup | Yes | For fair comparisons, we train all the methods for approximately 30,000 iterations (29,993 iterations for M3 due to last-batch data loader roundoffs). ... To compensate, we use a point-based loss, where we sample 2,000 points from both the predicted and ground-truth features for distance loss computation. In Tab. 4, we ablate the computation budget for training M3, balancing memory footprint, training iterations, and performance. |
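
The point-based loss quoted in the Experiment Setup row can be sketched as follows. This is a hypothetical reconstruction, not the authors' released code: the function name, the L1 distance, and the assumption that predicted and ground-truth feature maps share the same `(H, W, C)` layout and sample indices are all illustrative choices.

```python
import numpy as np

def point_sampled_feature_loss(pred, gt, num_points=2000, rng=None):
    """Hedged sketch of a point-based feature distance loss: sample the
    same random pixel locations from the predicted and ground-truth
    feature maps and compare them with a mean L1 distance.

    pred, gt: (H, W, C) feature maps for one rendered view.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = pred.shape
    idx = rng.integers(0, h * w, size=num_points)   # shared sample indices
    pred_pts = pred.reshape(h * w, c)[idx]          # (num_points, C)
    gt_pts = gt.reshape(h * w, c)[idx]
    return float(np.abs(pred_pts - gt_pts).mean())  # mean per-point L1
```

Sampling a fixed number of points rather than comparing full feature maps keeps the per-iteration cost independent of image resolution, which is consistent with the paper's stated concern about the training compute budget.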