AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring

Authors: Xinyi Wang, Na Zhao, Zhiyuan Han, Dan Guo, Xun Yang

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer. In AugRefer, our initial step involves devising a cross-modal augmentation mechanism to enrich 3D scenes by injecting objects and furnishing them with diverse and precise descriptions.
Researcher Affiliation Academia Xinyi Wang (1), Na Zhao (2)*, Zhiyuan Han (1), Dan Guo (3), Xun Yang (1) — (1) University of Science and Technology of China, (2) Singapore University of Technology and Design, (3) Hefei University of Technology
Pseudocode No Algorithm 1 in the supplementary material outlines the Plausible Insertion algorithm. The main paper text does not contain structured pseudocode or algorithm blocks.
Open Source Code No The paper does not explicitly state that code is provided or offer a link to a repository for the described methodology.
Open Datasets Yes We use three 3DVG datasets: ScanRefer (Chen et al. 2020), Nr3D (Achlioptas et al. 2020), and Sr3D (Achlioptas et al. 2020) to evaluate our method.
Dataset Splits Yes We use three 3DVG datasets: ScanRefer (Chen et al. 2020), Nr3D (Achlioptas et al. 2020), and Sr3D (Achlioptas et al. 2020) to evaluate our method.
Hardware Specification Yes Our experiments are conducted on four NVIDIA A100 80G GPUs, utilizing PyTorch and the AdamW optimizer.
Software Dependencies No The paper mentions 'PyTorch' but does not provide specific version numbers for software dependencies.
Experiment Setup Yes We adjust the batch size to 12 or 48 and augment training with 22.5k generated pairs for each dataset. The visual encoder's learning rate is set to 2e-3 for ScanRefer, while other layers are set to 2e-4 across 150 epochs. In contrast, Sr3D and Nr3D have learning rates of 1e-3 and 1e-4, respectively; Nr3D undergoes 200 epochs of training, whereas Sr3D requires only 100 epochs due to its simpler, template-generated descriptions.