AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring
Authors: Xinyi Wang, Na Zhao, Zhiyuan Han, Dan Guo, Xun Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer. In AugRefer, our initial step involves devising a cross-modal augmentation mechanism to enrich 3D scenes by injecting objects and furnishing them with diverse and precise descriptions. |
| Researcher Affiliation | Academia | Xinyi Wang¹, Na Zhao²*, Zhiyuan Han¹, Dan Guo³, Xun Yang¹. ¹University of Science and Technology of China; ²Singapore University of Technology and Design; ³Hefei University of Technology. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | Algorithm 1 in the supplementary material outlines the Plausible Insertion algorithm. The main paper text does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that code is provided or offer a link to a repository for the described methodology. |
| Open Datasets | Yes | We use three 3DVG datasets: ScanRefer (Chen et al. 2020), Nr3D (Achlioptas et al. 2020), and Sr3D (Achlioptas et al. 2020) to evaluate our method. |
| Dataset Splits | Yes | We use three 3DVG datasets: ScanRefer (Chen et al. 2020), Nr3D (Achlioptas et al. 2020), and Sr3D (Achlioptas et al. 2020) to evaluate our method. |
| Hardware Specification | Yes | Our experiments are conducted on four NVIDIA A100 80G GPUs, utilizing PyTorch and the AdamW optimizer. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We adjust the batch size to 12 or 48 and augment training with 22.5k generated pairs for each dataset. The visual encoder's learning rate is set to 2e-3 for ScanRefer, while other layers are set to 2e-4 across 150 epochs. In contrast, SR3D and NR3D have learning rates of 1e-3 and 1e-4, respectively; NR3D undergoes 200 epochs of training, whereas SR3D requires only 100 epochs due to its simpler, template-generated descriptions. |
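The per-module learning rates quoted above (visual encoder at 2e-3, remaining layers at 2e-4 for ScanRefer, with AdamW) map naturally onto PyTorch parameter groups. A minimal sketch follows; the module names and layer shapes are placeholders, not the paper's actual architecture.

```python
# Sketch of the per-module learning-rate setup described for ScanRefer,
# using PyTorch parameter groups with AdamW. Module names are assumptions.
import torch
from torch import nn

model = nn.ModuleDict({
    "visual_encoder": nn.Linear(256, 256),  # stand-in for the 3D visual encoder
    "other_layers": nn.Linear(256, 256),    # stand-in for the remaining layers
})

optimizer = torch.optim.AdamW([
    {"params": model["visual_encoder"].parameters(), "lr": 2e-3},
    {"params": model["other_layers"].parameters(), "lr": 2e-4},
])
```

Each parameter group keeps its own learning rate throughout training, so a single optimizer step updates the encoder and the remaining layers at the rates reported in the paper.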