DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding

Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, Zhongchao Shi, Gao Huang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and the smaller mini subset, respectively, further advancing the SOTA in ego-centric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-View 3D Visual Grounding Track, validating its effectiveness and robustness.
Researcher Affiliation | Collaboration | 1 Department of Automation, BNRist, Tsinghua University; 2 AI Lab, Lenovo Research. EMAIL EMAIL {gaohuang}@tsinghua.edu.cn
Pseudocode | No | The paper describes methods in structured text and uses flowcharts (Figure 2) to illustrate the architecture. Figure 6 provides a sample input/output for the LLM prompt, but there is no explicit pseudocode or algorithm block labeled as such.
Open Source Code | No | Code
Open Datasets | Yes | The EmbodiedScan dataset (Wang et al., 2024), used in our experiments, is a large-scale, multi-modal, ego-centric dataset for comprehensive 3D scene understanding.
Dataset Splits | Yes | For benchmarking, the official dataset maintains a non-public test set for the test leaderboard and divides the original training set into new subsets for training and validation. In this paper, we refer to these as the training and validation sets, while the non-public test set is called the testing set. For the mini data in the Data column of Table 1 and the analysis experiments in Sec. 5.2, we use a smaller subset of the data as a proxy task when performing experiments. This subset is referred to as the mini set, available through the official release by Wang et al. (2024).
Hardware Specification | No | The paper describes the software components and training parameters but does not specify any hardware details such as GPU models, CPU, or memory used for the experiments.
Software Dependencies | No | The paper mentions using ResNet50, MinkNet34, a CLIP text encoder, and the AdamW optimizer, but it does not provide specific version numbers for the underlying libraries or frameworks (e.g., PyTorch version, specific library versions).
Experiment Setup | Yes | Our multi-view visual grounding model, DenseGrounding, is trained with the AdamW optimizer using a learning rate of 5e-4, weight decay of 5e-4, and a batch size of 48. The model is trained for 12 epochs, with the learning rate reduced by 0.1 at epochs 8 and 11. All other settings align with EmbodiedScan.
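The reported training configuration maps directly onto a standard PyTorch optimizer and step schedule. The sketch below shows only the stated hyperparameters (AdamW, lr 5e-4, weight decay 5e-4, 12 epochs, decay by 0.1 at epochs 8 and 11); the `model` and the training-loop body are placeholders, not the paper's actual implementation.

```python
import torch
from torch import nn

# Placeholder model standing in for the DenseGrounding network (assumption).
model = nn.Linear(16, 4)

# Hyperparameters as reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=5e-4)
# Reduce the learning rate by a factor of 0.1 at epochs 8 and 11.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[8, 11], gamma=0.1
)

for epoch in range(12):
    # ... one training epoch over batches of size 48 (omitted) ...
    scheduler.step()
```

After the full 12-epoch run, both milestones have fired, so the final learning rate is 5e-4 × 0.1 × 0.1 = 5e-6.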