Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection

Authors: Youjun Zhao, Jiaying Lin, Rynson W. H. Lau

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that the proposed method outperforms SOTA methods on the existing OV-3DOD benchmarks. It also achieves promising OV-3DOD results even without any 3D annotations." "We conduct extensive experiments on the OV-3DOD benchmarks. Our method achieves superior performance compared to existing state-of-the-art approaches, demonstrating its effectiveness for the OV-3DOD task."
Researcher Affiliation | Academia | Youjun Zhao*, Jiaying Lin*, Rynson W.H. Lau, Department of Computer Science, City University of Hong Kong (EMAIL, EMAIL, EMAIL)
Pseudocode | No | The paper describes methods like Hierarchical Data Integration (HDI), Interactive Cross-Modal Alignment (ICMA), and Object-Focusing Context Adjustment (OFCA) using descriptive text and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and extended version: https://youjunzhao.github.io/HCMA/
Open Datasets | Yes | "ScanNet (Dai et al. 2017) is a widely used 3D object detection dataset. ... SUN RGB-D (Song, Lichtenberg, and Xiao 2015) is another popular 3D object detection dataset."
Dataset Splits | No | The paper mentions using the ScanNet and SUN RGB-D datasets and conducting evaluations. It states "Our experimental setup follows that of OV-3DET (Lu et al. 2023) for fair comparison." However, it does not explicitly provide specific percentages, sample counts, or detailed methodologies for training, validation, and test splits within the provided text.
Hardware Specification | Yes | "Experiments are conducted on a single RTX 4090 GPU."
Software Dependencies | No | The paper mentions specific models like "3DETR (Misra, Girdhar, and Joulin 2021) as our 3D detector backbone" and "pre-trained CLIP image and text encoders," and the "AdamW optimizer." However, it does not provide specific version numbers for software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | "We train our model using the AdamW optimizer with a cosine learning rate scheme. The base learning rate and the weight decay are set to 10^-4 and 0.1, respectively. The temperature parameter τ is set to 0.1 in contrastive learning. We adopt 3DETR (Misra, Girdhar, and Joulin 2021) as our 3D detector backbone. The number of object queries for 3DETR is set to 128. Experiments are conducted on a single RTX 4090 GPU. The number of training epochs is the same as for the baseline method OV-3DET."
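The training recipe above (cosine learning-rate decay from a base of 10^-4, and temperature-scaled contrastive alignment with τ = 0.1) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are ours, and a standard symmetric InfoNCE loss is assumed for the cross-modal contrastive objective.

```python
import numpy as np

def cosine_lr(step, total_steps, base_lr=1e-4):
    """Cosine schedule decaying from base_lr at step 0 to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * step / total_steps))

def _log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    m = x.max(axis=1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=1, keepdims=True))

def contrastive_loss(feat_3d, feat_text, tau=0.1):
    """Symmetric InfoNCE over L2-normalized 3D and text embeddings.

    Matching 3D/text pairs sit on the diagonal of the (N, N) similarity
    matrix; tau is the temperature (0.1 in the paper).
    """
    feat_3d = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    feat_text = feat_text / np.linalg.norm(feat_text, axis=1, keepdims=True)
    logits = feat_3d @ feat_text.T / tau
    idx = np.arange(logits.shape[0])
    loss_3d_to_text = -_log_softmax(logits)[idx, idx].mean()
    loss_text_to_3d = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_3d_to_text + loss_text_to_3d)
```

A small temperature such as 0.1 sharpens the softmax over similarities, so the loss concentrates on the hardest negative pairs during alignment.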