Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

Authors: Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan

ICLR 2025

Reproducibility assessment (Variable: Result — LLM Response)
Research Type: Experimental — supported by Section 5 (EXPERIMENTS), 5.3 (EVALUATION METRICS), 5.4 (QUANTITATIVE COMPARISONS), and 5.5 (ABLATION STUDIES): "Tab. 1 and Tab. 2 show quantitative comparisons on the val and test sets of Intent3D."
Researcher Affiliation: Academia — Weitai Kang (1), Mengxue Qu (2), Jyoti Kini (3), Yunchao Wei (2), Mubarak Shah (3), Yan Yan (1); 1 = University of Illinois Chicago, 2 = Beijing Jiaotong University, 3 = University of Central Florida.
Pseudocode: No — The paper does not contain structured pseudocode or algorithm blocks. It provides diagrams (Fig. 4, Fig. 17) illustrating the model architecture and loss functions, but no textual pseudocode.
Open Source Code: Yes — Code: https://github.com/WeitaiKang/Intent3D. Project: https://weitaikang.github.io/Intent3D-webpage/.
Open Datasets: Yes — "To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet [Dai et al., 2017] dataset."
Dataset Splits: Yes — The dataset is split into train, val, and test sets containing 35,850, 2,285, and 6,855 samples, respectively, each with disjoint scenes. The train set comes from ScanNet's train split, while the val and test sets are derived from its val split.
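The reported split sizes can be checked against the dataset total quoted above; this is a minimal consistency check using only the numbers stated in the paper (the variable names are ours):

```python
# Reported Intent3D split sizes (from the paper's dataset description).
splits = {"train": 35850, "val": 2285, "test": 6855}

total = sum(splits.values())
print(total)  # 44990 — matches the 44,990 intention texts reported
```

The three splits sum exactly to the 44,990 intention texts, so the split figures are internally consistent.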
Hardware Specification: No — The paper does not provide specific hardware details such as GPU/CPU models or memory amounts used for the experiments; it only reports general training parameters and model configurations.
Software Dependencies: No — The paper mentions PointNet++ [Qi et al., 2017] and RoBERTa [Liu et al., 2019] as backbones, spaCy [Honnibal et al., 2020] for text processing, and ChatGPT (GPT-4) for data generation, but it does not specify version numbers for general software dependencies or libraries such as PyTorch or CUDA.
Experiment Setup: Yes — IntentNet is trained from scratch for 90 epochs with a batch size of 24. The learning rate is 0.001 for PointNet++ and 0.0001 for the rest of the network, decaying by 0.1 at the 65th epoch; RoBERTa is frozen. The number of point tokens is 1024, the maximum text-token length is 256, and the hidden dimension is 288. For BUTD-DETR [Jain et al., 2022], the official ScanRefer [Chen et al., 2020] configuration is followed: batch size 24, the same learning rates as above decaying to one-tenth at the 65th epoch, converging in 100 epochs. For EDA [Wu et al., 2023], the official configuration is also followed: batch size 48, learning rate 0.002 for the backbones and 0.0002 for the rest, decaying to one-tenth at the 50th and 75th epochs, converging in 104 epochs. For 3D-VisTA [Zhu et al., 2023], the official ScanRefer configuration is used: batch size 64, learning rate 0.0001, and a 5,000-step warm-up; although the official schedule is 100 epochs, the model converges by the 47th epoch. Chat-3D v2 takes 3 epochs to fine-tune.
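The IntentNet schedule described above (two parameter-group learning rates, a single 0.1 step decay at epoch 65) can be sketched as a plain function; this is a hedged illustration of the reported values, not code from the paper, and the function name is ours:

```python
def lr_at_epoch(epoch, base_lr, decay_epoch=65, gamma=0.1):
    """Step-decay schedule: multiply base_lr by gamma from decay_epoch onward."""
    return base_lr * (gamma if epoch >= decay_epoch else 1.0)

# Reported IntentNet learning rates (per parameter group).
backbone_lr = 0.001   # PointNet++ backbone
rest_lr = 0.0001      # remainder of the network

print(lr_at_epoch(0, backbone_lr))    # 0.001 before the decay point
print(lr_at_epoch(65, backbone_lr))   # 0.0001 from epoch 65 onward
print(lr_at_epoch(65, rest_lr))       # 1e-05 from epoch 65 onward
```

In a framework such as PyTorch, the same effect is typically achieved with two optimizer parameter groups and a milestone-based step scheduler at epoch 65.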