Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Authors: Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Anwer, Salman Khan, Fahad Khan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to a 16× speedup compared to the best existing method in the literature. On the ScanNet200 val. set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. github.com/aminebdj/OpenYOLO3D
Researcher Affiliation | Academia | Mohamed El Amine Boudjoghra (TUM, MBZUAI); Angela Dai (TUM); Jean Lahoud (MBZUAI); Hisham Cholakkal (MBZUAI); Rao Muhammad Anwer (MBZUAI, Aalto University); Salman Khan (MBZUAI, ANU); Fahad Shahbaz Khan (MBZUAI, Linköping University)
Pseudocode | No | The paper describes the methodology using text and diagrams (e.g., Figure 2 for the overall pipeline) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | On the ScanNet200 val. set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. github.com/aminebdj/OpenYOLO3D
Open Datasets | Yes | We conduct our experiments using the ScanNet200 (Rozenberszki et al., 2022) and Replica (Straub et al., 2019) datasets.
Dataset Splits | Yes | Our analysis on ScanNet200 is based on its validation set, comprising 312 scenes. For the 3D instance segmentation task, we utilize the 200 predefined categories from the ScanNet200 annotations. ... We use RGB-depth pairs from the ScanNet200 and Replica datasets, processing every 10th frame for ScanNet200 and all frames for Replica, maintaining the same settings as OpenMask3D for fair comparison.
Hardware Specification | Yes | We use a single NVIDIA A100 40GB GPU for all experiments.
Software Dependencies | No | The paper mentions several models and frameworks, such as YOLO-World (Cheng et al., 2024), Mask3D (Schult et al., 2023), SAM (Kirillov et al., 2023), and CLIP (Zhang et al., 2023), but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | No | The paper states: 'We use RGB-depth pairs from the ScanNet200 and Replica datasets, processing every 10th frame for ScanNet200 and all frames for Replica, maintaining the same settings as OpenMask3D for fair comparison.' and 'To create LG label maps, we use the YOLO-World (Cheng et al., 2024) extra-large model for its real-time capability and high zero-shot performance.' While it mentions specific models and some processing choices, it lacks the concrete numerical hyperparameters (e.g., learning rate, batch size, number of epochs, optimizer settings) in the main text that would be required for full reproducibility.
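The frame-sampling choice quoted above (every 10th RGB-D frame for ScanNet200, all frames for Replica) can be sketched as a small helper. This is an illustrative sketch, not code from the paper's repository; the function name and dataset strings are assumptions.

```python
# Hedged sketch of the per-dataset frame subsampling described in the report:
# ScanNet200 uses every 10th RGB-D frame, Replica uses every frame.
# `select_frames` and the dataset identifiers are illustrative names.

def select_frames(frame_ids, dataset):
    """Return the subset of RGB-D frame indices to process for a dataset."""
    stride = 10 if dataset == "scannet200" else 1  # Replica keeps all frames
    return frame_ids[::stride]

scannet_frames = select_frames(list(range(100)), "scannet200")  # 10 frames
replica_frames = select_frames(list(range(100)), "replica")     # 100 frames
```

Subsampling the video stream this way trades a denser set of 2D views for per-scene runtime, which is consistent with the paper's emphasis on fast inference.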