Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
Authors: Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to 16× speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. github.com/aminebdj/OpenYOLO3D |
| Researcher Affiliation | Academia | Mohamed El Amine Boudjoghra (TUM, MBZUAI); Angela Dai (TUM); Jean Lahoud (MBZUAI); Hisham Cholakkal (MBZUAI); Rao Muhammad Anwer (MBZUAI, Aalto University); Salman Khan (MBZUAI, ANU); Fahad Shahbaz Khan (MBZUAI, Linköping University) |
| Pseudocode | No | The paper describes the methodology using text and diagrams (e.g., Figure 2 for the overall pipeline) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. github.com/aminebdj/OpenYOLO3D |
| Open Datasets | Yes | We conduct our experiments using the ScanNet200 Rozenberszki et al. (2022) and Replica Straub et al. (2019) datasets. |
| Dataset Splits | Yes | Our analysis on ScanNet200 is based on its validation set, comprising 312 scenes. For the 3D instance segmentation task, we utilize the 200 predefined categories from the ScanNet200 annotations. ... We use RGB-depth pairs from the ScanNet200 and Replica datasets, processing every 10th frame for ScanNet200 and all frames for Replica, maintaining the same settings as OpenMask3D for fair comparison. |
| Hardware Specification | Yes | We use a single NVIDIA A100 40GB GPU for all experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks like YOLO-World Cheng et al. (2024), Mask3D Schult et al. (2023), SAM Kirillov et al. (2023), and CLIP Zhang et al. (2023) but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | No | The paper states: 'We use RGB-depth pairs from the ScanNet200 and Replica datasets, processing every 10th frame for ScanNet200 and all frames for Replica, maintaining the same settings as OpenMask3D for fair comparison.' and 'To create LG label maps, we use the YOLO-World Cheng et al. (2024) extra-large model for its real-time capability and high zero-shot performance.' While it mentions specific models and some processing choices, it lacks concrete numerical hyperparameters (e.g., learning rate, batch size, number of epochs, optimizer settings) within the main text required for full reproducibility. |
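The frame-subsampling setting quoted in the Experiment Setup row (every 10th RGB-depth frame for ScanNet200, all frames for Replica) can be sketched as follows. The helper name and `stride` parameter are illustrative assumptions, not from the paper or its released code:

```python
def select_frames(frame_ids, stride=10):
    """Keep every `stride`-th RGB-D frame; stride=1 keeps all frames."""
    return frame_ids[::stride]

# ScanNet200 setting: every 10th frame of a scene's RGB-D sequence.
scannet_frames = select_frames(list(range(100)), stride=10)
# Replica setting: all frames.
replica_frames = select_frames(list(range(100)), stride=1)
```

Subsampling like this trades some multi-view coverage for a proportional reduction in 2D inference cost, which is consistent with the paper's emphasis on per-scene runtime.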