Unifying 2D and 3D Vision-Language Understanding
Authors: Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Sergio Arnaud, Ada Martin, Alexander Sax, Franziska Meier, Katerina Fragkiadaki
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test UniVLG on established 2D and 3D vision-language benchmarks (Achlioptas et al., 2020; Chen et al., 2020a). We find that when trained exclusively on 3D data, UniVLG achieves state-of-the-art performance across all established benchmarks, outperforming prior methods in comparable settings by more than 15%. Furthermore, co-training UniVLG with 2D data enhances its 3D performance even further, both on in-domain and out-of-domain benchmarks. Notably, this improvement does not come at the expense of 2D tasks: UniVLG retains strong performance on 2D referential grounding datasets (Kazemzadeh et al., 2014) compared to its version which is trained only on 2D referential grounding data. |
| Researcher Affiliation | Collaboration | *Equal contribution. 1Carnegie Mellon University, 2Meta Inc. Correspondence to: Ayush Jain <EMAIL>, Alexander Swerdlow <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training objectives using mathematical equations and textual explanations, for example, the query refinement process is detailed with X, Q, T, V variables and Norm, Cross Attention, Self Attention operations. However, it does not present a clearly labeled pseudocode or algorithm block with numbered steps. |
| Open Source Code | Yes | Code and additional visualizations are available at univlg.github.io. We make our code publicly available at univlg.github.io. |
| Open Datasets | Yes | We test UniVLG on established 2D and 3D vision-language benchmarks (Achlioptas et al., 2020; Chen et al., 2020a). We train our model on the 3D referential grounding datasets of SR3D, NR3D (Achlioptas et al., 2020) and ScanRefer (Chen et al., 2020a) and 3D instance segmentation datasets of ScanNet200 (Rozenberszki et al., 2022) and Matterport3D (Chang et al., 2017). In addition to the 3D datasets, we also train our model on 2D referential grounding datasets with RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al., 2014), and the 2D image segmentation dataset COCO (Lin et al., 2014). We test UniVLG on the ScanQA (Azuma et al., 2022) and SQA3D (Ma et al., 2022) question answering benchmarks. We evaluate our model and baselines on L3DD (Arnaud et al., 2025), an out-of-domain 3D language grounding dataset that spans ScanNet (Dai et al., 2017), ScanNet++ (Yeshwanth et al., 2023), ARKitScenes (Baruch et al., 2021), and HM3D (Yadav et al., 2023). |
| Dataset Splits | Yes | We evaluate top-1 accuracy on the official validation set, either assuming ground-truth proposals (GT) or without assuming them (Det). We show results in Table 3 on the validation sets of these benchmarks. We report additional standard metrics used by the ScanQA benchmark in Table 12. The results are shown in Table 9 on the official validation splits of these benchmarks. |
| Hardware Specification | Yes | We train in data-parallel across 32 A100 80G GPUs with an effective batch size of 64. Our method provides for fast inference, with a 90-frame scene taking 1050ms and 15GB of VRAM on an A100 GPU. |
| Software Dependencies | No | We encode each RGB image independently using a DINOv2 ViT encoder (Oquab et al., 2024)... We embed the natural language query using Jina-CLIP (Koukounas et al., 2024)... Our mask decoder head draws inspiration from Mask2Former (Cheng et al., 2022)... the decoder of a pre-trained T5 model (Raffel et al., 2020)... We use Jina-CLIP (Koukounas et al., 2024) as the text-encoder... We use an 88M-parameter Swin (Liu et al., 2021) image-encoder. We use a DINOv2 (Oquab et al., 2024) backbone... LLM-based methods of 3D-LLM (Hong et al., 2023) and NaviLLM (Zheng et al., 2024) which use BLIP2-FlanT5 (Li et al., 2023) and Vicuna-7B (Peng et al., 2023a) as their answer generation heads. We compare with 3D-VisTA (Zhu et al., 2023b) and PQ3D (Zhu et al., 2024b) which use small decoder heads like T5-small (Raffel et al., 2020). |
| Experiment Setup | Yes | Implementation details: UniVLG consists of 108M trainable parameters along with a frozen 220M-parameter text-encoder (Koukounas et al., 2024) and a 304M-parameter image-encoder (Oquab et al., 2024). For ablations in Tables 7 and 5, we use an 88M-parameter Swin (Liu et al., 2021) image-encoder. We train in data-parallel across 32 A100 80G GPUs with an effective batch size of 64. We use the ScanEnts3D (Abdelreheem et al., 2023) version of ScanRefer (Chen et al., 2020a) and ReferIt3D (Achlioptas et al., 2020), which provides object annotations for all noun words in the language sentence. During training, we process either a sequence of N posed RGB-D images, or a single RGB image. For 2D images, we apply a 2D-to-3D lifting strategy with a 50% probability. When lifted, the images pass through all 2D-3D layers; otherwise, they remain in 2D space, skipping the 3D attention layers. At test time, we retain 2D images in their original space to prevent noise from predicted 3D pointmaps from impacting 2D performance. For 3D scenes, we compute CLIP embeddings for all images and captions and use these to select 5 relevant frames, with an additional 10 frames coming from Furthest-Point-Sampling (FPS) in the CLIP embedding space, for a total of 15 frames. At test time, we feed all images in a scene to our model. For validation results, we perform span prediction to identify the primary subject from a given utterance. We use Jina-CLIP (Koukounas et al., 2024) as the text-encoder, as it supports arbitrary input length. We jointly train our model on all datasets, with text generation loss only active on question answering datasets. Our method enables fast inference, with a 90-frame scene taking 1050ms and 15GB of VRAM on an A100 GPU. Mask Loss: We match queries to ground-truth instances using Hungarian Matching (Carion et al., 2020). We supervise the matched queries' predicted masks with both a Binary Cross Entropy (BCE) and a Dice loss, following Mask2Former (Cheng et al., 2022). Text Span Loss: Similar to prior works (Li et al., 2022; Kamath et al., 2021; Jain et al., 2022), we match the predicted 3D object segmentations to the relevant noun phrases in the input utterance through a dot-product between the object queries and the language tokens, generating the distribution Gi over the input text sentence for the ith query. Box Loss: We observe a failure mode in our model where, when trained with the aforementioned objectives, some masks include a small number of distant, unrelated points, or multiple instances of the same object category are predicted by a single object query (see Figure 5 in Appendix). To address this, we introduce a novel box loss. This loss computes an enclosing 3D bounding box for each predicted mask and supervises it using standard box prediction losses, L1 and Generalized Intersection-over-Union (GIoU) (Rezatofighi et al., 2019), against the ground-truth bounding boxes. We incorporate this box loss as an additional cost in both Hungarian matching and the final loss. Text Generation Loss: For question answering tasks, our model decodes a text utterance as an output. We supervise the generated text with the ground-truth text answer using standard cross-entropy loss. |
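The frame-selection strategy quoted in the Experiment Setup row (5 caption-relevant frames plus 10 frames via Furthest-Point-Sampling in CLIP embedding space) can be sketched as follows. This is a hypothetical illustration, not the authors' code; the function name, array shapes, and use of NumPy are assumptions:

```python
import numpy as np

def select_frames(frame_embs, caption_emb, n_relevant=5, n_fps=10):
    """Pick frames most similar to the caption embedding, then add
    diverse frames via Furthest-Point-Sampling (FPS) in embedding space."""
    # Normalize so dot products are cosine similarities.
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)

    # 1) Top-k frames by similarity to the caption.
    sims = frame_embs @ caption_emb
    selected = list(np.argsort(-sims)[:n_relevant])

    # 2) FPS: repeatedly add the frame farthest (in embedding space)
    #    from everything selected so far.
    for _ in range(n_fps):
        dists = np.linalg.norm(
            frame_embs[:, None, :] - frame_embs[selected][None, :, :], axis=-1
        ).min(axis=1)
        dists[selected] = -1.0  # never re-pick an already selected frame
        selected.append(int(np.argmax(dists)))
    return selected
```

With the paper's defaults this returns 15 distinct frame indices per scene; the FPS step trades caption relevance for viewpoint/content diversity in the embedding space.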
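The box loss described in the Experiment Setup row can be illustrated with a minimal single-box sketch: compute the axis-aligned 3D box enclosing a predicted mask's points, then supervise it with an L1 term and a 3D GIoU penalty against the ground-truth box. This is a hypothetical sketch under the assumption of axis-aligned `[xmin, ymin, zmin, xmax, ymax, zmax]` boxes, not the authors' implementation:

```python
import numpy as np

def box_loss(pred_mask_points, gt_box):
    """L1 + (1 - GIoU) between the box enclosing a predicted mask's
    3D points and the ground-truth box."""
    # Axis-aligned box enclosing the predicted mask's points.
    pred_box = np.concatenate([pred_mask_points.min(0), pred_mask_points.max(0)])

    # L1 term against the ground-truth box corners.
    l1 = np.abs(pred_box - gt_box).mean()

    def volume(b):
        return np.clip(b[3:] - b[:3], 0, None).prod()

    # Intersection and union volumes.
    inter = np.clip(np.minimum(pred_box[3:], gt_box[3:])
                    - np.maximum(pred_box[:3], gt_box[:3]), 0, None).prod()
    union = volume(pred_box) + volume(gt_box) - inter

    # Smallest box enclosing both, for the GIoU penalty term.
    enc = np.clip(np.maximum(pred_box[3:], gt_box[3:])
                  - np.minimum(pred_box[:3], gt_box[:3]), 0, None).prod()

    giou = inter / union - (enc - union) / enc
    return l1 + (1.0 - giou)
```

A distant outlier point stretches the enclosing box, inflating both the L1 and GIoU terms, which is how this loss penalizes the stray-point failure mode the paper describes; the paper additionally uses the box cost inside Hungarian matching, which this sketch omits.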