Evaluating Compositional Scene Understanding in Multimodal Generative Models
Authors: Shuhao Fu, Andrew Jun Lee, Yixin Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor Whittington Webb
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. |
| Researcher Affiliation | Collaboration | Shuhao Fu*, Andrew Jun Lee*, Anna Wang (Department of Psychology, University of California, Los Angeles); Ida Momennejad (Microsoft Research, NYC); Trevor Bihl (Air Force Research Laboratory); Hongjing Lu (Department of Psychology and Department of Statistics, University of California, Los Angeles); Taylor Webb (Microsoft Research, NYC). The affiliations include an academic institution (University of California, Los Angeles) and industry/government research labs (Microsoft Research, Air Force Research Laboratory), indicating a collaboration. |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or presented in a structured format in the paper. |
| Open Source Code | Yes | All experimental materials, code, and data are available at: https://github.com/andrewjlee0/evaluating_compositionality_VLMs |
| Open Datasets | Yes | We first evaluated relational concept learning with real-world scenes, using the Bongard-HOI dataset (Jiang et al., 2022). |
| Dataset Splits | Yes | For Bongard-HOI: in the standard evaluation methodology, 6 labeled images from each class are presented (12 total), and the remaining positive and negative images are presented for classification. Here, evaluation was performed by presenting 9 labeled example images (randomly selecting either 5 positive examples and 4 negative examples, or vice versa) followed by a single query image. For SVRT: for each problem, we presented 1-9 few-shot examples, consisting of a mixture of positive and negative instances (in random order). |
| Hardware Specification | Yes | Two open-source VLMs: QWEN2-VL-72B and InternVL2.5-38B (these were evaluated locally on a workstation with 4 NVIDIA GPUs). |
| Software Dependencies | Yes | Images were generated by prompting DALL-E 3 (Betker et al., 2023), a text-to-image model developed by OpenAI, through the Microsoft Azure API (version 2024-02-01 for all experiments). To systematically assess the impact of prompt likelihood on the validity of the generated images, we measured the plausibility of each text prompt using the GPT-3 language model from OpenAI (the davinci-002 engine, available through the Microsoft Azure API). |
| Experiment Setup | Yes | We generated 10 images for each prompt, with the following hyperparameters: quality set to "standard", style set to "natural", and image size set to 1024×1024. Temperature was set to 0 when evaluating both models, top-p was set to 1, and the detail parameter was set to "high". |
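The generation and evaluation settings quoted in the Experiment Setup row can be sketched as request parameters for the `openai` Python SDK's `AzureOpenAI` client. This is a minimal, hedged reconstruction, not the authors' actual code: the deployment names (`dall-e-3`, `gpt-4o`) and environment-variable names are placeholders, and only the hyperparameters (quality, style, size, temperature, top-p, detail) come from the table above.

```python
import os


def dalle3_request_params(prompt: str) -> dict:
    """One image-generation request with the reported DALL-E 3 settings."""
    return {
        "model": "dall-e-3",      # placeholder deployment name (assumption)
        "prompt": prompt,
        "quality": "standard",    # as reported in the paper
        "style": "natural",
        "size": "1024x1024",
        "n": 1,                   # DALL-E 3 returns one image per call,
    }                             # so 10 images per prompt means 10 calls


def vlm_query_params(image_b64: str, question: str) -> dict:
    """One vision-language query with the reported evaluation settings."""
    return {
        "model": "gpt-4o",        # placeholder deployment name (assumption)
        "temperature": 0,         # as reported in the paper
        "top_p": 1,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {
                     "url": f"data:image/png;base64,{image_b64}",
                     "detail": "high",   # as reported in the paper
                 }},
            ],
        }],
    }


def generate_images(prompt: str, num_images: int = 10):
    """Issue the actual Azure calls (requires credentials to run)."""
    from openai import AzureOpenAI  # pip install openai
    client = AzureOpenAI(
        api_version="2024-02-01",   # API version cited in the paper
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
    )
    return [client.images.generate(**dalle3_request_params(prompt))
            for _ in range(num_images)]
```

The parameter-building functions are separated from the network call so the reported settings can be inspected without Azure credentials.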