Evaluating Compositional Scene Understanding in Multimodal Generative Models
Authors: Shuhao Fu, Andrew Jun Lee, Yixin Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor Whittington Webb
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. |
| Researcher Affiliation | Collaboration | Shuhao Fu*, Andrew Jun Lee*, Anna Wang (Department of Psychology, University of California, Los Angeles); Ida Momennejad (Microsoft Research, NYC); Trevor Bihl (Air Force Research Laboratory); Hongjing Lu (Department of Psychology and Department of Statistics, University of California, Los Angeles); Taylor Webb (Microsoft Research, NYC). The affiliations include an academic institution (University of California, Los Angeles) and industry/government research labs (Microsoft Research, Air Force Research Laboratory), indicating a collaboration. |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or presented in a structured format in the paper. |
| Open Source Code | Yes | All experimental materials, code, and data are available at: https://github.com/andrewjlee0/evaluating_compositionality_VLMs |
| Open Datasets | Yes | We first evaluated relational concept learning with real-world scenes, using the Bongard-HOI dataset (Jiang et al., 2022). |
| Dataset Splits | Yes | For Bongard-HOI: in the standard evaluation methodology, 6 labeled images from each class are presented (12 total), and the remaining positive and negative images are presented for classification. Here, evaluation was performed by presenting 9 labeled example images (randomly selecting either 5 positive examples and 4 negative examples, or vice versa) followed by a single query image. For SVRT: for each problem, we presented 1-9 few-shot examples, consisting of a mixture of positive and negative instances (in random order). |
| Hardware Specification | Yes | Two open-source VLMs: QWEN2-VL-72B and InternVL2.5-38B (these were evaluated locally on a workstation with 4 NVIDIA GPUs). |
| Software Dependencies | Yes | Images were generated by prompting DALL-E 3 (Betker et al., 2023), a text-to-image model developed by OpenAI, through the Microsoft Azure API (version 2024-02-01 for all experiments). To systematically assess the impact of prompt likelihood on the validity of the generated images, we measured the plausibility of each text prompt using the GPT-3 language model from OpenAI (the davinci-002 engine, available through the Microsoft Azure API). |
| Experiment Setup | Yes | We generated 10 images for each prompt, with the following hyperparameters: quality set to "standard", style set to "natural", and image size set to 1024×1024. Temperature was set to 0 when evaluating both models, top-p was set to 1, and the detail parameter was set to "high". |
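The generation and evaluation settings quoted in the Experiment Setup row can be sketched as request parameters for the `openai` Python SDK's `AzureOpenAI` client. This is a minimal, hedged reconstruction, not the authors' actual code: the deployment names (`dall-e-3`, `gpt-4o`) and environment-variable names are placeholders, and only the hyperparameters (quality, style, size, temperature, top-p, detail) come from the table above.

```python
import os


def dalle3_request_params(prompt: str) -> dict:
    """One image-generation request with the reported DALL-E 3 settings."""
    return {
        "model": "dall-e-3",      # placeholder deployment name (assumption)
        "prompt": prompt,
        "quality": "standard",    # as reported in the paper
        "style": "natural",
        "size": "1024x1024",
        "n": 1,                   # DALL-E 3 returns one image per call,
    }                             # so 10 images per prompt means 10 calls


def vlm_query_params(image_b64: str, question: str) -> dict:
    """One vision-language query with the reported evaluation settings."""
    return {
        "model": "gpt-4o",        # placeholder deployment name (assumption)
        "temperature": 0,         # as reported in the paper
        "top_p": 1,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {
                     "url": f"data:image/png;base64,{image_b64}",
                     "detail": "high",   # as reported in the paper
                 }},
            ],
        }],
    }


def generate_images(prompt: str, num_images: int = 10):
    """Issue the actual Azure calls (requires credentials to run)."""
    from openai import AzureOpenAI  # pip install openai
    client = AzureOpenAI(
        api_version="2024-02-01",   # API version cited in the paper
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
    )
    return [client.images.generate(**dalle3_request_params(prompt))
            for _ in range(num_images)]
```

The parameter-building functions are separated from the network call so the reported settings can be inspected without Azure credentials.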