VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

Authors: Nilay Yilmaz, Maitreya Patel, Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, 'YZ' Yezhou Yang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when following a multi-step strategy of least-to-most prompting. Comprehensive evaluations on open-source models and GPT-4o show that on text-based answers, the best accuracy for challenging scenarios is 13% (LLaMA 3.2) and even for simpler tasks is only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.
Researcher Affiliation Academia Arizona State University; University of Maryland, Baltimore County
Pseudocode Yes Algorithm 1 Visual Analogy Generation
Open Source Code Yes Code and data: github.com/nlylmz/Voila
Open Datasets Yes To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
Dataset Splits Yes For evaluating MLLMs, the VOILA benchmark is split into two distinct test datasets: VOILA-WD, which includes Distraction rules, and VOILA-ND, which excludes them. VOILA-WD consists of 10K unique questions, applying all four rules across 19 structures, while VOILA-ND comprises 3.6K questions with seven configurations and three rules (see Appendix A.4 for details). Each dataset contains 527 questions per configuration, with 728 images annotated with image contents, relationship explanations, and descriptions of the requested image.
Hardware Specification No We thank the NSF NAIRR initiative, the Research Computing (RC) at Arizona State University (ASU), and cr8dl.ai for their generous support in providing computing resources.
Software Dependencies No We employ the open-source SDXL model, which generates high-quality images based on a simple text prompt structure that includes the number of subjects, subject types, and actions, for example, "Two dogs walking". Complex or overly detailed prompts can lead to incorrect image generation (Podell et al., 2023), so we maintain straightforward prompt structures. After generating text prompts, the images for the analogy questions are produced using the SDXL pipeline, with output resolution set to 1024x1024 and a guidance scale of 8, which controls the fidelity to the text prompt.
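The "count + subject + action" prompt template the quote describes can be sketched as a small helper. This is a hypothetical reconstruction, not code from the VOILA repository; the function name `build_prompt` and the naive pluralization are assumptions made for illustration.

```python
# Hedged sketch of the simple SDXL prompt structure described above:
# number of subjects + subject type + action, e.g. "Two dogs walking".
# build_prompt is a hypothetical helper, not from the VOILA codebase.

NUMBER_WORDS = {1: "One", 2: "Two", 3: "Three", 4: "Four"}

def build_prompt(count: int, subject: str, action: str) -> str:
    """Compose a minimal text prompt: '<Count> <subject(s)> <action>'."""
    word = NUMBER_WORDS.get(count, str(count))
    noun = subject if count == 1 else subject + "s"  # naive English pluralization
    return f"{word} {noun} {action}"
```

In the paper's setup, prompts of this shape are then fed to the SDXL pipeline at 1024x1024 resolution with a guidance scale of 8; keeping the template this flat is what avoids the mis-generation issues the quote attributes to overly detailed prompts.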
Experiment Setup Yes We employ the open-source SDXL model, which generates high-quality images based on a simple text prompt structure that includes the number of subjects, subject types, and actions, for example, "Two dogs walking". Complex or overly detailed prompts can lead to incorrect image generation (Podell et al., 2023), so we maintain straightforward prompt structures. After generating text prompts, the images for the analogy questions are produced using the SDXL pipeline, with output resolution set to 1024x1024 and a guidance scale of 8, which controls the fidelity to the text prompt. For each prompt, 30 images are generated. Appendix A.2 provides examples of generated images and their text descriptions. [...] We applied the Least-to-Most (L2M) prompting strategy (Zhou et al., 2023) and manually decomposed the visual analogy task into four sub-problems: (1) understanding the visual content, (2) identifying relationships between images, (3) applying those relationships to the third image, and (4) generating the content of the fourth image. Instead of using sub-questions, we employed sub-instructions, asking the models to solve each sub-task sequentially, with the previous answer appended to the next problem. This structured reasoning process allowed us to evaluate performance at each sub-task. We tested various prompts on the baseline models using both L2M and direct answer approaches. The prompts used for multi-step reasoning and direct answering are detailed in Appendix C.1.
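The four-step least-to-most chaining described in the quote can be sketched as follows. The sub-instruction wording and the `model` callable are placeholders (any prompt-to-answer function, not a specific MLLM API); only the control flow, posing the sub-tasks in order with each previous answer appended to the next prompt, reflects the setup described.

```python
# Hedged sketch of the L2M sub-instruction chaining described above.
# The instruction texts are paraphrases of the paper's four sub-problems;
# `model` is a placeholder callable (prompt -> answer), not a real MLLM API.

SUB_INSTRUCTIONS = [
    "Describe the visual content of each image.",
    "Identify the relationships between the first and second images.",
    "Apply those relationships to the third image.",
    "Generate the content of the fourth image.",
]

def least_to_most(model, question: str) -> list[str]:
    """Pose the sub-instructions sequentially, carrying answers forward."""
    context = question
    answers = []
    for instruction in SUB_INSTRUCTIONS:
        answer = model(f"{context}\n\n{instruction}")
        answers.append(answer)
        # Append the previous answer so the next sub-task builds on it.
        context = f"{context}\n{instruction}\n{answer}"
    return answers
```

Because each sub-task returns its own answer, this structure also supports the paper's per-sub-task evaluation: the intermediate answers can be scored independently rather than only the final one.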