Can We Talk Models Into Seeing the World Differently?

Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results in Fig. 2 paint a fairly uniform picture across different models and on two different tasks. Overall, the shape bias of VLMs is still significantly lower than that of humans (96%), but higher than in typical image-only discriminative classifiers (e.g., 22% for an ImageNet-trained ResNet-50 (He et al., 2015; Geirhos et al., 2019)).
Researcher Affiliation | Collaboration | Paul Gavrikov (1, 2, 3, 4); Jovita Lukasik (5); Steffen Jung (2, 6); Robert Geirhos (7); M. Jehanzeb Mirza (8); Margret Keuper (2, 6); Janis Keuper (1, 2). Affiliations: 1 IMLA, Offenburg University; 2 University of Mannheim; 3 Tübingen AI Center; 4 Goethe University Frankfurt; 5 University of Siegen; 6 Max Planck Institute for Informatics, Saarland Informatics Campus; 7 Google DeepMind; 8 MIT CSAIL
Pseudocode | No | The paper describes its methodology in prose and through experimental results and figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code, evaluation results containing model answers for each sample, prompts generated by the automated prompt search, and the frequency-cue-conflict dataset, along with its creation scripts, are available at: https://github.com/paulgavrikov/vlm_shapebias.
Open Datasets | Yes | Like most studies on the texture/shape bias in vision models, we use the texture-shape cue-conflict classification problem (cue-conflict) (Geirhos et al., 2019) consisting of 1,280 samples with conflicting shape and texture cues synthetically generated via a style transfer model (Gatys et al., 2016) from ImageNet (Deng et al., 2009) samples... The source code, evaluation results [...] and the frequency-cue-conflict dataset, along with its creation scripts, are available at: https://github.com/paulgavrikov/vlm_shapebias.
Dataset Splits | No | The paper uses the texture-shape cue-conflict dataset (Geirhos et al., 2019) with 1,280 samples and a newly created frequency-cue-conflict dataset with 1,200 samples. While it mentions that "the optimization is done for the cue-conflict test set" in Section 5.1 regarding automated prompt engineering, it does not provide specific percentages, sample counts, or a detailed methodology for creating these splits for its own experiments.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types) used to run its experiments.
Software Dependencies | Yes | We embed the generated descriptions and all (raw) class labels using ember-v1 (Nur & Aliyev, 2024) and predict the class with the smallest cosine distance... As an additional signal, we perform a more granular analysis using an additional LLM (Nous-Hermes-2-Mixtral-8x7B-DPO (Teknium et al., 2024))... We utilized an LLM to optimize prompts... We switched to Mixtral-8x7B-Instruct-v0.1.
Experiment Setup | Yes | For VQA Classification, we ask the model "Which option best describes the image?" and provide an alphabetic enumeration of all class labels in the style "A. airplane"... We end the prompt by instructing the model to answer with only the letter corresponding to the correct answer ("Answer with the option's letter from the given choices directly.")... In this task, we are instructing models to generate brief descriptions ("Describe the image. Keep your response short.").
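The shape-bias figures quoted above (96% for humans, 22% for a ResNet-50) follow the standard definition from Geirhos et al. (2019): among responses that match either the shape or the texture label of a cue-conflict image, the fraction that match the shape label. A minimal sketch of that metric, assuming hypothetical per-sample decision records (the function name and input format are illustrative, not taken from the paper's released code):

```python
def shape_bias(decisions):
    """Shape bias per Geirhos et al. (2019): of the responses that match
    either cue of a cue-conflict image, the fraction matching the shape cue.
    `decisions` holds one of "shape", "texture", or "other" per sample."""
    shape_hits = sum(1 for d in decisions if d == "shape")
    texture_hits = sum(1 for d in decisions if d == "texture")
    if shape_hits + texture_hits == 0:
        return float("nan")  # no cue-matching responses at all
    return shape_hits / (shape_hits + texture_hits)

# Hypothetical decisions for five cue-conflict samples:
decisions = ["shape", "texture", "shape", "other", "shape"]
print(shape_bias(decisions))  # 3 / (3 + 1) = 0.75
```

Note that responses matching neither cue ("other") are excluded from the denominator, which is why shape bias and overall accuracy are reported separately.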
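The Image Captioning evaluation described under Software Dependencies scores generated descriptions by embedding them alongside the raw class labels (with ember-v1 in the paper) and picking the label with the smallest cosine distance. A minimal, dependency-free sketch of that matching step, using toy vectors in place of real ember-v1 embeddings (all names and values here are illustrative):

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def classify_description(desc_emb, label_embs, labels):
    """Predict the class whose label embedding is closest
    (smallest cosine distance) to the description embedding."""
    dists = [cosine_distance(desc_emb, e) for e in label_embs]
    return labels[dists.index(min(dists))]

# Toy stand-ins for ember-v1 embeddings (hypothetical):
labels = ["airplane", "cat"]
label_embs = [[1.0, 0.0], [0.0, 1.0]]
desc_emb = [0.9, 0.1]  # description embedding closer to "airplane"
print(classify_description(desc_emb, label_embs, labels))  # airplane
```

In practice each description and label would be embedded with the same sentence-embedding model, so the distances are comparable across labels.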