Can We Talk Models Into Seeing the World Differently?
Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results in Fig. 2 paint a fairly uniform picture across different models and on two different tasks. Overall, the shape bias of VLMs is still significantly lower than that of humans (96%), but higher than in typical image-only discriminative classifiers (e.g., 22% for an ImageNet-trained ResNet-50 (He et al., 2015; Geirhos et al., 2019)). |
| Researcher Affiliation | Collaboration | Paul Gavrikov (1,2,3,4), Jovita Lukasik (5), Steffen Jung (2,6), Robert Geirhos (7), M. Jehanzeb Mirza (8), Margret Keuper (2,6), Janis Keuper (1,2). Affiliations: 1 IMLA, Offenburg University; 2 University of Mannheim; 3 Tübingen AI Center; 4 Goethe University Frankfurt; 5 University of Siegen; 6 Max Planck Institute for Informatics, Saarland Informatics Campus; 7 Google DeepMind; 8 MIT CSAIL |
| Pseudocode | No | The paper describes methodologies in prose and through experimental results and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code, evaluation results containing model answers for each sample, prompts generated by the automated prompt search, and the frequency-cue-conflict dataset, along with its creation scripts, are available at: https://github.com/paulgavrikov/vlm_shapebias. |
| Open Datasets | Yes | Like most studies on the texture/shape bias in vision models, we use the texture-shape cue-conflict classification problem (cue-conflict) (Geirhos et al., 2019) consisting of 1,280 samples with conflicting shape and texture cues synthetically generated via a style transfer model (Gatys et al., 2016) from Image Net (Deng et al., 2009) samples... The source code, evaluation results [...] and the frequency-cue-conflict dataset, along with its creation scripts, are available at: https://github.com/paulgavrikov/vlm_shapebias. |
| Dataset Splits | No | The paper uses the texture-shape cue-conflict dataset (Geirhos et al., 2019) with 1,280 samples and a newly created frequency-cue-conflict dataset with 1,200 samples. While it mentions that "the optimization is done for the cue-conflict test set" in Section 5.1 regarding automated prompt engineering, it does not provide specific percentages, sample counts, or a detailed methodology for creating these splits for its own experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | Yes | We embed the generated descriptions and all (raw) class labels using ember-v1 (Nur & Aliyev, 2024) and predict the class with the smallest cosine distance... As an additional signal, we perform a more granular analysis using an additional LLM (Nous-Hermes-2-Mixtral-8x7B-DPO (Teknium et al., 2024))... We utilized an LLM to optimize prompts. We switched to Mixtral-8x7B-Instruct-v0.1. |
| Experiment Setup | Yes | For VQA Classification, we ask the model "Which option best describes the image?" and provide an alphabetic enumeration of all class labels in the style "A. airplane"... We end the prompt by instructing the model to answer with only the letter corresponding to the correct answer ("Answer with the option's letter from the given choices directly.")... In this task, we are instructing models to generate brief descriptions ("Describe the image. Keep your response short."). |
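The Software Dependencies row describes the paper's caption-scoring step: embed the generated description and every class label, then predict the label with the smallest cosine distance to the description. A minimal sketch of that nearest-label step, using NumPy and toy hand-made vectors in place of real ember-v1 embeddings (the embedding model itself is not reimplemented here):

```python
import numpy as np

def predict_label(description_emb, label_embs, labels):
    """Return the label whose embedding has the smallest cosine distance
    (equivalently, the largest cosine similarity) to the description."""
    d = description_emb / np.linalg.norm(description_emb)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = L @ d  # cosine similarity of each label to the description
    return labels[int(np.argmax(sims))]

# Toy 3-d "embeddings" standing in for ember-v1 outputs.
labels = ["airplane", "bear", "bicycle"]
label_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
desc = np.array([0.9, 0.1, 0.2])  # closest to "airplane"
print(predict_label(desc, label_embs, labels))  # airplane
```

Because all vectors are normalized, minimizing cosine distance and maximizing the dot product pick the same label.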
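The Experiment Setup row specifies the VQA prompt format: the fixed question, an alphabetic enumeration of class labels ("A. airplane", ...), and a closing instruction to answer with the letter only. A small sketch of how such a prompt could be assembled (the helper name `build_vqa_prompt` is illustrative, not from the paper's code):

```python
def build_vqa_prompt(class_labels):
    """Assemble a multiple-choice VQA prompt: question, alphabetically
    enumerated options, and a letter-only answer instruction."""
    lines = ["Which option best describes the image?"]
    for i, label in enumerate(class_labels):
        lines.append(f"{chr(ord('A') + i)}. {label}")  # "A. airplane", ...
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)

print(build_vqa_prompt(["airplane", "bear", "bicycle"]))
```

Restricting the answer to a single letter makes the model's response trivially parsable, which matters when scoring 1,280 cue-conflict samples automatically.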