Can We Talk Models Into Seeing the World Differently?

Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results in Fig. 2 paint a fairly uniform picture across different models and on two different tasks. Overall, the shape bias of VLMs is still significantly lower than that of humans (96%), but higher than in typical image-only discriminative classifiers (e.g., 22% for an ImageNet-trained ResNet-50 (He et al., 2015; Geirhos et al., 2019)).
Researcher Affiliation | Collaboration | Paul Gavrikov (1, 2, 3, 4); Jovita Lukasik (5); Steffen Jung (2, 6); Robert Geirhos (7); M. Jehanzeb Mirza (8); Margret Keuper (2, 6); Janis Keuper (1, 2). Affiliations: 1 IMLA, Offenburg University; 2 University of Mannheim; 3 Tübingen AI Center; 4 Goethe University Frankfurt; 5 University of Siegen; 6 Max Planck Institute for Informatics, Saarland Informatics Campus; 7 Google DeepMind; 8 MIT CSAIL
Pseudocode | No | The paper describes its methodology in prose and through experimental results and figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code, evaluation results containing model answers for each sample, prompts generated by the automated prompt search, and the frequency-cue-conflict dataset, along with its creation scripts, are available at: https://github.com/paulgavrikov/vlm_shapebias.
Open Datasets | Yes | Like most studies on the texture/shape bias in vision models, we use the texture-shape cue-conflict classification problem (cue-conflict) (Geirhos et al., 2019) consisting of 1,280 samples with conflicting shape and texture cues synthetically generated via a style transfer model (Gatys et al., 2016) from ImageNet (Deng et al., 2009) samples... The source code, evaluation results [...] and the frequency-cue-conflict dataset, along with its creation scripts, are available at: https://github.com/paulgavrikov/vlm_shapebias.
Dataset Splits | No | The paper uses the texture-shape cue-conflict dataset (Geirhos et al., 2019) with 1,280 samples and a newly created frequency-cue-conflict dataset with 1,200 samples. While it mentions that "the optimization is done for the cue-conflict test set" in Section 5.1 regarding automated prompt engineering, it does not provide specific percentages, sample counts, or a detailed methodology for creating these splits for its own experiments.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types) used to run its experiments.
Software Dependencies | Yes | We embed the generated descriptions and all (raw) class labels using ember-v1 (Nur & Aliyev, 2024) and predict the class with the smallest cosine distance... As an additional signal, we perform a more granular analysis using an additional LLM (Nous-Hermes-2-Mixtral-8x7B-DPO (Teknium et al., 2024))... We utilized an LLM to optimize prompts... We switched to Mixtral-8x7B-Instruct-v0.1.
Experiment Setup | Yes | For VQA Classification, we ask the model "Which option best describes the image?" and provide an alphabetic enumeration of all class labels in the style "A. airplane"... We end the prompt by instructing the model to answer with only the letter corresponding to the correct answer ("Answer with the option's letter from the given choices directly.")... In this task, we are instructing models to generate brief descriptions ("Describe the image. Keep your response short.").
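The shape-bias figures quoted above (96% for humans, 22% for a ResNet-50) follow the standard definition from Geirhos et al. (2019): among responses that match either the shape or the texture label of a cue-conflict image, the fraction that match the shape label. A minimal sketch of that metric, assuming hypothetical per-sample decision records (the function name and input format are illustrative, not taken from the paper's released code):

```python
def shape_bias(decisions):
    """Shape bias per Geirhos et al. (2019): of the responses that match
    either cue of a cue-conflict image, the fraction matching the shape cue.
    `decisions` holds one of "shape", "texture", or "other" per sample."""
    shape_hits = sum(1 for d in decisions if d == "shape")
    texture_hits = sum(1 for d in decisions if d == "texture")
    if shape_hits + texture_hits == 0:
        return float("nan")  # no cue-matching responses at all
    return shape_hits / (shape_hits + texture_hits)

# Hypothetical decisions for five cue-conflict samples:
decisions = ["shape", "texture", "shape", "other", "shape"]
print(shape_bias(decisions))  # 3 / (3 + 1) = 0.75
```

Note that responses matching neither cue ("other") are excluded from the denominator, which is why shape bias and overall accuracy are reported separately.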
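The Image Captioning evaluation described under Software Dependencies scores generated descriptions by embedding them alongside the raw class labels (with ember-v1 in the paper) and picking the label with the smallest cosine distance. A minimal, dependency-free sketch of that matching step, using toy vectors in place of real ember-v1 embeddings (all names and values here are illustrative):

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def classify_description(desc_emb, label_embs, labels):
    """Predict the class whose label embedding is closest
    (smallest cosine distance) to the description embedding."""
    dists = [cosine_distance(desc_emb, e) for e in label_embs]
    return labels[dists.index(min(dists))]

# Toy stand-ins for ember-v1 embeddings (hypothetical):
labels = ["airplane", "cat"]
label_embs = [[1.0, 0.0], [0.0, 1.0]]
desc_emb = [0.9, 0.1]  # description embedding closer to "airplane"
print(classify_description(desc_emb, label_embs, labels))  # airplane
```

In practice each description and label would be embedded with the same sentence-embedding model, so the distances are comparable across labels.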