Vision-Language Models Create Cross-Modal Task Representations
Authors: Grace Luo, Trevor Darrell, Amir Bar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate cross-modal transfer. We measure the performance of text examples applied to image queries on our six cross-modal tasks, following the same procedure illustrated in Figure 3. We evaluate our entire collection of early- and late-fusion models: LLaVA-v1.5, Mantis-Fuyu, and Idefics2. We ablate two key axes of cross-modal patching (Text Examples Patch): the application method (Patch vs. Prompt) and the specification modality (Text vs. Image Examples). We also provide the performance of two lower bounds: the majority answer from the examples (Random) and the query without any task information (No Context). |
| Researcher Affiliation | Academia | 1University of California, Berkeley, USA. Correspondence to: Grace Luo <EMAIL>. |
| Pseudocode | Yes | Figure 12: PyTorch-like pseudocode for the continuous visualization shown in Figure 14. |
| Open Source Code | No | The paper mentions "vlm-cross-modal-reps.github.io" as a project page, but it does not explicitly state that the source code for the methodology described in the paper is available at this link, nor does it provide a direct repository link or mention code in supplementary materials. |
| Open Datasets | Yes | Beyond the synthetic tasks in our main evaluation set, we automatically construct an in-the-wild evaluation set derived from VQAv2 (Goyal et al., 2017), a visual question-answering dataset consisting of images and question-answer pairs. The labels are derived from the mammals categorized in iNaturalist (2021). We use 148 overlapping images with conflicting questions from OK-VQA (Marino et al., 2019) and A-OKVQA (Schwenk et al., 2022). |
| Dataset Splits | Yes | For each task, we split the example pool into 30 samples for validation and 100 for testing, where the split is kept consistent across modalities. Each sample is then used as a query, where its corresponding answer is the ground-truth label. The same procedure yields a 30-sample validation and 100-sample test set centered around the same task. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. While computational overhead is mentioned in terms of VRAM usage, the specific hardware is not identified. |
| Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" and references using Claude 3.5 Sonnet and GPT-4o, but it does not specify version numbers for PyTorch or any other core software dependencies used to implement their models or experimental setup. |
| Experiment Setup | Yes | When conditioning on examples, we use the generic template from Todd et al. (2024): Q:{x1} A:{y1} … Q:{xn} A:{yn}, where we evaluate with N = 5 examples. For instructions, we pass the raw string with no templating. We determine the best layer to patch for each model via average task accuracy on the validation set. We report metrics on the unseen test set, averaged over three seeds. We resize images to a standard width of 224 pixels. |
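The generic few-shot template from Todd et al. (2024) quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, assuming newline-separated Q/A pairs and an empty answer slot for the query; the exact separators used in the paper are not specified, and `build_prompt` is a hypothetical helper name.

```python
def build_prompt(examples, query=None):
    """Format in-context examples as 'Q:{x} A:{y}' pairs, optionally
    followed by a query whose answer is left for the model to complete."""
    parts = [f"Q:{x} A:{y}" for x, y in examples]
    if query is not None:
        parts.append(f"Q:{query} A:")
    return "\n".join(parts)

# N = 5 examples, as in the paper's evaluation setting.
examples = [("red", "color"), ("dog", "animal"), ("two", "number"),
            ("blue", "color"), ("cat", "animal")]
print(build_prompt(examples, query="seven"))
```

For the Prompt condition the examples are passed to the model as this string; for the Patch condition they are instead used to compute activations that are patched into the query forward pass.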
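The split procedure in the Dataset Splits row (30 validation / 100 test samples, with the split held fixed across modalities) could be realized with a seeded shuffle so that the text and image variants of each sample land on the same side of the split. This is a sketch under those assumptions; `split_pool` is an illustrative name, not from the paper.

```python
import random

def split_pool(sample_ids, n_val=30, n_test=100, seed=0):
    """Split an example pool into validation and test subsets.
    Splitting by sample ID with a fixed seed keeps the split
    consistent across modalities that share those IDs."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    return ids[:n_val], ids[n_val:n_val + n_test]
```

Because the split is keyed on sample IDs rather than on modality-specific data, applying it to the text pool and the image pool of the same task produces paired validation and test sets.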