Vision-Language Models Create Cross-Modal Task Representations

Authors: Grace Luo, Trevor Darrell, Amir Bar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate cross-modal transfer. We measure the performance of text examples applied to image queries on our six cross-modal tasks, following the same procedure illustrated in Figure 3. We evaluate our entire collection of early- and late-fusion models: LLaVA-v1.5, Mantis-Fuyu, and Idefics2. We ablate two key axes of cross-modal patching (Text Examples Patch): the application method (Patch vs. Prompt) and the specification modality (Text vs. Image Examples). We also provide the performance of two lower bounds: the majority answer from the examples (Random) and the query without any task information (No Context).
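The "Random" lower bound described above answers every query with the majority answer among the in-context examples. A minimal sketch (the function name `random_baseline` is hypothetical, not from the paper's code):

```python
from collections import Counter

def random_baseline(example_answers):
    # "Random" lower bound: regardless of the query, predict the most
    # common answer among the in-context example answers.
    counts = Counter(example_answers)
    return counts.most_common(1)[0][0]
```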
Researcher Affiliation | Academia | University of California, Berkeley, USA. Correspondence to: Grace Luo <EMAIL>.
Pseudocode | Yes | Figure 12: PyTorch-like pseudocode for the continuous visualization shown in Figure 14.
Open Source Code | No | The paper mentions "vlm-cross-modal-reps.github.io" as a project page, but it does not explicitly state that the source code for the methodology described in the paper is available at this link, nor does it provide a direct repository link or mention code in supplementary materials.
Open Datasets | Yes | Beyond the synthetic tasks in our main evaluation set, we automatically construct an in-the-wild evaluation set derived from VQAv2 (Goyal et al., 2017), a visual question-answering dataset consisting of images and question-answer pairs. The labels are derived from the mammals categorized in iNaturalist (2021). We use 148 overlapping images with conflicting questions from OK-VQA (Marino et al., 2019) and A-OKVQA (Schwenk et al., 2022).
Dataset Splits | Yes | For each task, we split the example pool into 30 samples for validation and 100 for testing, where the split is kept consistent across modalities. Each sample is then used as a query, where its corresponding answer is the ground-truth label, yielding a 30-sample validation set and a 100-sample test set centered around the same task.
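A minimal sketch of such a modality-consistent split, assuming each sample has an ID shared between its text and image renderings (the function name and seed are assumptions, not from the paper):

```python
import random

def split_example_pool(sample_ids, n_val=30, n_test=100, seed=0):
    # Shuffle the shared sample IDs once, then slice. Reusing the same
    # IDs for the text and image versions of each sample keeps the
    # validation/test split consistent across modalities.
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    assert len(ids) >= n_val + n_test
    return ids[:n_val], ids[n_val:n_val + n_test]
```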
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. While computational overhead is mentioned in terms of VRAM usage, the specific hardware is not identified.
Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" and references using Claude 3.5 Sonnet and GPT-4o, but it does not specify version numbers for PyTorch or any other core software dependencies used to implement their models or experimental setup.
Experiment Setup | Yes | When conditioning on examples, we use the generic template from Todd et al. (2024): Q:{x1} A:{y1} … Q:{xn} A:{yn}, where we evaluate with N = 5 examples. For instructions, we pass the raw string with no templating. We determine the best layer to patch for each model via average task accuracy on the validation set. We report metrics on the unseen test set, averaged over three seeds. We resize images to a standard width of 224 pixels.
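The example template above can be sketched as a simple formatter. The exact whitespace between pairs is an assumption, since only the Q:/A: pattern is shown:

```python
def build_prompt(examples, query):
    # Generic template from Todd et al. (2024): one "Q:{x} A:{y}" line
    # per in-context example, then the query with an empty answer slot.
    parts = [f"Q:{x} A:{y}" for x, y in examples]
    parts.append(f"Q:{query} A:")
    return "\n".join(parts)
```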