Vision-Language Models Create Cross-Modal Task Representations
Authors: Grace Luo, Trevor Darrell, Amir Bar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate cross-modal transfer. We measure the performance of text examples applied to image queries on our six cross-modal tasks, following the same procedure illustrated in Figure 3. We evaluate our entire collection of early- and late-fusion models: LLaVA-v1.5, Mantis-Fuyu, and Idefics2. We ablate two key axes of cross-modal patching (Text Examples Patch): the application method (Patch vs. Prompt) and the specification modality (Text vs. Image Examples). We also provide the performance of two lower bounds: the majority answer from the examples (Random) and the query without any task information (No Context). |
| Researcher Affiliation | Academia | 1University of California, Berkeley, USA. Correspondence to: Grace Luo <EMAIL>. |
| Pseudocode | Yes | Figure 12: PyTorch-like pseudocode for the continuous visualization shown in Figure 14. |
| Open Source Code | No | The paper mentions "vlm-cross-modal-reps.github.io" as a project page, but it does not explicitly state that the source code for the methodology described in the paper is available at this link, nor does it provide a direct repository link or mention code in supplementary materials. |
| Open Datasets | Yes | Beyond the synthetic tasks in our main evaluation set, we automatically construct an in-the-wild evaluation set derived from VQAv2 (Goyal et al., 2017), a visual question-answering dataset consisting of images and question-answer pairs. The labels are derived from the mammals categorized in iNaturalist (2021). We use 148 overlapping images with conflicting questions from OK-VQA (Marino et al., 2019) and A-OKVQA (Schwenk et al., 2022). |
| Dataset Splits | Yes | For each task, we split the example pool into 30 samples for validation and 100 for testing, where the split is kept consistent across modalities. Each sample is then used as a query, where its corresponding answer is the ground-truth label. The same procedure yields a 30-sample validation and 100-sample test set centered around the same task. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. While computational overhead is mentioned in terms of VRAM usage, the specific hardware is not identified. |
| Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" and references using Claude 3.5 Sonnet and GPT-4o, but it does not specify version numbers for PyTorch or any other core software dependencies used to implement their models or experimental setup. |
| Experiment Setup | Yes | When conditioning on examples, we use the generic template from Todd et al. (2024): Q:{x1} A:{y1} … Q:{xn} A:{yn}, where we evaluate with N = 5 examples. For instructions, we pass the raw string with no templating. We determine the best layer to patch for each model via average task accuracy on the validation set. We report metrics on the unseen test set, averaged over three seeds. We resize images to a standard width of 224 pixels. |
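The generic few-shot template from Todd et al. (2024) quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, assuming newline-separated Q/A pairs and an empty answer slot for the query; the exact separators used in the paper are not specified, and `build_prompt` is a hypothetical helper name.

```python
def build_prompt(examples, query=None):
    """Format in-context examples as 'Q:{x} A:{y}' pairs, optionally
    followed by a query whose answer is left for the model to complete."""
    parts = [f"Q:{x} A:{y}" for x, y in examples]
    if query is not None:
        parts.append(f"Q:{query} A:")
    return "\n".join(parts)

# N = 5 examples, as in the paper's evaluation setting.
examples = [("red", "color"), ("dog", "animal"), ("two", "number"),
            ("blue", "color"), ("cat", "animal")]
print(build_prompt(examples, query="seven"))
```

For the Prompt condition the examples are passed to the model as this string; for the Patch condition they are instead used to compute activations that are patched into the query forward pass.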
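The split procedure in the Dataset Splits row (30 validation / 100 test samples, with the split held fixed across modalities) could be realized with a seeded shuffle so that the text and image variants of each sample land on the same side of the split. This is a sketch under those assumptions; `split_pool` is an illustrative name, not from the paper.

```python
import random

def split_pool(sample_ids, n_val=30, n_test=100, seed=0):
    """Split an example pool into validation and test subsets.
    Splitting by sample ID with a fixed seed keeps the split
    consistent across modalities that share those IDs."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    return ids[:n_val], ids[n_val:n_val + n_test]
```

Because the split is keyed on sample IDs rather than on modality-specific data, applying it to the text pool and the image pool of the same task produces paired validation and test sets.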