LLMs can see and hear without any training

Authors: Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We now empirically evaluate MILS and compare it to existing approaches on some of the multimodal understanding and generation tasks enabled by it. For each of the downstream applications, we describe the GENERATOR, SCORER, benchmarks and evaluation setup, followed by the key results. Finally, in Section 4.7 we ablate the various design choices in MILS."
Researcher Affiliation | Collaboration | ¹Meta AI, ²UT Austin, ³UC Berkeley. "Correspondence to: Kumar Ashutosh <EMAIL>, Rohit Girdhar <EMAIL>."
Pseudocode | No | The paper describes the MILS approach conceptually using GENERATOR and SCORER modules and flow diagrams, but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code to reproduce MILS is available at https://github.com/facebookresearch/MILS."
Open Datasets | Yes | "We evaluate MILS on the MSCOCO captioning test set (Karpathy & Fei-Fei, 2015). It consists of 5,000 images sampled from the MSCOCO dataset (Lin et al., 2014). We experiment on the MSR-VTT (Xu et al., 2016) test set, which contains 2,990 videos... We evaluate our approach on a popular audio captioning dataset, Clotho (Drossos et al., 2020)."
Dataset Splits | Yes | "We evaluate MILS on the MSCOCO captioning test set (Karpathy & Fei-Fei, 2015). It consists of 5,000 images sampled from the MSCOCO dataset (Lin et al., 2014). We experiment on the MSR-VTT (Xu et al., 2016) test set, which contains 2,990 videos. For computational ease, we randomly sample 1000 images from MSCOCO for captioning, and use the 200-prompt DrawBench set for image generation, as the test set for this analysis."
Hardware Specification | No | The paper mentions "modern GPUs" in the context of inference-time optimization but provides no specific details about the hardware (e.g., GPU models, CPU models, memory) used for its experiments.
Software Dependencies | No | The paper refers to specific models such as Llama 3.1 8B, CLIP, and SigLIP, but does not list general software dependencies or libraries with version numbers (e.g., Python, PyTorch, CUDA) that are typically required for reproducibility.
Experiment Setup | Yes | "We generate an initial list of 30K prompts that we use to bootstrap the optimization process... Then for each optimization step, we keep the top-50 highest scoring generations from the SCORER... We run the optimization process for 10 steps. We use the Llama 3.1 8B (Dubey et al., 2024) LLM as the core generation module."
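The experiment setup quoted above implies a simple iterative loop: score a candidate pool, keep the top-50, and ask the generator for new proposals, repeated for 10 steps. The following is a minimal runnable sketch of that loop under stated assumptions: the paper's GENERATOR (Llama 3.1 8B) and SCORER (CLIP/SigLIP similarity) are replaced by toy word-overlap stand-ins so the code runs without any models, and the names `mils_optimize`, `toy_score`, and `toy_generate` are illustrative, not from the paper.

```python
import random

def mils_optimize(candidates, generate, score, top_k=50, steps=10):
    """Keep the top-k scored candidates each step and feed them
    back to the generator for the next round of proposals
    (the loop structure described in the paper's setup)."""
    for _ in range(steps):
        survivors = sorted(candidates, key=score, reverse=True)[:top_k]
        candidates = survivors + generate(survivors)
    return max(candidates, key=score)

# Toy stand-ins: the paper uses Llama 3.1 8B as the GENERATOR and a
# CLIP/SigLIP SCORER; here we score word overlap against a fixed
# target caption purely for illustration.
target_words = "dog playing fetch in park".split()

def toy_score(text):
    # Count how many target words appear in the candidate.
    return sum(w in text.split() for w in target_words)

def toy_generate(survivors):
    # Propose shuffled candidate captions (stand-in for LLM rewriting).
    return [" ".join(random.sample(target_words, len(target_words)))
            for _ in survivors]

random.seed(0)
best = mils_optimize(["a dog", "the park", "a cat"],
                     toy_generate, toy_score, top_k=2, steps=5)
print(best, toy_score(best))
```

In the real setting, `candidates` would start from the 30K bootstrap prompts, `top_k` would be 50, and the scorer would compare each caption against the input image, video, or audio clip.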