LLMs can see and hear without any training

Authors: Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We now empirically evaluate MILS and compare it to existing approaches on some of the multimodal understanding and generation tasks enabled by it. For each of the downstream applications, we describe the GENERATOR, SCORER, benchmarks and evaluation setup, followed by the key results. Finally, in Section 4.7 we ablate the various design choices in MILS."
Researcher Affiliation | Collaboration | ¹Meta AI, ²UT Austin, ³UC Berkeley. "Correspondence to: Kumar Ashutosh <EMAIL>, Rohit Girdhar <EMAIL>."
Pseudocode | No | The paper describes the MILS approach conceptually using GENERATOR and SCORER modules and flow diagrams, but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code to reproduce MILS is available at https://github.com/facebookresearch/MILS."
Open Datasets | Yes | "We evaluate MILS on the MSCOCO captioning test set (Karpathy & Fei-Fei, 2015). It consists of 5,000 images sampled from the MSCOCO dataset (Lin et al., 2014). We experiment on the MSR-VTT (Xu et al., 2016) test set, which contains 2,990 videos... We evaluate our approach on a popular audio captioning dataset, Clotho (Drossos et al., 2020)."
Dataset Splits | Yes | "We evaluate MILS on the MSCOCO captioning test set (Karpathy & Fei-Fei, 2015). It consists of 5,000 images sampled from the MSCOCO dataset (Lin et al., 2014). We experiment on the MSR-VTT (Xu et al., 2016) test set, which contains 2,990 videos. For computational ease, we randomly sample 1000 images from MSCOCO for captioning, and use the 200-prompt DrawBench set for image generation, as the test set for this analysis."
Hardware Specification | No | The paper mentions "modern GPUs" in the context of inference-time optimization but provides no specific details about the hardware (e.g., GPU models, CPU models, memory) used for its experiments.
Software Dependencies | No | The paper refers to specific models such as Llama 3.1 8B, CLIP, and SigLIP, but does not list general software dependencies or libraries with version numbers (e.g., Python, PyTorch, CUDA) that are typically required for reproducibility.
Experiment Setup | Yes | "We generate an initial list of 30K prompts that we use to bootstrap the optimization process... Then for each optimization step, we keep the top-50 highest scoring generations from the SCORER... We run the optimization process for 10 steps. We use the Llama 3.1 8B (Dubey et al., 2024) LLM as the core generation module."
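The experiment setup quoted above implies a simple iterative loop: score a candidate pool, keep the top-50, and ask the generator for new proposals, repeated for 10 steps. The following is a minimal runnable sketch of that loop under stated assumptions: the paper's GENERATOR (Llama 3.1 8B) and SCORER (CLIP/SigLIP similarity) are replaced by toy word-overlap stand-ins so the code runs without any models, and the names `mils_optimize`, `toy_score`, and `toy_generate` are illustrative, not from the paper.

```python
import random

def mils_optimize(candidates, generate, score, top_k=50, steps=10):
    """Keep the top-k scored candidates each step and feed them
    back to the generator for the next round of proposals
    (the loop structure described in the paper's setup)."""
    for _ in range(steps):
        survivors = sorted(candidates, key=score, reverse=True)[:top_k]
        candidates = survivors + generate(survivors)
    return max(candidates, key=score)

# Toy stand-ins: the paper uses Llama 3.1 8B as the GENERATOR and a
# CLIP/SigLIP SCORER; here we score word overlap against a fixed
# target caption purely for illustration.
target_words = "dog playing fetch in park".split()

def toy_score(text):
    # Count how many target words appear in the candidate.
    return sum(w in text.split() for w in target_words)

def toy_generate(survivors):
    # Propose shuffled candidate captions (stand-in for LLM rewriting).
    return [" ".join(random.sample(target_words, len(target_words)))
            for _ in survivors]

random.seed(0)
best = mils_optimize(["a dog", "the park", "a cat"],
                     toy_generate, toy_score, top_k=2, steps=5)
print(best, toy_score(best))
```

In the real setting, `candidates` would start from the 30K bootstrap prompts, `top_k` would be 50, and the scorer would compare each caption against the input image, video, or audio clip.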