Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach

Authors: Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights. Code: [...] Finally, we conduct comprehensive validation for the theoretical framework and show that our theorems empirically hold in real-world benchmarks.
Researcher Affiliation | Academia | 1Department of Computer Sciences, University of Wisconsin-Madison, WI, USA 2Faculty of Engineering & Information Technology, University of Technology Sydney, Sydney, Australia. Correspondence to: Yixuan Li <EMAIL>.
Pseudocode | No | The paper describes methods and theoretical derivations using mathematical formulas and prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights. Code:
Open Datasets | Yes | Specifically, we adopt LLaVA-1.5 (Liu et al., 2023) and LLaVA-NeXT (Liu et al., 2024a) in 7B and 13B sizes as our target MLLM, with LLaVA-Bench COCO (Liu et al., 2023) serving as the ID dataset [...] We adopt LLaVA-Bench Wild (Liu et al., 2023) to vary visual input semantics [...] we adopt LLaVA-Med instruction dataset (Li et al., 2024) as a domain-specific open-ended benchmark on the medical images and corresponding questions.
Dataset Splits | No | The paper describes how out-of-distribution scenarios were constructed and evaluated (e.g., "34 synthetic and 27 natural shifts spanning 61 shift scenarios in total"), but it does not provide specific train/test/validation splits (e.g., percentages or exact counts) for the underlying datasets like LLaVA-Bench COCO, nor does it refer to standard predefined splits with citations for reproducibility of data partitioning.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU or CPU models, memory details) used to run the experiments.
Software Dependencies | No | The paper mentions several tools and models like CLUB (Cheng et al., 2020), RJSD (Hoyos-Osorio & Sanchez-Giraldo, 2023), CLIP-ViT-B/32 (Radford et al., 2021), and GPT-4 (Hurst et al., 2024). However, it does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | we parameterize the qψ as a multi-variate Gaussian distribution and estimate the mean and variance parameters of Gaussian with separated two-layer MLPs with 250 hidden dimension size. During mini-batch training, those MLPs consume the concatenated input and response embeddings {[zxi, zyi]}N i=1 to produce a scalar estimate of MI, and they are simultaneously optimized by AdamW optimizer with learning rate 0.001 and batch size 1,024 for 5,000 iterations.
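The quoted setup (a Gaussian variational distribution q_ψ whose mean and variance are produced by two separate two-layer MLPs with 250 hidden units, trained with AdamW at learning rate 0.001) matches the CLUB-style MI estimator the paper cites (Cheng et al., 2020). A minimal PyTorch sketch of that construction is below; the class and function names, ReLU activations, and the Tanh bound on the log-variance head are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    """Sketch of a CLUB-style MI estimator with a Gaussian q_psi(z_y | z_x).

    Two separate two-layer MLPs (hidden size 250, per the quoted setup)
    produce the mean and log-variance of the Gaussian. Activation choices
    are assumptions for illustration.
    """

    def __init__(self, x_dim, y_dim, hidden=250):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim),
            nn.Tanh())  # keep log-variance bounded for stability (assumption)

    def log_likelihood(self, zx, zy):
        # Gaussian log-likelihood of paired samples under q_psi, up to constants.
        mu, logvar = self.mu(zx), self.logvar(zx)
        return (-((zy - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_estimate(self, zx, zy):
        # CLUB upper bound: E_p(x,y)[log q(y|x)] - E_p(x)E_p(y)[log q(y|x)].
        mu, logvar = self.mu(zx), self.logvar(zx)
        pos = (-((zy - mu) ** 2) / logvar.exp()).sum(dim=1)  # matched pairs
        neg = (-((zy.unsqueeze(0) - mu.unsqueeze(1)) ** 2)
               / logvar.exp().unsqueeze(1)).sum(dim=2)       # all cross pairs
        return (pos.mean() - neg.mean()) / 2.0

def train_club(zx, zy, iters=5000, lr=1e-3):
    """Fit q_psi by maximizing its log-likelihood on paired embeddings,
    mirroring the quoted AdamW / lr 0.001 / 5,000-iteration recipe."""
    est = CLUBEstimator(zx.shape[1], zy.shape[1])
    opt = torch.optim.AdamW(est.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = -est.log_likelihood(zx, zy)
        loss.backward()
        opt.step()
    return est
```

In practice the paper trains with mini-batches of 1,024 concatenated input/response embeddings; the full-batch loop above is a simplification for small toy data.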