Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach

Authors: Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights. Code: [...] Finally, we conduct comprehensive validation for the theoretical framework and show that our theorems empirically hold in real-world benchmarks.
Researcher Affiliation | Academia | 1Department of Computer Sciences, University of Wisconsin-Madison, WI, USA 2Faculty of Engineering & Information Technology, University of Technology Sydney, Sydney, Australia. Correspondence to: Yixuan Li <EMAIL>.
Pseudocode | No | The paper describes methods and theoretical derivations using mathematical formulas and prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights. Code:
Open Datasets | Yes | Specifically, we adopt LLaVA-1.5 (Liu et al., 2023) and LLaVA-NeXT (Liu et al., 2024a) in 7B and 13B sizes as our target MLLM, with LLaVA-Bench COCO (Liu et al., 2023) serving as the ID dataset [...] We adopt LLaVA-Bench Wild (Liu et al., 2023) to vary visual input semantics [...] we adopt LLaVA-Med instruction dataset (Li et al., 2024) as a domain-specific open-ended benchmark on the medical images and corresponding questions.
Dataset Splits | No | The paper describes how out-of-distribution scenarios were constructed and evaluated (e.g., "34 synthetic and 27 natural shifts spanning 61 shift scenarios in total"), but it does not provide specific train/test/validation splits (e.g., percentages or exact counts) for the underlying datasets like LLaVA-Bench COCO, nor does it refer to standard predefined splits with citations for reproducibility of data partitioning.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU or CPU models, memory details) used to run the experiments.
Software Dependencies | No | The paper mentions several tools and models like CLUB (Cheng et al., 2020), RJSD (Hoyos-Osorio & Sanchez-Giraldo, 2023), CLIP-ViT-B/32 (Radford et al., 2021), and GPT-4 (Hurst et al., 2024). However, it does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | we parameterize the qψ as a multi-variate Gaussian distribution and estimate the mean and variance parameters of Gaussian with separated two-layer MLPs with 250 hidden dimension size. During mini-batch training, those MLPs consume the concatenated input and response embeddings {[zxi, zyi]}N i=1 to produce a scalar estimate of MI, and they are simultaneously optimized by AdamW optimizer with learning rate 0.001 and batch size 1,024 for 5,000 iterations.
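The quoted setup (a Gaussian variational distribution q_ψ whose mean and variance are produced by two separate two-layer MLPs with 250 hidden units, trained with AdamW at learning rate 0.001) matches the CLUB-style MI estimator the paper cites (Cheng et al., 2020). A minimal PyTorch sketch of that construction is below; the class and function names, ReLU activations, and the Tanh bound on the log-variance head are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    """Sketch of a CLUB-style MI estimator with a Gaussian q_psi(z_y | z_x).

    Two separate two-layer MLPs (hidden size 250, per the quoted setup)
    produce the mean and log-variance of the Gaussian. Activation choices
    are assumptions for illustration.
    """

    def __init__(self, x_dim, y_dim, hidden=250):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim),
            nn.Tanh())  # keep log-variance bounded for stability (assumption)

    def log_likelihood(self, zx, zy):
        # Gaussian log-likelihood of paired samples under q_psi, up to constants.
        mu, logvar = self.mu(zx), self.logvar(zx)
        return (-((zy - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_estimate(self, zx, zy):
        # CLUB upper bound: E_p(x,y)[log q(y|x)] - E_p(x)E_p(y)[log q(y|x)].
        mu, logvar = self.mu(zx), self.logvar(zx)
        pos = (-((zy - mu) ** 2) / logvar.exp()).sum(dim=1)  # matched pairs
        neg = (-((zy.unsqueeze(0) - mu.unsqueeze(1)) ** 2)
               / logvar.exp().unsqueeze(1)).sum(dim=2)       # all cross pairs
        return (pos.mean() - neg.mean()) / 2.0

def train_club(zx, zy, iters=5000, lr=1e-3):
    """Fit q_psi by maximizing its log-likelihood on paired embeddings,
    mirroring the quoted AdamW / lr 0.001 / 5,000-iteration recipe."""
    est = CLUBEstimator(zx.shape[1], zy.shape[1])
    opt = torch.optim.AdamW(est.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = -est.log_likelihood(zx, zy)
        loss.backward()
        opt.step()
    return est
```

In practice the paper trains with mini-batches of 1,024 concatenated input/response embeddings; the full-batch loop above is a simplification for small toy data.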