EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Authors: Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our key contributions include: (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. We perform a comprehensive evaluation on both general and MLLM-specialized benchmarks, demonstrating that EMMA significantly improves cross-modal alignment, boosts task performance, and enhances the robustness of multi-modal LLMs.
Researcher Affiliation Academia Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami, Department of Electrical and Computer Engineering, New York University
Pseudocode No The paper describes the architecture and method of EMMA using textual explanations, a diagram in Figure 1, and mathematical formulas in Section 3.1, but it does not contain a dedicated pseudocode or algorithm block.
Open Source Code No The paper does not contain an explicit statement about releasing source code, a direct link to a code repository, or mention of code provided in supplementary materials for the methodology described.
Open Datasets Yes The evaluation of multi-modal LLMs relies on a mix of traditional academic benchmarks and newer ones tailored to instruction-following MLLMs. Established benchmarks like VQA-v2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019) gauge a model's ability to interpret visuals through open-ended, short-answer questions. ScienceQA (Lu et al., 2022b) tests zero-shot generalization in scientific question answering, while VizWiz (Gurari et al., 2018) offers real-world images... Additionally, newer benchmarks target instruction-following MLLMs. MathVista (Lu et al., 2023)... MMMU (Yue et al., 2024)... MUIRBENCH (Wang et al., 2024a)... MMBench (Liu et al., 2023a)... MMVP (Tong et al., 2024)... POPE (Li et al., 2023f)... AMBER (Wang et al., 2023b)... FOIL (Shekhar et al., 2017), MMRel (Nie et al., 2024), and R-Bench (Wu et al., 2024).
Dataset Splits Yes For all the analysis performed in Section 3, we use the same dataset as the baseline model, which is 558K and 665K samples for the pretraining and fine-tuning stages, respectively. In the Evaluation setting, we have preserved the same pretraining data but scaled the fine-tuning data to 1.2M samples, including LVIS-Instruct4V (Wang et al., 2023a), CLEVR (Johnson et al., 2017), VizWiz (Gurari et al., 2018), and ScienceQA (Lu et al., 2022a) training data.
Hardware Specification No The paper mentions support from "NYU IT High Performance Computing resources, services, and staff expertise" in the Acknowledgments section. However, this is a general statement about computing resources and does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments.
Software Dependencies No The paper mentions using "CLIP-ViT-L-14" as the base image and text encoder and "Vicuna v1.5 (Zheng et al., 2023)" as the base LLM. These are specific models/architectures, but the paper does not list software libraries (e.g., PyTorch, TensorFlow), programming languages (e.g., Python), or their specific version numbers required to replicate the experimental environment.
Experiment Setup Yes For training, we follow the same two-stage instruction fine-tuning process as LLaVA. In the pretraining stage, only the Visual Alignment and Projection modules are trained, while the language model remains frozen. During the fine-tuning stage, the LLM is unfrozen and fine-tuned along with the two aforementioned modules. The Visual Alignment module is initialized with the identity matrix for the visual tokens and all zeros for the instruction tokens to transfer all the visual tokens at the beginning of training. Moreover, the Visual Alignment module is designed to maintain the same number of visual tokens as the baseline model.
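Since the paper provides no pseudocode or released code, the initialization scheme quoted above can be illustrated with a minimal sketch: a linear mixing over the token axis whose weight starts as identity on the visual-token columns and zero on the instruction-token columns, so the module initially passes visual tokens through unchanged and keeps their count fixed. The class name, tensor shapes, and mixing mechanism are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualAlignmentSketch(nn.Module):
    """Hypothetical sketch of the Visual Alignment initialization described
    in the paper. Mixes visual and instruction tokens along the token axis;
    the weight is initialized so the module is an identity map over the
    visual tokens at the start of training."""

    def __init__(self, n_visual: int, n_instruction: int):
        super().__init__()
        # Output keeps the same number of visual tokens as the baseline model.
        self.mix = nn.Linear(n_visual + n_instruction, n_visual, bias=False)
        with torch.no_grad():
            w = torch.zeros(n_visual, n_visual + n_instruction)
            w[:, :n_visual] = torch.eye(n_visual)  # identity for visual tokens
            # Columns for instruction tokens remain all zeros.
            self.mix.weight.copy_(w)

    def forward(self, visual: torch.Tensor, instruction: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D), instruction: (B, Ni, D); mix along the token axis.
        tokens = torch.cat([visual, instruction], dim=1)          # (B, Nv+Ni, D)
        return self.mix(tokens.transpose(1, 2)).transpose(1, 2)   # (B, Nv, D)
```

At initialization the forward pass returns the visual tokens exactly, matching the stated goal of "transferring all the visual tokens at the beginning of training"; gradients on the instruction-token columns then let the module learn cross-modal mixing.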