EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Authors: Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our key contributions include: (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. We perform a comprehensive evaluation on both general and MLLM-specialized benchmarks, demonstrating that EMMA significantly improves cross-modal alignment, boosts task performance, and enhances the robustness of multi-modal LLMs. |
| Researcher Affiliation | Academia | Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami, Department of Electrical and Computer Engineering, New York University |
| Pseudocode | No | The paper describes the architecture and method of EMMA using textual explanations, a diagram in Figure 1, and mathematical formulas in Section 3.1, but it does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, a direct link to a code repository, or mention of code provided in supplementary materials for the methodology described. |
| Open Datasets | Yes | The evaluation of multi-modal LLMs relies on a mix of traditional academic benchmarks and newer ones tailored to instruction-following MLLMs. Established benchmarks like VQA-v2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019) gauge a model's ability to interpret visuals through open-ended, short-answer questions. ScienceQA (Lu et al., 2022b) tests zero-shot generalization in scientific question answering, while VizWiz (Gurari et al., 2018) offers real-world images... Additionally, newer benchmarks target instruction-following MLLMs. MathVista (Lu et al., 2023)... MMMU (Yue et al., 2024)... MUIRBENCH (Wang et al., 2024a)... MMBench (Liu et al., 2023a)... MMVP (Tong et al., 2024)... POPE (Li et al., 2023f)... AMBER (Wang et al., 2023b)... FOIL (Shekhar et al., 2017), MMRel (Nie et al., 2024), and R-Bench (Wu et al., 2024). |
| Dataset Splits | Yes | For all the analysis performed in Section 3, we use the same dataset as the baseline model, which is 558K and 665K samples for the pretraining and fine-tuning stages, respectively. In the evaluation setting, we have preserved the same pretraining data but scaled the fine-tuning data to 1.2M samples, including LVIS-Instruct4V (Wang et al., 2023a), CLEVR (Johnson et al., 2017), VizWiz (Gurari et al., 2018), and ScienceQA (Lu et al., 2022a) training data. |
| Hardware Specification | No | The paper mentions support from "NYU IT High Performance Computing resources, services, and staff expertise" in the Acknowledgments section. However, this is a general statement about computing resources and does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions using "CLIP-ViT-L-14" as the base image and text encoder and "Vicuna v1.5 (Zheng et al., 2023)" as the base LLM. These are specific models/architectures, but the paper does not list software libraries (e.g., PyTorch, TensorFlow), programming languages (e.g., Python), or their specific version numbers required to replicate the experimental environment. |
| Experiment Setup | Yes | For training, we follow the same two-stage instruction fine-tuning process as LLaVA. In the pretraining stage, only the Visual Alignment and Projection modules are trained, while the language model remains frozen. During the fine-tuning stage, the LLM is unfrozen and fine-tuned along with the two aforementioned modules. The Visual Alignment module is initialized with the identity matrix for the visual tokens and all zeros for the instruction tokens to transfer all the visual tokens at the beginning of training. Moreover, the Visual Alignment module is designed to maintain the same number of visual tokens as the baseline model. |
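The initialization described in the Experiment Setup row (identity weights for visual tokens, zeros for instruction tokens, same output token count as the baseline) can be sketched as follows. This is a minimal PyTorch illustration, not the paper's released code; the class name `VisualAlignment` and the choice of a linear mix over the token axis are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class VisualAlignment(nn.Module):
    """Hypothetical sketch of a Visual Alignment module with the
    initialization the paper describes: a linear mix over the token
    axis that takes N visual tokens plus M instruction tokens and
    emits N tokens (matching the baseline's visual-token count).
    """

    def __init__(self, num_visual: int, num_instr: int):
        super().__init__()
        # Mixing matrix of shape (num_visual, num_visual + num_instr).
        self.mix = nn.Linear(num_visual + num_instr, num_visual, bias=False)
        with torch.no_grad():
            # All zeros for the instruction-token columns ...
            self.mix.weight.zero_()
            # ... identity for the visual-token columns, so the module
            # passes the visual tokens through unchanged at the start
            # of training.
            self.mix.weight[:, :num_visual] = torch.eye(num_visual)

    def forward(self, visual: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, D), instr: (B, M, D)
        tokens = torch.cat([visual, instr], dim=1)          # (B, N+M, D)
        # Mix along the token dimension, keeping the embedding dim intact.
        mixed = self.mix(tokens.transpose(1, 2))            # (B, D, N)
        return mixed.transpose(1, 2)                        # (B, N, D)
```

With this initialization, the module's output equals the visual tokens exactly before any gradient update, which matches the stated goal of "transferring all the visual tokens at the beginning of training".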