Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
Authors: Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, among configurations below 3B parameters, Eve distinctly outperforms on language benchmarks and achieves a state-of-the-art result of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model. |
| Researcher Affiliation | Industry | Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes the methods, including equations for routing mechanisms and loss functions, and figures illustrating the framework, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it explicitly state that the code for the described methodology is being released or is available in supplementary materials. |
| Open Datasets | Yes | To minimize the temporal costs associated with individual trials, our study employs smaller-scale visual models in conjunction with the PanGu-π-1.5B (Wang et al. 2023) language model. Specifically, we utilize the ResNet-50 architecture (He et al. 2016) as the vision encoder, evaluating its performance across different datasets: ImageNet-1K (Russakovsky et al. 2015), ImageNet-22K (Ridnik et al. 2021), and LAION400M (Radford et al. 2021). The experimental results are summarized in Table 1. Our dataset has been carefully refined and expanded to create high-quality datasets that enhance cross-modal understanding. In the first two phases, we utilize the CC-595K and LLaVA-mixed-665 datasets to develop foundational multimodal capabilities. In the third phase, we curate a diverse collection of datasets across several domains, including General Multi-modality, Visual Question Answering (VQA), Optical Character Recognition (OCR), Image Captioning, and Knowledge-intensive tasks. This comprehensive ensemble consists of over 3.2 million samples, all meticulously designed to significantly enhance the model's versatility and performance across a wide range of modal scenarios. Detailed descriptions of the various datasets are provided in Appendix B. Following the rigorous evaluation protocols established in prior works such as (Chu et al. 2023, 2024), we employ a comprehensive suite of VLM benchmarks for multimodal assessment, comprising GQA (Hudson and Manning 2019), SQA (Lu et al. 2022), TextVQA (Singh et al. 2019), MME (Guo et al. 2023), MMBench (Liu et al. 2023d) and POPE (Li et al. 2023c). Consistent with the approach outlined in (Tang et al. 2024), we employ a diverse array of benchmarks to evaluate linguistic competencies. These include C-Eval (Huang et al. 2024), CMMLU (Li et al. 2023a), MMLU (Hendrycks et al. 2020), BoolQ (Clark et al. 2019), PIQA (Bisk et al. 2020), EPRSTM (Xu et al. 2021) and XSum (Narayan, Cohen, and Lapata 2018). |
| Dataset Splits | No | The paper mentions several datasets used for training and evaluation (e.g., ImageNet-1K, ImageNet-22K, LAION400M, CC-595K, LLaVA-mixed-665, GQA, SQA, TextVQA, MME, MMBench, POPE). It describes the process of curating datasets and their sizes, but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits) for these datasets. |
| Hardware Specification | Yes | GPU: 8x V100-32G for each of the three training stages. We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research. |
| Software Dependencies | No | The paper mentions "MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor" in the acknowledgments. While these are software/hardware platforms, no specific version numbers are provided for any of these components or other libraries/packages. |
| Experiment Setup | Yes | Table 2: Training hyperparameters of Eve. This table provides specific details for each stage of training, including: Learning rate (1e-3, 2e-5, 2e-5), LR schedule (cosine decay for all stages), Weight decay (0 for all stages), Optimizer (AdamW, b1=0.9, b2=0.95), Warmup ratio (0.03 for all stages), Global batch size (256, 128, 128), Training steps (2181, 5197, 25510), Epochs (1 for all stages), and Image resolution (384x384 for all stages). Additionally, it specifies FFN capacity as C=1.5. |
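The reported schedule (cosine decay with a 0.03 warmup ratio over a fixed step budget) can be sketched as a small framework-agnostic helper. This is an illustrative reconstruction, not the authors' code; the function name `lr_at` and the linear-warmup shape are assumptions, while the base learning rates, warmup ratio, and step counts come from Table 2.

```python
import math

# Hypothetical sketch of the LR schedule described in Table 2:
# linear warmup over the first 3% of steps, then cosine decay to zero.
# Stage step budgets from the paper: 2181, 5197, 25510.
def lr_at(step, total_steps, base_lr, warmup_ratio=0.03):
    """Learning rate at a given step under warmup + cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Stage 1: base lr 1e-3 over 2181 steps.
print(lr_at(0, 2181, 1e-3))     # 0.0 at the start of warmup
print(lr_at(65, 2181, 1e-3))    # peak lr right after warmup (65 = int(2181 * 0.03))
print(lr_at(2181, 2181, 1e-3))  # decays to ~0.0 at the last step
```

Stages 2 and 3 would reuse the same helper with `base_lr=2e-5` and their respective step counts.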