MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Authors: Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE. ... Experimental results demonstrate that MAGE performs exceptionally well across various benchmarks, achieving state-of-the-art efficiency and accuracy. |
| Researcher Affiliation | Collaboration | Shaojun E 1,3, Yuchen Yang 2, Jiaheng Wu 2, Yan Zhang 1, Tiejun Zhao 2, Ziyan Chen 1; 1 Global Tone Communication Technology Co., Ltd., Beijing, China; 2 Faculty of Computing, Harbin Institute of Technology, Harbin, China; 3 School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China. EMAIL, EMAIL; EMAIL |
| Pseudocode | No | The paper describes the architecture and method in prose and diagrams (Figures 2 and 3) but does not include any clearly labeled pseudocode or algorithm blocks. The methods are explained descriptively within the text, such as in Section 3, without presenting structured pseudocode. |
| Open Source Code | Yes | Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE |
| Open Datasets | Yes | All the training data is presented in Table 2. ... BLIP CapFilt [Li et al., 2022] ... LLaVA-150K [Liu et al., 2024b], ShareGPT [Chiang et al., 2023] ... DocVQA [Mathew et al., 2021], ChartQA [Masry et al., 2022], DVQA [Kafle et al., 2018] ... GeoQA+ [Chen et al., 2021] ... SynthDoG-EN [Kim et al., 2022] |
| Dataset Splits | Yes | For training in this phase, only 90% of this dataset was used. ... The training data combines the remaining 10% of the fine-tuning dataset from the second stage with the newly constructed self-awareness dataset... |
| Hardware Specification | Yes | Training was conducted using 48 A6000 GPUs. |
| Software Dependencies | Yes | We used two variants of the Vicuna v1.5 [Zheng et al., 2023] large language model (7B and 13B) and CLIP ViT-L/14 [Radford et al., 2021b] (336px) as the vision encoder |
| Experiment Setup | Yes | We used two variants of the Vicuna v1.5 [Zheng et al., 2023] large language model (7B and 13B) and CLIP ViT-L/14 [Radford et al., 2021b] (336px) as the vision encoder, with input images set to a resolution of 336 × 336 pixels. Visual features are derived from the penultimate layer of the CLIP model, excluding the CLS token. During training, we fine-tuned all model parameters (including CLIP and LLM) without using parameter-efficient techniques like LoRA [Hu et al., 2021]. Training was conducted using 48 A6000 GPUs. Our training process is divided into three phases... we propose combining cross-entropy loss with mean squared error (MSE) loss... Our proposed alignment strategy employs two primary loss functions: Image-Guided Text Generation (ITG) loss and Image-Text Distance Minimization (ITDM) loss. ... The loss functions can be expressed as follows: $\mathcal{L}_{\mathrm{ITG}} = -\frac{1}{L} \sum_{i=1}^{L} \log p_{\theta}(y_{i} \mid y_{<i}, \mathbf{X}_{\mathrm{img}}, \mathbf{X}_{\mathrm{text}})$, $\mathcal{L}_{\mathrm{ITDM}} = \frac{1}{N} \sum_{i=1}^{N} \| \mathbf{d}_{\mathrm{ian}} - \mathbf{d}_{\mathrm{llm}} \|_2^2$, $\mathcal{L} = \mathcal{L}_{\mathrm{ITG}} + \mathcal{L}_{\mathrm{ITDM}}$ |
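The combined objective quoted above (ITG cross-entropy plus ITDM mean-squared-error) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released implementation: the function name, tensor shapes, and the assumption that `d_ian` and `d_llm` are paired feature vectors of shape `(N, D)` are our own reading of the formulas.

```python
import torch
import torch.nn.functional as F

def mage_alignment_loss(logits, targets, d_ian, d_llm):
    """Sketch of the combined alignment objective from the paper.

    logits:  (B, L, V) next-token logits from the LLM
    targets: (B, L)    ground-truth token ids
    d_ian:   (N, D)    features from the intermediate alignment network (assumed shape)
    d_llm:   (N, D)    corresponding LLM-space features (assumed shape)
    """
    # ITG: token-level cross-entropy, averaged over the L tokens
    # (F.cross_entropy with reduction='mean' gives the (1/L) * sum form)
    itg = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1))
    # ITDM: squared L2 distance between paired features, averaged over N
    itdm = ((d_ian - d_llm) ** 2).sum(dim=-1).mean()
    # Total loss: unweighted sum, as in the quoted equation
    return itg + itdm
```

With identical `d_ian` and `d_llm` the ITDM term vanishes and the loss reduces to plain cross-entropy, which is a quick sanity check when reproducing the setup.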