MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Authors: Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE. ... Experimental results demonstrate that MAGE performs exceptionally well across various benchmarks, achieving state-of-the-art efficiency and accuracy. |
| Researcher Affiliation | Collaboration | Shaojun E 1,3, Yuchen Yang 2, Jiaheng Wu 2, Yan Zhang 1, Tiejun Zhao 2, Ziyan Chen 1; 1 Global Tone Communication Technology Co., Ltd., Beijing, China; 2 Faculty of Computing, Harbin Institute of Technology, Harbin, China; 3 School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China. EMAIL, EMAIL; EMAIL |
| Pseudocode | No | The paper describes the architecture and method in prose and diagrams (Figures 2 and 3) but does not include any clearly labeled pseudocode or algorithm blocks. The methods are explained descriptively within the text, such as in Section 3, without presenting structured pseudocode. |
| Open Source Code | Yes | Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE |
| Open Datasets | Yes | All the training data is presented in Table 2. ... BLIP CapFilt [Li et al., 2022] ... LLaVA-150K [Liu et al., 2024b], ShareGPT [Chiang et al., 2023] ... DocVQA [Mathew et al., 2021], ChartQA [Masry et al., 2022], DVQA [Kafle et al., 2018] ... GeoQA+ [Chen et al., 2021] ... SynthDoG-EN [Kim et al., 2022] |
| Dataset Splits | Yes | For training in this phase, only 90% of this dataset was used. ... The training data combines the remaining 10% of the fine-tuning dataset from the second stage with the newly constructed self-awareness dataset... |
| Hardware Specification | Yes | Training was conducted using 48 A6000 GPUs. |
| Software Dependencies | Yes | We used two variants of the Vicuna v1.5 [Zheng et al., 2023] large language model (7B and 13B) and CLIP ViT-L/14 [Radford et al., 2021b] (336px) as the vision encoder |
| Experiment Setup | Yes | We used two variants of the Vicuna v1.5 [Zheng et al., 2023] large language model (7B and 13B) and CLIP ViT-L/14 [Radford et al., 2021b] (336px) as the vision encoder, with input images set to a resolution of 336 × 336 pixels. Visual features are derived from the penultimate layer of the CLIP model, excluding the CLS token. During training, we fine-tuned all model parameters (including CLIP and LLM) without using parameter-efficient techniques like LoRA [Hu et al., 2021]. Training was conducted using 48 A6000 GPUs. Our training process is divided into three phases... we propose combining cross-entropy loss with mean squared error (MSE) loss... Our proposed alignment strategy employs two primary loss functions: Image-Guided Text Generation (ITG) loss and Image-Text Distance Minimization (ITDM) loss. ... The loss functions can be expressed as follows: $\mathcal{L}_{\mathrm{ITG}} = -\frac{1}{L} \sum_{i=1}^{L} \log p_{\theta}(y_{i} \mid y_{<i}, \mathbf{X}_{\mathrm{img}}, \mathbf{X}_{\mathrm{text}})$, $\mathcal{L}_{\mathrm{ITDM}} = \frac{1}{N} \sum_{i=1}^{N} \| \mathbf{d}_{\mathrm{ian}} - \mathbf{d}_{\mathrm{llm}} \|_2^2$, $\mathcal{L} = \mathcal{L}_{\mathrm{ITG}} + \mathcal{L}_{\mathrm{ITDM}}$ |
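The combined objective quoted above (ITG cross-entropy plus ITDM mean-squared-error) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released implementation: the function name, tensor shapes, and the assumption that `d_ian` and `d_llm` are paired feature vectors of shape `(N, D)` are our own reading of the formulas.

```python
import torch
import torch.nn.functional as F

def mage_alignment_loss(logits, targets, d_ian, d_llm):
    """Sketch of the combined alignment objective from the paper.

    logits:  (B, L, V) next-token logits from the LLM
    targets: (B, L)    ground-truth token ids
    d_ian:   (N, D)    features from the intermediate alignment network (assumed shape)
    d_llm:   (N, D)    corresponding LLM-space features (assumed shape)
    """
    # ITG: token-level cross-entropy, averaged over the L tokens
    # (F.cross_entropy with reduction='mean' gives the (1/L) * sum form)
    itg = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1))
    # ITDM: squared L2 distance between paired features, averaged over N
    itdm = ((d_ian - d_llm) ** 2).sum(dim=-1).mean()
    # Total loss: unweighted sum, as in the quoted equation
    return itg + itdm
```

With identical `d_ian` and `d_llm` the ITDM term vanishes and the loss reduces to plain cross-entropy, which is a quick sanity check when reproducing the setup.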