Enhancing Multimodal Model Robustness Under Missing Modalities via Memory-Driven Prompt Learning

Authors: Yihan Zhao, Wei Xi, Xiao Fu, Jizhong Zhao

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of the proposed model, achieving significant improvements across diverse missing-modality scenarios, with average performance increasing from 34.76% to 40.40% on MM-IMDb, 62.71% to 77.06% on UPMC Food-101, and 60.40% to 62.77% on Hateful Memes.
Researcher Affiliation | Academia | Yihan Zhao, Wei Xi, Xiao Fu and Jizhong Zhao, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/zhao-yh20/MemPrompt.
Open Datasets | Yes | We evaluated our proposed method on three widely used datasets: MM-IMDb [Arevalo et al., 2017], UPMC Food-101 [Wang et al., 2015], and Hateful Memes [Kiela et al., 2020].
Dataset Splits | Yes | For fair comparison, we adopted the same data processing methods as prior work on the three benchmarks [Lee et al., 2023]. The missing rate η% is set to 70% during both training and inference. Given the fixed missing case during training, we evaluate all models under various missing cases during inference.
Hardware Specification | Yes | All experiments were conducted on NVIDIA 4090Ti GPUs.
Software Dependencies | No | The paper does not provide specific software versions for libraries, frameworks, or operating systems used in the experiments.
Experiment Setup | Yes | The maximum text input lengths were set according to the specific datasets: 1024 for MM-IMDb, 512 for UPMC Food-101, and 128 for Hateful Memes. We employed the multimodal transformer ViLT [Kim et al., 2021] as the backbone model. The pretrained ViLT model was frozen, and only the learnable prompts and classification layer were fine-tuned on the target datasets. The lengths of the generative and shared prompts were set to 16 for MM-IMDb and UPMC Food-101, and 4 for Hateful Memes. The prompt memory was configured with N as 5 and memory size as 16. Generative and shared prompts were added only to the first 6 transformer blocks. All experiments used the AdamW optimizer, with an initial learning rate of 5×10⁻³ for MM-IMDb and UPMC Food-101, and 1×10⁻³ for Hateful Memes. The weight decay rate was set to 2×10⁻².
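The missing-modality protocol described under Dataset Splits (η% = 70% for both training and inference) can be sketched as follows. This is a minimal illustration of how such masks are typically drawn; the function name and the even split between dropped-image and dropped-text cases are assumptions, not taken from the paper or its released code.

```python
import random

def assign_missing_masks(num_samples, missing_rate=0.7, seed=0):
    """Assign each sample a modality-availability pattern.

    With probability `missing_rate`, one modality (image or text) is
    dropped; otherwise the sample keeps both. Returns a list of
    (has_image, has_text) tuples. The 50/50 choice of which modality
    to drop is an illustrative assumption.
    """
    rng = random.Random(seed)
    masks = []
    for _ in range(num_samples):
        if rng.random() < missing_rate:
            # Drop either the image or the text, never both.
            masks.append((False, True) if rng.random() < 0.5 else (True, False))
        else:
            masks.append((True, True))
    return masks

masks = assign_missing_masks(10000, missing_rate=0.7)
observed_rate = sum(1 for m in masks if not all(m)) / len(masks)
```

Because the evaluation varies the missing case at inference, the same helper can be re-invoked with different seeds or rates to produce each test-time scenario.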
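The hyperparameters listed under Experiment Setup can be collected into a per-dataset configuration. The sketch below is a hypothetical helper reflecting the reported values; the key names and `build_config` function are illustrative assumptions, not identifiers from the released MemPrompt code.

```python
# Per-dataset settings reported in the paper (names are illustrative).
SETUP = {
    "mm-imdb":       {"max_text_len": 1024, "prompt_len": 16, "lr": 5e-3},
    "food101":       {"max_text_len": 512,  "prompt_len": 16, "lr": 5e-3},
    "hateful-memes": {"max_text_len": 128,  "prompt_len": 4,  "lr": 1e-3},
}

# Settings shared across all three datasets.
COMMON = {
    "optimizer": "AdamW",
    "weight_decay": 2e-2,
    "memory_N": 5,            # prompt memory: N = 5
    "memory_size": 16,        # prompt memory: size = 16
    "prompted_blocks": 6,     # prompts only in the first 6 transformer blocks
    "freeze_backbone": True,  # pretrained ViLT weights stay fixed
}

def build_config(dataset):
    """Merge shared settings with the dataset-specific overrides."""
    cfg = dict(COMMON)
    cfg.update(SETUP[dataset])
    return cfg

cfg = build_config("mm-imdb")
```

Note that only the learnable prompts and the classification layer would receive gradients under this setup; the frozen backbone is excluded from the optimizer's parameter groups.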