Enhancing Multimodal Model Robustness Under Missing Modalities via Memory-Driven Prompt Learning
Authors: Yihan Zhao, Wei Xi, Xiao Fu, Jizhong Zhao
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of the proposed model, achieving significant improvements across diverse missing-modality scenarios, with average performance increasing from 34.76% to 40.40% on MM-IMDb, 62.71% to 77.06% on Food101, and 60.40% to 62.77% on Hateful Memes. |
| Researcher Affiliation | Academia | Yihan Zhao, Wei Xi, Xiao Fu and Jizhong Zhao, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/zhao-yh20/MemPrompt. |
| Open Datasets | Yes | We evaluated our proposed method on three widely used datasets: MM-IMDb [Arevalo et al., 2017], UPMC Food101 [Wang et al., 2015], and Hateful Memes [Kiela et al., 2020]. |
| Dataset Splits | Yes | For fair comparison, we adopted the same data processing methods as in prior work on the three benchmarks [Lee et al., 2023]. The missing rate η% is set to 70% during both training and inference. Given the fixed missing case during training, we evaluate all models under various missing cases during inference. |
| Hardware Specification | Yes | All experiments were conducted on NVIDIA 4090Ti GPUs. |
| Software Dependencies | No | The paper does not provide specific software versions for libraries, frameworks, or operating systems used in the experiments. |
| Experiment Setup | Yes | The maximum text input lengths were set according to the specific datasets: 1024 for MM-IMDb, 512 for UPMC Food-101, and 128 for Hateful Memes. We employed the multimodal transformer ViLT [Kim et al., 2021] as the backbone model. The pretrained ViLT model was frozen, and only the learnable prompts and classification layer were fine-tuned on the target datasets. The lengths of the generative and shared prompts were set to 16 for MM-IMDb and UPMC Food-101, and 4 for Hateful Memes. The prompt memory was configured with N as 5 and memory size as 16. Generative and shared prompts were added only to the first 6 transformer blocks. All experiments used the AdamW optimizer, with an initial learning rate of 5×10⁻³ for MM-IMDb and UPMC Food-101, and 1×10⁻³ for Hateful Memes. The weight decay rate was set to 2×10⁻². |
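
The experiment-setup details above can be collected into a single configuration sketch. This is a hypothetical summary for reference only; the dictionary keys and structure are assumptions and are not taken from the MemPrompt codebase.

```python
# Hypothetical configuration summarizing the hyperparameters reported in the
# paper's experiment setup. Key names are illustrative assumptions, not the
# authors' actual config schema.
CONFIG = {
    "backbone": "ViLT",                      # pretrained weights kept frozen
    "trainable": ["prompts", "classifier"],  # only prompts + head are fine-tuned
    "max_text_len": {"MM-IMDb": 1024, "UPMC Food-101": 512, "Hateful Memes": 128},
    "prompt_length": {"MM-IMDb": 16, "UPMC Food-101": 16, "Hateful Memes": 4},
    "prompt_memory": {"N": 5, "memory_size": 16},
    "prompted_blocks": list(range(6)),       # prompts added to the first 6 blocks
    "optimizer": "AdamW",
    "learning_rate": {"MM-IMDb": 5e-3, "UPMC Food-101": 5e-3, "Hateful Memes": 1e-3},
    "weight_decay": 2e-2,
    "missing_rate": 0.70,                    # η% = 70% at train and inference
}

# Quick sanity checks on internal consistency of the reported values.
assert CONFIG["prompt_length"]["Hateful Memes"] < CONFIG["prompt_length"]["MM-IMDb"]
assert CONFIG["learning_rate"]["Hateful Memes"] < CONFIG["learning_rate"]["MM-IMDb"]
```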