Enhancing Multimodal Model Robustness Under Missing Modalities via Memory-Driven Prompt Learning

Authors: Yihan Zhao, Wei Xi, Xiao Fu, Jizhong Zhao

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of the proposed model, achieving significant improvements across diverse missing-modality scenarios, with average performance increasing from 34.76% to 40.40% on MM-IMDb, 62.71% to 77.06% on UPMC Food-101, and 60.40% to 62.77% on Hateful Memes.
Researcher Affiliation | Academia | Yihan Zhao, Wei Xi, Xiao Fu and Jizhong Zhao, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/zhao-yh20/MemPrompt.
Open Datasets | Yes | We evaluated our proposed method on three widely used datasets: MM-IMDb [Arevalo et al., 2017], UPMC Food-101 [Wang et al., 2015], and Hateful Memes [Kiela et al., 2020].
Dataset Splits | Yes | For fair comparison, we adopted the same data processing methods as prior work on the three benchmarks [Lee et al., 2023]. The missing rate η% is set to 70% during both training and inference. Given the fixed missing case during training, we evaluate all models under various missing cases during inference.
Hardware Specification | Yes | All experiments were conducted on NVIDIA 4090Ti GPUs.
Software Dependencies | No | The paper does not provide specific software versions for libraries, frameworks, or operating systems used in the experiments.
Experiment Setup | Yes | The maximum text input lengths were set according to the specific datasets: 1024 for MM-IMDb, 512 for UPMC Food-101, and 128 for Hateful Memes. We employed the multimodal transformer ViLT [Kim et al., 2021] as the backbone model. The pretrained ViLT model was frozen, and only the learnable prompts and classification layer were fine-tuned on the target datasets. The lengths of the generative and shared prompts were set to 16 for MM-IMDb and UPMC Food-101, and 4 for Hateful Memes. The prompt memory was configured with N as 5 and memory size as 16. Generative and shared prompts were added only to the first 6 transformer blocks. All experiments used the AdamW optimizer, with an initial learning rate of 5×10⁻³ for MM-IMDb and UPMC Food-101, and 1×10⁻³ for Hateful Memes. The weight decay rate was set to 2×10⁻².
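The missing-modality protocol described under Dataset Splits (η% = 70% for both training and inference) can be sketched as follows. This is a minimal illustration of how such masks are typically drawn; the function name and the even split between dropped-image and dropped-text cases are assumptions, not taken from the paper or its released code.

```python
import random

def assign_missing_masks(num_samples, missing_rate=0.7, seed=0):
    """Assign each sample a modality-availability pattern.

    With probability `missing_rate`, one modality (image or text) is
    dropped; otherwise the sample keeps both. Returns a list of
    (has_image, has_text) tuples. The 50/50 choice of which modality
    to drop is an illustrative assumption.
    """
    rng = random.Random(seed)
    masks = []
    for _ in range(num_samples):
        if rng.random() < missing_rate:
            # Drop either the image or the text, never both.
            masks.append((False, True) if rng.random() < 0.5 else (True, False))
        else:
            masks.append((True, True))
    return masks

masks = assign_missing_masks(10000, missing_rate=0.7)
observed_rate = sum(1 for m in masks if not all(m)) / len(masks)
```

Because the evaluation varies the missing case at inference, the same helper can be re-invoked with different seeds or rates to produce each test-time scenario.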
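The hyperparameters listed under Experiment Setup can be collected into a per-dataset configuration. The sketch below is a hypothetical helper reflecting the reported values; the key names and `build_config` function are illustrative assumptions, not identifiers from the released MemPrompt code.

```python
# Per-dataset settings reported in the paper (names are illustrative).
SETUP = {
    "mm-imdb":       {"max_text_len": 1024, "prompt_len": 16, "lr": 5e-3},
    "food101":       {"max_text_len": 512,  "prompt_len": 16, "lr": 5e-3},
    "hateful-memes": {"max_text_len": 128,  "prompt_len": 4,  "lr": 1e-3},
}

# Settings shared across all three datasets.
COMMON = {
    "optimizer": "AdamW",
    "weight_decay": 2e-2,
    "memory_N": 5,            # prompt memory: N = 5
    "memory_size": 16,        # prompt memory: size = 16
    "prompted_blocks": 6,     # prompts only in the first 6 transformer blocks
    "freeze_backbone": True,  # pretrained ViLT weights stay fixed
}

def build_config(dataset):
    """Merge shared settings with the dataset-specific overrides."""
    cfg = dict(COMMON)
    cfg.update(SETUP[dataset])
    return cfg

cfg = build_config("mm-imdb")
```

Note that only the learnable prompts and the classification layer would receive gradients under this setup; the frozen backbone is excluded from the optimizer's parameter groups.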