Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
Authors: Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. |
| Researcher Affiliation | Academia | 1University of Electronic Science and Technology of China, Chengdu, Sichuan, China; 2Kash Institute of Electronics and Information Industry, Kashgar, Xinjiang, China |
| Pseudocode | No | The paper describes methods in prose and mathematical formulations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT. |
| Open Datasets | Yes | (1) MM-IMDb (Arevalo et al. 2017), primarily used for movie genre classification involving both image and text modalities. (2) Food101 (Wang et al. 2015), which focuses on image classification that incorporates both image and text. (3) Hate Memes (Kiela et al. 2020), aimed at identifying hate speech in memes using image and text modalities. |
| Dataset Splits | Yes | Detailed statistics of datasets are presented in Table 2; the dataset splits are consistent with the original paper. Table 2 (statistics of three multimodal downstream datasets): MM-IMDb — 25,959 images, 25,959 texts, 15,552 train / 2,608 val / 7,799 test; Hate Memes — 10,000 images, 10,000 texts, 8,500 train / 500 val / 1,500 test; Food101 — 90,688 images, 90,688 texts, 67,972 train / 22,716 test. |
| Hardware Specification | Yes | All experiments are conducted with an NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using pre-trained ViLT and the AdamW optimizer but does not specify version numbers for any software libraries or programming languages used. |
| Experiment Setup | Yes | The length l of context-aware prompts is set to 2, the number of retrieved instances K is chosen from {1, 3, 5, 7, 9}, and the prompt insertion layer b is set to 2. We utilize the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 1 × 10⁻³ for a total of 20 epochs to optimize the parameters. |
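The hyperparameters reported in the Experiment Setup row can be collected into a configuration sketch. This is a minimal illustration, not taken from the authors' released code; all names (`PROMPT_LENGTH`, `RETRIEVAL_K_CANDIDATES`, etc.) are assumptions for readability.

```python
# Hyperparameter sketch for RAGPT, as reported in the paper (names are
# illustrative assumptions, not identifiers from the official repository).
PROMPT_LENGTH = 2                       # length l of context-aware prompts
RETRIEVAL_K_CANDIDATES = [1, 3, 5, 7, 9]  # values searched for K retrieved instances
PROMPT_INSERT_LAYER = 2                 # transformer layer b where prompts are inserted
OPTIMIZER = "AdamW"                     # Loshchilov and Hutter 2017
LEARNING_RATE = 1e-3
NUM_EPOCHS = 20

def training_config(k: int) -> dict:
    """Assemble one run configuration for a chosen retrieval depth K."""
    assert k in RETRIEVAL_K_CANDIDATES, "K must come from the searched grid"
    return {
        "prompt_length": PROMPT_LENGTH,
        "num_retrieved": k,
        "prompt_insert_layer": PROMPT_INSERT_LAYER,
        "optimizer": OPTIMIZER,
        "lr": LEARNING_RATE,
        "epochs": NUM_EPOCHS,
    }
```

A grid search over K would then call `training_config(k)` once per candidate value and train for 20 epochs each time.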