MP-Nav: Enhancing Data Poisoning Attacks against Multimodal Learning

Authors: Jingfeng Zhang, Prashanth Krishnamurthy, Naman Patel, Anthony Tzes, Farshad Khorrami

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments corroborate that MP-Nav can significantly improve the efficacy of state-of-the-art data poisoning attacks such as AtoB and Shadowcast in multimodal tasks while maintaining model utility across diverse datasets. Notably, this study underscores the vulnerabilities of multimodal models and calls for counterpart defenses.
Researcher Affiliation | Academia | New York University. Correspondence to: Farshad Khorrami <EMAIL>.
Pseudocode | Yes | Algorithm 1: Meta Algorithm of MP-Nav. Input: open-sourced dataset D = {(x_i, y_i)}_{i=1}^{N}, where x_i denotes images and y_i denotes text captions; open-sourced multimodal encoders ε_img(·) and ε_txt(·); and the attacker budget η. Output: dataset D_p containing the poisoned instances.
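The meta algorithm's interface can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: the excerpt specifies only the inputs (dataset, encoders, budget η) and output, so the selection rule used here (ranking pairs by image-text embedding similarity) and the function names `mp_nav_select` and `cosine_sim` are assumptions.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mp_nav_select(dataset, enc_img, enc_txt, eta):
    """Return the eta-fraction of (image, caption) pairs chosen as
    poison candidates.

    Hypothetical selection rule: score each pair by the similarity of
    its image and text embeddings, then take the top eta * N pairs
    allowed by the attacker budget.
    """
    scores = [(i, cosine_sim(enc_img(x), enc_txt(y)))
              for i, (x, y) in enumerate(dataset)]
    scores.sort(key=lambda t: t[1], reverse=True)
    budget = max(1, int(eta * len(dataset)))
    return [dataset[i] for i, _ in scores[:budget]]
```

In practice `enc_img` and `enc_txt` would be the open-sourced multimodal encoders (e.g., CLIP's image and text towers); the downstream attack (AtoB or Shadowcast) would then inject its poisoning noise into the selected instances to form D_p.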
Open Source Code | No | The paper does not provide explicit links to source code repositories or clear statements about the release of their own implementation code. It mentions using 'open-sourced model (e.g., CLIP)' and 'open-sourced models from Hugging Face', but these refer to third-party tools or models used in their work, not their specific code for MP-Nav.
Open Datasets | Yes | For the TIR task, we followed the previous study of the AtoB attack (Yang et al., 2023) and chose the COCO dataset (Lin et al., 2014) and Flickr-PASCAL dataset (Young et al., 2014; Rashtchian et al., 2010). Following Xu et al. (2024), we used the clean MiniGPT-4 dataset (Zhu et al., 2024), which consists of 3,500 detailed image-description pairs for visual instruction tuning. In addition, to conduct control experiments, we leverage the open-sourced Food101 dataset (Bossard et al., 2014), which consists of 101 food categories with 750 training and 250 test images per category, for a total of 101k images.
Dataset Splits | Yes | The COCO dataset has 80 object categories and contains 5 captions per image. For each image, we randomly selected one of the object categories as its label (a.k.a. concept). To make the COCO training set balanced, two concepts, toaster (with 28 images) and hair drier (with 53 images), were removed. We used 119,387 images with their corresponding captions for training and the remaining 3,900 images for evaluation of both model utility and poisoning efficacy. We split the PASCAL dataset in half, with 500 images used for injection of poisoning noises and 500 images used for evaluation of poisoning efficacy. Thus, we had 500 images (from the PASCAL dataset) plus 29,000 images (from the Flickr dataset) used for training, and 1,000 images (from the Flickr dataset) for evaluation of model utility. The Food101 dataset (Bossard et al., 2014) consists of 101 food categories with 750 training and 250 test images per category, for a total of 101k images.
Hardware Specification | No | The paper mentions using specific models like "CLIP ViT-B/32" and "LLaVA-1.5" but does not specify the hardware (e.g., GPU models, CPU types, or cloud computing instances) on which these models were trained or experiments were run.
Software Dependencies | No | The paper mentions using the "AdamW optimizer" and specific models like "CLIP ViT-B/32" and "LLaVA-1.5", but it does not specify version numbers for any underlying software libraries or dependencies (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | We fine-tuned for 10 epochs with a batch size of 128 using the AdamW optimizer, an initial learning rate of 0.2, a cosine scheduler with decay rate 1.0, and weight decay of 0.2. For each dataset, we trained the LLaVA-1.5 model for 1 epoch using the AdamW optimizer with learning rate 2e-4.
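The reported hyperparameters can be collected into a plain configuration sketch. This is illustrative only: the `config` dict names are hypothetical, and the "decay rate 1.0" is interpreted here as a cosine schedule annealing from the initial learning rate down to zero over the full run, which is an assumption about the paper's scheduler.

```python
import math

# Hyperparameters from the reported CLIP fine-tuning setup
# (dict keys are illustrative names, not from the paper).
config = {
    "epochs": 10,
    "batch_size": 128,
    "optimizer": "AdamW",
    "initial_lr": 0.2,
    "weight_decay": 0.2,
}


def cosine_lr(initial_lr, epoch, total_epochs):
    """Cosine-annealed learning rate.

    With decay rate 1.0 (assumed meaning), the rate anneals from
    initial_lr at epoch 0 down to 0 at the final epoch.
    """
    return 0.5 * initial_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```

Under this reading, the learning rate is 0.2 at the start, 0.1 at the midpoint (epoch 5), and 0 at epoch 10; the separate LLaVA-1.5 instruction-tuning run uses a fixed 2e-4 with AdamW for a single epoch.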