Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift
Authors: Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li
DMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI for Multi Modal Impact score and MOR for Missing Object Rate) for proper evaluations of multimodal models. |
| Researcher Affiliation | Collaboration | Jielin Qiu (1), Yi Zhu (2), Xingjian Shi (2), Florian Wenzel (3), Zhiqiang Tang (4), Ding Zhao (1), Bo Li (4,5), Mu Li (2); (1) Carnegie Mellon University, (2) Boson AI, (3) Mirelo AI, (4) Amazon Web Services, (5) University of Chicago |
| Pseudocode | No | The paper describes methods and experiments but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | More details can be found on the project webpage: https://MMRobustness.github.io. (Also confirmed by the ML reproducibility checklist: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Our code can be found on the project webpage: https://MMRobustness.github.io') |
| Open Datasets | Yes | We build multimodal robustness evaluation benchmarks by leveraging existing datasets and tasks, e.g., image-text retrieval (Flickr30K, COCO), visual reasoning (NLVR2), visual entailment (SNLI-VE), image captioning (COCO), and text-to-image generation (COCO)... For each task, we perturb the corresponding datasets, i.e., Flickr30K (Young et al., 2014), COCO (Lin et al., 2014), NLVR2 (Suhr et al., 2017), and SNLI-VE (Xie et al., 2018, 2019b). |
| Dataset Splits | Yes | For image-text retrieval, the Flickr30K dataset contains 1,000 images, and each of them has 5 corresponding captions, while the COCO dataset contains 5,000 images, and each of them also has 5 corresponding captions. We report the RSUM score averaged on five perturbation levels under each perturbation method to reveal the overall performance... For visual reasoning, the NLVR2 dev set contains 2,018 unique sentences and 6,982 samples, while the test-P set contains 1,995 unique sentences and 6,967 samples... For visual entailment, the SNLI-VE val set contains 1,000 images and 6,576 sentences, while the test set contains 1,000 images and 6,592 sentences. |
| Hardware Specification | No | The paper describes experimental methodologies and results, but it does not provide specific hardware details such as GPU models, CPU specifications, or cloud computing resources used for the experiments. |
| Software Dependencies | No | The paper mentions using pretrained models like CLIP, BLIP, Stable Diffusion, etc., and tools like paraphrase-mpnet-base-v2 (Reimers and Gurevych, 2019) and GLIP (Li et al., 2021c) but does not provide specific version numbers for these software dependencies or programming languages used. |
| Experiment Setup | Yes | To evaluate the robustness of large pretrained multimodal models under distribution shift, we start by building several evaluation benchmark datasets via perturbing the original image-text pairs on either the image side or text side... We use these perturbations to simulate distribution shifts of various intensities... We include Stylize-Image Net for its effectiveness in perturbing the original image by breaking its shape and texture... The perturbations are grouped into five categories: noise, blur, weather, digital, and stylize. Specifically, we use 17 image perturbation techniques... each category has five levels of severity, resulting in 85 perturbation methods in total... we design 16 text perturbation techniques grouped into three categories: character-level, word-level, and sentence-level... For strategies within the character-level and word-level perturbations, we apply 5 severity levels similar to image perturbations, while for strategies within the sentence-level perturbations, there is only one severity level. This leads to a total of 60 text perturbation methods. |
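The experiment setup above applies character- and word-level text perturbations at five severity levels. As a minimal sketch of how one such perturbation could be parameterized by severity, the hypothetical `char_perturb` below swaps adjacent characters, with the fraction of swaps growing with the severity level; the actual benchmark uses 16 distinct text perturbation techniques that are not reproduced here.

```python
import random

def char_perturb(text: str, severity: int, seed: int = 0) -> str:
    """Swap random adjacent character pairs in `text`.

    Hypothetical stand-in for a character-level perturbation:
    severity 1..5 maps to a growing fraction of swap operations.
    """
    rng = random.Random(seed)
    chars = list(text)
    if len(chars) < 2:
        return text
    # At severity s, swap roughly 5% * s of the character positions (at least one).
    n_swaps = max(1, int(len(chars) * 0.05 * severity))
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Because a swap only reorders characters, the perturbed caption keeps the same length and character multiset as the original, which makes severity-level comparisons straightforward.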
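The paper's MMI (MultiModal Impact) score summarizes robustness as a performance drop under perturbation. The sketch below assumes MMI takes the form of a relative drop between the clean score and the score averaged over all perturbation methods and severity levels; consult the paper for the exact definition, as this form is an assumption.

```python
def multimodal_impact(clean_score: float, perturbed_scores: list[float]) -> float:
    """Assumed MMI-style score: relative performance drop, averaged over
    all perturbation methods/severities. 0 means fully robust; larger
    values mean a more severe distribution shift."""
    avg_perturbed = sum(perturbed_scores) / len(perturbed_scores)
    return (clean_score - avg_perturbed) / clean_score
```

For example, a model scoring 100 clean and averaging 85 across perturbed variants would have an impact of 0.15 under this assumed formulation.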