CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features
Authors: Po-han Li, Sandeep Chinchali, Ufuk Topcu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that CSA outperforms CLIP while requiring 50,000 fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. ... We tested CSA on tasks such as image classification, cross-modal retrieval, and misinformative caption detection (Section 6). |
| Researcher Affiliation | Academia | Po-han Li, Sandeep Chinchali & Ufuk Topcu The University of Texas at Austin EMAIL |
| Pseudocode | No | The paper describes its methodology using mathematical formulations and prose, such as the formulation of CCA in Section 4.1. There are no explicit pseudocode blocks or algorithm listings in the document. |
| Open Source Code | Yes | Lastly, we put the core code of CSA in the supplementary details. The code includes dataloaders, execution code, and links to download all the datasets and models used. We will provide an example configuration file after publication, as it contains non-anonymous directory paths. |
| Open Datasets | Yes | CSA outperforms CLIP while requiring 50,000 fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. ... We examined CSA and the other baselines on a cross-modal retrieval dataset, Flickr30k (Young et al., 2014)... ...Detecting Misinformative Captions. ... COSMOS dataset (Aneja et al., 2023). |
| Dataset Splits | Yes | We split the multimodal data into training and test sets, just like in standard machine learning. ... This dataset includes only 800 training images and 100 test images of plants... ... We trained ASIF and CSA on the Flickr validation set, which includes 145,000 images and 5 captions for each image. We then validated the models on a test set of 5,000 images and 25,000 captions. ... The train set contains 5,000 cross-modal instances, and the test set contains 6,000 instances. |
| Hardware Specification | Yes | We run all inferences of LLAVA and encoder models on an NVIDIA RTX A5000 GPU, and solving Equation 2 with 35,000 multimodal feature vectors on a 64 core Xeon Gold 6226R CPU machine takes less than 10 minutes. |
| Software Dependencies | No | The paper mentions software like "CCA-Zoo (Chapman & Wang, 2021)", "NumPy", "CuPy (Okuta et al., 2017)", and "tsfresh (Christ et al., 2018)" but does not provide specific version numbers for these tools as required. |
| Experiment Setup | Yes | Finally, for any multimodal downstream task, CSA uses Equation 4 to evaluate the similarity between any pair of multimodal data. ... For all the hyperparameters and detailed settings of the experiments, please refer to Appendix Section A. ... All methods input an image and all possible captions "This is an image of {label}" and select the most similar caption as the predicted label. ... We set this distance threshold per Shubodh et al. (2024). |