CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features
Authors: Po-han Li, Sandeep Chinchali, Ufuk Topcu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that CSA outperforms CLIP while requiring 50,000 fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. ... We tested CSA on tasks such as image classification, cross-modal retrieval, and misinformative caption detection (Section 6). |
| Researcher Affiliation | Academia | Po-han Li, Sandeep Chinchali & Ufuk Topcu The University of Texas at Austin EMAIL |
| Pseudocode | No | The paper describes its methodology using mathematical formulations and prose, such as the formulation of CCA in Section 4.1. There are no explicit pseudocode blocks or algorithm listings in the document. |
| Open Source Code | Yes | Lastly, we put the core code of CSA in the supplementary details. The code includes dataloaders, execution code, and links to download all the datasets and models used. We will provide an example configuration file after publication, as it contains non-anonymous directory paths. |
| Open Datasets | Yes | CSA outperforms CLIP while requiring 50,000 fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. ... We examined CSA and the other baselines on a cross-modal retrieval dataset, Flickr30k (Young et al., 2014)... ...Detecting Misinformative Captions. ... COSMOS dataset (Aneja et al., 2023). |
| Dataset Splits | Yes | We split the multimodal data into training and test sets, just like in standard machine learning. ... This dataset includes only 800 training images and 100 test images of plants... ... We trained ASIF and CSA on the Flickr validation set, which includes 145,000 images and 5 captions for each image. We then validated the models on a test set of 5,000 images and 25,000 captions. ... The train set contains 5,000 cross-modal instances, and the test set contains 6,000 instances. |
| Hardware Specification | Yes | We run all inferences of LLAVA and encoder models on an NVIDIA RTX A5000 GPU, and solving Equation 2 with 35,000 multimodal feature vectors on a 64 core Xeon Gold 6226R CPU machine takes less than 10 minutes. |
| Software Dependencies | No | The paper mentions software like "CCA-Zoo (Chapman & Wang, 2021)", "NumPy", "CuPy (Okuta et al., 2017)", and "tsfresh (Christ et al., 2018)" but does not provide specific version numbers for these tools as required. |
| Experiment Setup | Yes | Finally, for any multimodal downstream task, CSA uses Equation 4 to evaluate the similarity between any pair of multimodal data. ... For all the hyperparameters and detailed settings of the experiments, please refer to Appendix Section A. ... All methods input an image and all possible captions "This is an image of {label}" and select the most similar caption as the predicted label. ... We set this distance threshold per Shubodh et al. (2024). |