BiMAC: Bidirectional Multimodal Alignment in Contrastive Learning

Authors: Masoumeh Zareapoor, Pourya Shamsolmoali, Yue Lu

AAAI 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Extensive experiments demonstrate the superiority of our model over state-of-the-art models in various vision-language tasks. ... To ensure fairness and consistency in our experiments, we adopt the CoCa framework as the base architecture..."

Researcher Affiliation | Academia | "1 Shanghai Jiao Tong University, Shanghai, China; 2 East China Normal University, Shanghai, China; 3 University of York, York, United Kingdom"

Pseudocode | Yes | "Algorithm 1: BiMAC: Bidirectional Multimodal Alignment"

Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.

Open Datasets | Yes | "The pretraining process uses two versions of conceptual captions datasets, i.e., CC3M (Sharma et al. 2018) and CC12M (Changpinyo et al. 2021) (together denoted as CC15M), which, after filtering out invalid URLs, consists of 13M image-text pairs. ... The models are trained to find the most relevant sample corresponding to a specific input across different modalities, using the Flickr30K and MSCOCO datasets. ... We fine-tune the models on the COCO Captions dataset (Lin et al. 2014) and evaluate their performance using metrics: BLEU@4, METEOR, CIDEr, and SPICE. We further evaluate the models on the NoCaps dataset (Agrawal et al. 2019) in a zero-shot setting..."

Dataset Splits | Yes | "The models are trained to find the most relevant sample corresponding to a specific input across different modalities, using the Flickr30K (1K test set) and MSCOCO (5K test set) datasets. ... Following (Wang et al. 2023b), we fine-tune the models on the COCO Captions dataset (Lin et al. 2014) and evaluate their performance using metrics: BLEU@4, METEOR, CIDEr, and SPICE. As shown in Table 2, BiMAC consistently outperforms all baselines, achieving improvements over CoCa ranging from 4% to 7%. We further evaluate the models on the NoCaps dataset (Agrawal et al. 2019) in a zero-shot setting, without any additional fine-tuning. ... We report BLEU@4, METEOR, CIDEr, and SPICE scores on the Karpathy test split."

Hardware Specification | Yes | "The model was trained over 30 epochs using 8 RTX 3090 GPUs, with a batch size of 1024 and an image resolution of 256 × 256."

Software Dependencies | No | The paper mentions "The Adam optimizer" but does not name any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow), nor the specific Adam implementation used.

Experiment Setup | Yes | "The Adam optimizer with an initial learning rate of 2 × 10−4 combined with a cosine decay schedule. The model was trained over 30 epochs using 8 RTX 3090 GPUs, with a batch size of 1024 and an image resolution of 256 × 256. The coefficients were set to λCG = 1 in line with CoCa, λGIT = 0.7 and the temperature τ = 0.5."
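The reported experiment setup can be sketched in plain Python. This is a minimal illustration, not the authors' code: the paper states only an initial learning rate of 2 × 10−4, a cosine decay schedule over 30 epochs, and the loss coefficients λCG = 1, λGIT = 0.7; the decay-to-zero form and the additive loss combination below are assumptions.

```python
import math

# Values reported in the paper's experiment setup.
INITIAL_LR = 2e-4   # initial Adam learning rate
EPOCHS = 30         # total training epochs
LAMBDA_CG = 1.0     # contrastive/generative coefficient (in line with CoCa)
LAMBDA_GIT = 0.7    # coefficient for the second objective

def cosine_lr(epoch: int) -> float:
    """Learning rate at a given epoch under plain cosine decay to zero.

    The exact schedule (warmup, floor value) is not specified in the
    paper; this assumes decay from INITIAL_LR to 0 over EPOCHS.
    """
    return INITIAL_LR * 0.5 * (1.0 + math.cos(math.pi * epoch / EPOCHS))

def total_loss(loss_cg: float, loss_git: float) -> float:
    """Weighted sum of the two training objectives (assumed additive)."""
    return LAMBDA_CG * loss_cg + LAMBDA_GIT * loss_git
```

With this sketch, the learning rate starts at 2e-4 at epoch 0 and decays smoothly to 0 at epoch 30; any per-step (rather than per-epoch) scheduling or warmup phase would be a further assumption.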