Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Spurious Correlations in Zero-Shot Multimodal Models

Authors: Shenyu Lu, Junyi Chai, Xiaoqian Wang

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We conducted experiments on benchmark datasets, which have shown significant improvements in worst-group accuracy. Additionally, our visualizations of VLMs further demonstrate the effectiveness of this intervention." |
| Researcher Affiliation | Academia | "Shenyu Lu, Junyi Chai & Xiaoqian Wang, Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906, USA" |
| Pseudocode | Yes | "We summarize our method in Algorithm 1." |
| Open Source Code | Yes | "Code at https://github.com/lu876/TIE" |
| Open Datasets | Yes | "Datasets. We study five well-established benchmark datasets for spurious correlation research: Waterbirds (Koh et al., 2021; Sagawa et al., 2019), CelebA (Liu et al., 2015), ISIC (Codella et al., 2019), COVID-19 (Cohen et al., 2020), FMoW (Christie et al., 2018)." |
| Dataset Splits | Yes | "Following the protocol established by robust learning studies (Sagawa et al., 2019; Adila et al., 2024), we report three metrics: worst-group accuracy (WG), average accuracy (Avg), and the gap between these two metrics (Gap)." |
| Hardware Specification | Yes | "We conducted all experiments on an Nvidia RTX 3090 GPU with 24 GB of memory, using frozen CLIP models across various datasets." |
| Software Dependencies | No | The paper mentions "Model construction and pre-trained weights are sourced from OpenCLIP (Ilharco et al., 2021)" and "We utilize GPT-4 (OpenAI, 2023)", but does not provide specific version numbers for these or for other key software libraries (e.g., PyTorch, NumPy, scikit-learn) that would be necessary for reproduction. |
| Experiment Setup | Yes | "The model was trained using an SGD optimizer with a learning rate of 10^-4, a weight decay of 10^-3, and a momentum of 0.9, over 200 epochs." |
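The three metrics quoted in the Dataset Splits row (WG, Avg, Gap) can be sketched as a small helper. This is an illustrative sketch, not the authors' implementation: the function name and inputs are hypothetical, and Avg is computed here as the sample-weighted overall accuracy (some works instead average per-group accuracies).

```python
def group_metrics(correct, total):
    """Compute worst-group accuracy (WG), average accuracy (Avg),
    and the gap between them from per-group counts.

    correct: number of correct predictions in each group.
    total:   number of examples in each group.
    Illustrative sketch only; not the paper's code.
    """
    group_acc = [c / t for c, t in zip(correct, total)]
    wg = min(group_acc)              # worst-group accuracy (WG)
    avg = sum(correct) / sum(total)  # sample-weighted average accuracy (Avg)
    return wg, avg, avg - wg         # Gap = Avg - WG
```

For example, with two groups of sizes 100 and 50 scoring 90 and 40 correct, WG is 0.8 and Avg is 130/150; a small Gap indicates robustness to the spurious attribute that defines the groups.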