Learning Invariant Causal Mechanism from Vision-Language Models

Authors: Zeen Song, Siyu Zhao, Xingyu Zhang, Jiangmeng Li, Changwen Zheng, Wenwen Qiang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications. The source code is available at https://github.com/ZeenSong/CLIP-ICM. ... To evaluate the CLIP-ICM framework in OOD scenarios, we conduct experiments on the DomainBed benchmark (Gulrajani & Lopez-Paz)."
Researcher Affiliation | Academia | "1Institute of Software, Chinese Academy of Sciences, Beijing, China; 2University of the Chinese Academy of Sciences. Correspondence to: Wenwen Qiang <EMAIL>."
Pseudocode | Yes | "H.3. Pseudo Code: The pseudocode of CLIP-ICM is illustrated in Algorithm 1. (Algorithm 1: CLIP-ICM)"
Open Source Code | Yes | "The source code is available at https://github.com/ZeenSong/CLIP-ICM."
Open Datasets | Yes | "We conduct an experiment on the Terra Incognita dataset (Beery et al., 2018)... We evaluate the proposed CLIP-ICM on OOD generalization datasets, including DomainBed (Gulrajani & Lopez-Paz) and variants of ImageNet (Recht et al., 2019; Hendrycks et al., 2021a;b; Wang et al., 2019). ... We use five datasets from DomainBed: PACS (Li et al., 2017), VLCS (Fang et al., 2013), OfficeHome (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), and DomainNet (Peng et al., 2019)."
Dataset Splits | Yes | "Specifically, for a given target domain, a linear classifier is trained on frozen CLIP image embeddings from all other domains and tested on the held-out domain to assess how well the model handles shifts in distribution. ... For domain shift, we use a leave-one-out protocol, training on all domains except the target and testing on the target domain (Table 2). For the combined setting, we split data into base and new classes, train on base classes in training domains, and evaluate both base and new classes in the target domain (Table 3). ... For all datasets, we first pool the raw training, validation, and testing images together. For each random seed, we then instantiate random training, validation, and testing splits."
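The leave-one-out protocol quoted above (train on all domains except the target, test on the held-out domain) can be sketched as follows. This is an illustrative helper, not code from the CLIP-ICM repository; the domain names and the `(embedding, label)` sample layout are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

Domain = str
Sample = Tuple[List[float], int]  # hypothetical: (frozen CLIP embedding, class label)

def leave_one_out_splits(
    data: Dict[Domain, List[Sample]],
) -> Dict[Domain, Tuple[List[Sample], List[Sample]]]:
    """For each target domain, build a (train, test) pair:
    train = samples from every other domain, test = the held-out domain."""
    splits = {}
    for target in data:
        train = [s for d, samples in data.items() if d != target for s in samples]
        test = list(data[target])
        splits[target] = (train, test)
    return splits

# Toy usage with placeholder PACS-style domains:
toy = {
    "photo": [([0.1, 0.2], 0)],
    "art": [([0.3, 0.4], 1)],
    "cartoon": [([0.5, 0.6], 0)],
    "sketch": [([0.7, 0.8], 1)],
}
splits = leave_one_out_splits(toy)
assert len(splits["sketch"][0]) == 3  # trained on the other three domains
assert len(splits["sketch"][1]) == 1  # tested on the held-out domain
```

Under this protocol a linear probe would then be fit on `splits[target][0]` and evaluated on `splits[target][1]` for each target domain in turn.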
Hardware Specification | Yes | "All experiments are conducted on a single NVIDIA RTX A6000 GPU."
Software Dependencies | No | The paper mentions using "GPT-4o (OpenAI, 2023)" as a model for interventional data generation, but does not provide specific version numbers for ancillary software components (e.g., programming languages, libraries, or frameworks such as Python, PyTorch, or CUDA) used to implement the methodology.
Experiment Setup | Yes | "Each value in Table 2 and Table 3 represents the mean and standard deviation over 5 runs with different random seeds. ... Here, I_{D_inv} denotes the identity matrix of dimension D_inv, and λ is a regularization hyperparameter. ... We conduct an ablation study regarding the choice of D_inv."
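The quoted setup pairs an identity matrix I_{D_inv} with a regularization hyperparameter λ, the standard ingredients of a ridge-style penalty. As a hedged illustration of how such a term typically enters a closed-form linear fit on frozen embeddings (this is a generic sketch, not the paper's exact objective), consider:

```python
import numpy as np

def ridge_fit(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution W = (X^T X + lam * I_{D})^{-1} X^T Y.

    X: (n, D) matrix of frozen embeddings (D playing the role of D_inv)
    Y: (n, k) matrix of targets (e.g., one-hot labels)
    lam: regularization hyperparameter (the lambda in the quote)
    """
    d = X.shape[1]
    # lam * np.eye(d) is the lambda * I_{D_inv}-style regularizer
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Sanity check on synthetic data: with negligible lam, ridge recovers
# the true linear map from noiseless observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
W_true = rng.normal(size=(8, 3))
Y = X @ W_true
W = ridge_fit(X, Y, lam=1e-6)
assert np.allclose(W, W_true, atol=1e-3)
```

Larger λ shrinks the solution toward zero, trading fit for stability, which is the usual reason such a regularizer is ablated alongside the embedding dimension.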