Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Authors: Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Damien Teney, Hamed Damirchi, Edison Marrese-Taylor, Anton van den Hengel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performances across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. On a large collection of datasets, the method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone, well beyond traditional ensembles. |
| Researcher Affiliation | Academia | 1Australian Institute for Machine Learning University of Adelaide 2Idiap Research Institute, Switzerland 3The University of Tokyo EMAIL |
| Pseudocode | No | The paper describes the proposed method, Neural Logit Controller (NLC), in Section 3 and provides details on its technical inspiration and learning process in paragraph form. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper mentions that "All models are obtained through the open-source project Open CLIP (Ilharco et al., 2021)" and that all the backbones used in the paper are pre-trained using the same dataset (called "openai") and objectives from OpenAI, via https://github.com/mlfoundations/open_clip. This refers to the third-party models and framework used, not the specific implementation code for the NLC method proposed in this paper. |
| Open Datasets | Yes | We use a selection of 21 popular image classification datasets: CALTECH101 (Li et al., 2022b), CARS (Krause et al., 2013), CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), CLEVR (Johnson et al., 2017), CUB (Wah et al., 2011), DTD (Cimpoi et al., 2014), EUROSAT (Helber et al., 2018), FGVC (Maji et al., 2013), FLOWERS (Nilsback and Zisserman, 2008), FOOD (Bossard et al., 2014), GTSRB (Houben et al., 2013), IMGNET-1K (Deng et al., 2009), MNIST (Deng, 2012), PCAM (Veeling et al., 2018), PETS (Parkhi et al., 2012), Rendered SST2 (Socher et al., 2013), RESISC45 (Cheng et al., 2017), STL10 (Coates et al., 2011) and SUN397 (Xiao et al., 2010). |
| Dataset Splits | Yes | The evaluation metric is simply the accuracy on each dataset's test split. The proposed method, named the Neural Logit Controller (NLC), uses a few labeled examples (as little as one per class) to tune the combination of backbones. The MLP directly produces a vector of temperatures t ∈ R^B and is trained on a holdout set drawn from the training set of each target dataset, using the cross-entropy loss between final predictions and ground-truth labels. We train such linear classifiers, using each backbone's frozen features and each dataset's training split. The remaining 10% are used to train NLC. |
| Hardware Specification | Yes | All our experiments are done on one NVIDIA GeForce RTX 4090 GPU, in a small server with 64 GB of RAM and 32 CPU cores. |
| Software Dependencies | No | The paper mentions the "ADAM optimizer" and "Open CLIP" but does not provide specific version numbers for these or any other software libraries (e.g., Python, PyTorch) that would be needed for replication. |
| Experiment Setup | Yes | The MLP takes as input the concatenated representations obtained by passing the images through the encoder ϕv of each backbone b ∈ B. The one-layer MLP is trained with the ADAM optimizer with a learning rate of 2e-4. We use a weight decay of 0.01. The hidden layer has a width of 128 and its output is the temperature values for the backbones used. We initialize the weights of the linear classifiers using language weights, which was shown to be more stable than a random initialization. The output of each backbone ϕb is L2-normalized before being passed to each linear classifier, which is trained using 90% of the target dataset's training split. The remaining 10% are used to train NLC. We use standard hyperparameters: 9 experts trained for classification with cross-entropy loss with Adam (Kingma and Ba, 2014) and a learning rate of 2e-5 for 300 epochs. |
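The table above describes the NLC as a one-hidden-layer MLP (width 128) that maps concatenated backbone features to one temperature per backbone, t ∈ R^B. The sketch below illustrates that forward pass in plain NumPy. It is an assumption, not the authors' released code: the ReLU nonlinearity and the combination rule (summing temperature-scaled per-backbone logits) are plausible readings of the paper's description, and all weight names (`w1`, `b1`, `w2`, `b2`) are hypothetical.

```python
import numpy as np

def nlc_forward(features, logits, w1, b1, w2, b2):
    """Sketch of a Neural Logit Controller (NLC) forward pass for one image.

    features : (D,)  concatenated features from all B backbones
    logits   : (B, C) per-backbone linear-classifier logits over C classes
    w1, b1   : MLP input -> hidden layer (width 128 in the paper)
    w2, b2   : hidden layer -> B temperatures, i.e. t in R^B

    Returns combined (C,) logits. The temperature-weighted sum used here
    is an assumed combination rule, not taken verbatim from the paper.
    """
    h = np.maximum(0.0, features @ w1 + b1)   # hidden layer (ReLU assumed)
    t = h @ w2 + b2                           # one temperature per backbone
    return (t[:, None] * logits).sum(axis=0)  # scale each backbone's logits, combine

# Toy usage with random weights (B=3 backbones, C=10 classes).
rng = np.random.default_rng(0)
B, C, D, H = 3, 10, 12, 128
out = nlc_forward(
    rng.normal(size=D), rng.normal(size=(B, C)),
    rng.normal(size=(D, H)), np.zeros(H),
    rng.normal(size=(H, B)), np.zeros(B),
)
print(out.shape)  # (10,)
```

In training, these combined logits would feed a cross-entropy loss against ground-truth labels on the 10% holdout split, with only the MLP parameters updated.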
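The setup also specifies that each backbone's linear classifier is initialized from language weights (CLIP text embeddings of the class names) and receives L2-normalized image features. A minimal NumPy sketch of that initialization, under the assumption that the language weights are simply the normalized class-name text embeddings; the function names are hypothetical:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """L2-normalize along an axis (eps guards against zero vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def init_linear_classifier(text_embeddings):
    """Language-weight initialization: one weight row per class, taken from
    the CLIP text embeddings of the class names (normalized).  Reported in
    the paper as more stable than random initialization."""
    return l2_normalize(text_embeddings)  # W in R^{C x D}

def classify(image_features, W):
    """Logits for one image: features are L2-normalized before the linear
    layer, as specified in the setup.  Returns a (C,) score vector."""
    return W @ l2_normalize(image_features)

# Toy usage: 5 classes, 16-dim embeddings.
rng = np.random.default_rng(1)
W = init_linear_classifier(rng.normal(size=(5, 16)))
scores = classify(rng.normal(size=16), W)
print(scores.shape)  # (5,)
```

Per the table, these probes are then fine-tuned on 90% of each dataset's training split, leaving the remaining 10% to train the NLC.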