Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Authors: Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Damien Teney, Hamed Damirchi, Edison Marrese-Taylor, Anton van den Hengel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performances across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. On a large collection of datasets, the method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone, well beyond traditional ensembles. |
| Researcher Affiliation | Academia | 1Australian Institute for Machine Learning University of Adelaide 2Idiap Research Institute, Switzerland 3The University of Tokyo EMAIL |
| Pseudocode | No | The paper describes the proposed method, Neural Logit Controller (NLC), in Section 3 and provides details on its technical inspiration and learning process in paragraph form. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper mentions that "All models are obtained through the open-source project Open CLIP (Ilharco et al., 2021)" and that all the backbones used in the paper are pre-trained using the same dataset (called "openai") and objectives from OpenAI, via https://github.com/mlfoundations/open_clip. This refers to the third-party models and framework used, not the specific implementation code for the NLC method proposed in this paper. |
| Open Datasets | Yes | We use a selection of 21 popular image classification datasets: CALTECH101 (Li et al., 2022b), CARS (Krause et al., 2013), CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), CLEVR (Johnson et al., 2017), CUB (Wah et al., 2011), DTD (Cimpoi et al., 2014), EUROSAT (Helber et al., 2018), FGVC (Maji et al., 2013), FLOWERS (Nilsback and Zisserman, 2008), FOOD (Bossard et al., 2014), GTSRB (Houben et al., 2013), IMGNET-1K (Deng et al., 2009), MNIST (Deng, 2012), PCAM (Veeling et al., 2018), PETS (Parkhi et al., 2012), Rendered SST2 (Socher et al., 2013), RESISC45 (Cheng et al., 2017), STL10 (Coates et al., 2011) and SUN397 (Xiao et al., 2010). |
| Dataset Splits | Yes | The evaluation metric is simply the accuracy on each dataset's test split. The proposed method, named the Neural Logit Controller (NLC), uses a few labeled examples (as little as one per class) to tune the combination of backbones. The MLP directly produces a vector of temperatures t ∈ R^B and is trained on a holdout set drawn from the training set of each target dataset, using the cross-entropy loss between final predictions and ground-truth labels. We train such linear classifiers, using each backbone's frozen features and each dataset's training split. The remaining 10% are used to train NLC. |
| Hardware Specification | Yes | All our experiments are done on one NVIDIA GeForce RTX 4090 GPU, in a small server with 64 GB of RAM and 32 CPU cores. |
| Software Dependencies | No | The paper mentions the "ADAM optimizer" and "Open CLIP" but does not provide specific version numbers for these or any other software libraries (e.g., Python, PyTorch) that would be needed for replication. |
| Experiment Setup | Yes | The MLP takes as input the concatenated representations obtained by passing the images through the encoder ϕv of each backbone b ∈ B. The one-layer MLP is trained with the ADAM optimizer with a learning rate of 2e-4. We use a weight decay of 0.01. The hidden layer has a width of 128 and its output is the temperature values for the backbones used. We initialize the weights of the linear classifiers using language weights, which was shown to be more stable than a random initialization. The output of each backbone ϕb is L2-normalized before being passed to each linear classifier, which is trained using 90% of the target dataset's training split. The remaining 10% are used to train NLC. We use standard hyperparameters: 9 experts trained for classification with cross-entropy loss with Adam (Kingma and Ba, 2014) and a learning rate of 2e-5 for 300 epochs. |
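The table above describes the NLC as a one-hidden-layer MLP (width 128) that maps concatenated backbone features to one temperature per backbone, t ∈ R^B. The sketch below illustrates that forward pass in plain NumPy. It is an assumption, not the authors' released code: the ReLU nonlinearity and the combination rule (summing temperature-scaled per-backbone logits) are plausible readings of the paper's description, and all weight names (`w1`, `b1`, `w2`, `b2`) are hypothetical.

```python
import numpy as np

def nlc_forward(features, logits, w1, b1, w2, b2):
    """Sketch of a Neural Logit Controller (NLC) forward pass for one image.

    features : (D,)  concatenated features from all B backbones
    logits   : (B, C) per-backbone linear-classifier logits over C classes
    w1, b1   : MLP input -> hidden layer (width 128 in the paper)
    w2, b2   : hidden layer -> B temperatures, i.e. t in R^B

    Returns combined (C,) logits. The temperature-weighted sum used here
    is an assumed combination rule, not taken verbatim from the paper.
    """
    h = np.maximum(0.0, features @ w1 + b1)   # hidden layer (ReLU assumed)
    t = h @ w2 + b2                           # one temperature per backbone
    return (t[:, None] * logits).sum(axis=0)  # scale each backbone's logits, combine

# Toy usage with random weights (B=3 backbones, C=10 classes).
rng = np.random.default_rng(0)
B, C, D, H = 3, 10, 12, 128
out = nlc_forward(
    rng.normal(size=D), rng.normal(size=(B, C)),
    rng.normal(size=(D, H)), np.zeros(H),
    rng.normal(size=(H, B)), np.zeros(B),
)
print(out.shape)  # (10,)
```

In training, these combined logits would feed a cross-entropy loss against ground-truth labels on the 10% holdout split, with only the MLP parameters updated.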
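The setup also specifies that each backbone's linear classifier is initialized from language weights (CLIP text embeddings of the class names) and receives L2-normalized image features. A minimal NumPy sketch of that initialization, under the assumption that the language weights are simply the normalized class-name text embeddings; the function names are hypothetical:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """L2-normalize along an axis (eps guards against zero vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def init_linear_classifier(text_embeddings):
    """Language-weight initialization: one weight row per class, taken from
    the CLIP text embeddings of the class names (normalized).  Reported in
    the paper as more stable than random initialization."""
    return l2_normalize(text_embeddings)  # W in R^{C x D}

def classify(image_features, W):
    """Logits for one image: features are L2-normalized before the linear
    layer, as specified in the setup.  Returns a (C,) score vector."""
    return W @ l2_normalize(image_features)

# Toy usage: 5 classes, 16-dim embeddings.
rng = np.random.default_rng(1)
W = init_linear_classifier(rng.normal(size=(5, 16)))
scores = classify(rng.normal(size=16), W)
print(scores.shape)  # (5,)
```

Per the table, these probes are then fine-tuned on 90% of each dataset's training split, leaving the remaining 10% to train the NLC.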