AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Authors: Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that AutoCLIP outperforms baselines across a broad range of vision-language models, datasets, and prompt templates consistently and by up to 3 percent point accuracy. We evaluate AutoCLIP on a large number of datasets, vision-language models, and prompt templates (Section 4) as well as in a controlled setting (Section 5). |
| Researcher Affiliation | Industry | Jan Hendrik Metzen (EMAIL), Bosch Center for Artificial Intelligence, Robert Bosch GmbH; Piyapat Saranrittichai (EMAIL), Bosch Center for Artificial Intelligence, Robert Bosch GmbH; Chaithanya Kumar Mummadi (EMAIL), Bosch Center for Artificial Intelligence, Robert Bosch LLC |
| Pseudocode | Yes | Algorithm 1 Zero-Shot Classifier for a single sample x... Algorithm 2 AutoCLIP: Auto-Tuned Zero-Shot Classifier for a single sample x |
| Open Source Code | Yes | We provide a basic implementation of AutoCLIP at https://github.com/boschresearch/autoclip. Code for reproducing the results of this section is available at https://github.com/boschresearch/autoclip. |
| Open Datasets | Yes | We conduct experiments on the datasets CUB200 (Welinder et al., 2010), EuroSAT (Helber et al., 2019), Food101 (Bossard et al., 2014), Oxford Pets (Parkhi et al., 2012), ImageNet (Russakovsky et al., 2015), ImageNetV2 (Kornblith et al., 2019), ImageNet-R (Hendrycks et al., 2021), and ImageNet-C (Hendrycks & Dietterich, 2019). |
| Dataset Splits | Yes | We conduct experiments on the datasets CUB200 (Welinder et al., 2010), EuroSAT (Helber et al., 2019), Food101 (Bossard et al., 2014), Oxford Pets (Parkhi et al., 2012), ImageNet (Russakovsky et al., 2015), ImageNetV2 (Kornblith et al., 2019), ImageNet-R (Hendrycks et al., 2021), and ImageNet-C (Hendrycks & Dietterich, 2019). |
| Hardware Specification | Yes | Here, encoding an image takes 12.64ms on a V100 (minimum over 100 images). |
| Software Dependencies | Yes | For bisection, we use an independent call to scipy.optimize.bisect (Virtanen et al., 2020) (maxiter=100, xtol=1e-2, rtol=1e-2). |
| Experiment Setup | Yes | We set the target entropy to β log2 K, where the entropy reduction factor β ∈ [0, 1] is the new free hyperparameter that we set globally to β = 0.85... For bisection, we use an independent call to scipy.optimize.bisect (Virtanen et al., 2020) (maxiter=100, xtol=1e-2, rtol=1e-2). |
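The quoted setup (target entropy β log2 K, tuned via `scipy.optimize.bisect` with maxiter=100, xtol=1e-2, rtol=1e-2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-template scores, the bisection bracket `[-10, 10]` on the log step size, and the helper names are assumptions introduced here; only the target-entropy formula and the `bisect` parameters come from the quotes above.

```python
import numpy as np
from scipy.optimize import bisect


def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()


def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()


def tune_weights(scores, beta=0.85):
    """Find softmax weights over K prompt-template scores whose entropy
    matches the target beta * log2(K), by bisecting over the log step size.

    `scores` and the bracket [-10, 10] are illustrative assumptions.
    """
    K = len(scores)
    target = beta * np.log2(K)

    def entropy_gap(log_step):
        # Larger step size -> peakier weights -> lower entropy.
        w = softmax(np.exp(log_step) * scores)
        return entropy_bits(w) - target

    # Parameters quoted from the paper's setup; the bracket is assumed.
    log_step = bisect(entropy_gap, -10.0, 10.0,
                      maxiter=100, xtol=1e-2, rtol=1e-2)
    return softmax(np.exp(log_step) * scores)


# Example: three hypothetical per-template similarity scores.
weights = tune_weights(np.array([1.0, 0.5, 0.2]), beta=0.85)
```

The sign change required by bisection holds whenever the scores are not all equal: a near-zero step size gives uniform weights (entropy log2 K, above the target), while a very large step size concentrates all weight on one template (entropy near 0, below it).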