Does VLM Classification Benefit from LLM Description Semantics?

Authors: Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section evaluates our approach on seven widely used benchmark datasets for (fine-grained) visual classification. We compare our approach to state-of-the-art methods and provide qualitative results. Quantitative results are shown in Table 1, where we report the peak accuracy for each dataset regardless of w_cls.
Researcher Affiliation | Academia | Pingchuan Ma (1,2,*), Lennart Rietdorf (1,*), Dmytro Kotovenko (1), Vincent Tao Hu (1,2), Björn Ommer (1,2); 1: CompVis @ LMU Munich; 2: Munich Center for Machine Learning
Pseudocode | Yes | Algorithm 1: Inference: Obtain distinctive language descriptions with feedback from VLM space.
Open Source Code | Yes | Code: https://github.com/CompVis/DisCLIP; Extended Version: https://arxiv.org/abs/2412.11917
Open Datasets | Yes | We evaluated our methods on the following standard datasets using the standard protocol...: ImageNet (Deng et al. 2009), ImageNetV2 (Recht et al. 2019), CUB200-2011 (Wah et al. 2011) (fine-grained bird classification), EuroSAT (Helber et al. 2017) (satellite image recognition), Places365 (Zhou et al. 2017), DTD (Textures; Cimpoi et al. 2014), and Flowers102 (Nilsback and Zisserman 2008).
Dataset Splits | Yes | We randomly sample a subset from each dataset's standard training split to obtain the lookup similarity table S (details see Appendix A.9). Datasets. We evaluated our methods on the following standard datasets using the standard protocol (classification accuracy) based on previous works (Menon and Vondrick 2023; Roth et al. 2023). For sample sizes n see Appendix A.9.
Hardware Specification | Yes | The authors gratefully acknowledge the Gauss Centre for Supercomputing for providing compute through the NIC on JUWELS at JSC and the HPC resources supplied by the Erlangen National High Performance Computing Center (NHR@FAU, funded by DFG).
Software Dependencies | Yes | The Large-Language-Model (LLM) generated descriptions are sourced directly from DCLIP (Menon and Vondrick 2023) or generated using the contrastive prompting method with gpt-3.5-turbo-1106 and Llama-3-70b-chat-hf via APIs.
Experiment Setup | Yes | A weighting factor w_cls ∈ ℝ⁺ is introduced into the vision-language ensemble: s(c, x) = (1/|D(c)|) · Σ_{d ∈ D(c)} w(d) · φ(d, x), where w(d) = w_cls if d = d_cls, and w(d) = 1/(|D(c)| − 1) if d ∈ D(c) ∖ {d_cls}. Weights of the classname-free descriptions are normalized to sum to one so that classes with different numbers of assigned descriptions have the same relative weightings. Given a test image x_i, one retrieves its top-k predictions based solely on text embeddings of prompts such as "a photo of [cls]".
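The weighted ensemble score above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the function name `ensemble_score`, the assumption that the first similarity corresponds to the classname prompt d_cls, and the default w_cls value are all choices made here for clarity.

```python
import numpy as np

def ensemble_score(sims, w_cls=2.0):
    """Weighted vision-language ensemble score s(c, x) for one class c.

    sims: 1-D array of similarities phi(d, x) between the image embedding
    and each description embedding of class c. By convention here,
    sims[0] is the classname prompt d_cls ("a photo of [cls]"); the
    remaining entries are the classname-free descriptions.
    """
    n = len(sims)                    # |D(c)|
    w = np.full(n, 1.0 / (n - 1))    # classname-free weights sum to one
    w[0] = w_cls                     # boosted weight for d_cls
    # s(c, x) = (1/|D(c)|) * sum_d w(d) * phi(d, x)
    return float(np.mean(w * sims))
```

With per-class scores computed this way, the predicted class is simply the argmax over all candidate classes (restricted to the top-k classname-prompt retrievals, per the setup above).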