Does VLM Classification Benefit from LLM Description Semantics?

Authors: Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section evaluates our approach on seven widely used benchmark datasets for (fine-grained) visual classification. We compare our approach to state-of-the-art methods and provide qualitative results. Quantitative results are shown in Table 1, where we report the peak accuracy for each dataset regardless of w_cls.
Researcher Affiliation | Academia | Pingchuan Ma (1,2,*), Lennart Rietdorf (1,*), Dmytro Kotovenko (1), Vincent Tao Hu (1,2), Björn Ommer (1,2); 1: CompVis @ LMU Munich; 2: Munich Center for Machine Learning
Pseudocode | Yes | Algorithm 1: Inference: Obtain distinctive language descriptions with feedback from VLM space.
Open Source Code | Yes | Code: https://github.com/CompVis/DisCLIP; Extended Version: https://arxiv.org/abs/2412.11917
Open Datasets | Yes | We evaluated our methods on the following standard datasets using the standard protocol...: ImageNet (Deng et al. 2009), ImageNetV2 (Recht et al. 2019), CUB200-2011 (Wah et al. 2011) (fine-grained bird classification), EuroSAT (Helber et al. 2017) (satellite image recognition), Places365 (Zhou et al. 2017), DTD (Textures; Cimpoi et al. 2014), and Flowers102 (Nilsback and Zisserman 2008).
Dataset Splits | Yes | We randomly sample a subset from each dataset's standard training split to obtain the lookup similarity table S (details see Appendix A.9). Datasets. We evaluated our methods on the following standard datasets using the standard protocol (classification accuracy) based on previous works (Menon and Vondrick 2023; Roth et al. 2023). For sample sizes n see Appendix A.9.
Hardware Specification | Yes | The authors gratefully acknowledge the Gauss Centre for Supercomputing for providing compute through the NIC on JUWELS at JSC and the HPC resources supplied by the Erlangen National High Performance Computing Center (NHR@FAU, funded by DFG).
Software Dependencies | Yes | The Large-Language-Model (LLM) generated descriptions are sourced directly from DCLIP (Menon and Vondrick 2023) or generated using the contrastive prompting method with gpt-3.5-turbo-1106 and Llama-3-70b-chat-hf via APIs.
Experiment Setup | Yes | A weighting factor w_cls ∈ ℝ⁺ is introduced into the vision-language ensemble: s(c, x) = (1/|D(c)|) · Σ_{d ∈ D(c)} w(d) · φ(d, x), where w(d) = w_cls if d = d_cls, and w(d) = 1/(|D(c)| − 1) if d ∈ D(c) ∖ {d_cls}. Weights of the classname-free descriptions are normalized to sum to one so that classes with different numbers of assigned descriptions have the same relative weightings. Given a test image x_i, one retrieves its top-k predictions based solely on text embeddings of prompts such as "a photo of [cls]".
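The weighted ensemble score above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the function name `ensemble_score`, the assumption that the first similarity corresponds to the classname prompt d_cls, and the default w_cls value are all choices made here for clarity.

```python
import numpy as np

def ensemble_score(sims, w_cls=2.0):
    """Weighted vision-language ensemble score s(c, x) for one class c.

    sims: 1-D array of similarities phi(d, x) between the image embedding
    and each description embedding of class c. By convention here,
    sims[0] is the classname prompt d_cls ("a photo of [cls]"); the
    remaining entries are the classname-free descriptions.
    """
    n = len(sims)                    # |D(c)|
    w = np.full(n, 1.0 / (n - 1))    # classname-free weights sum to one
    w[0] = w_cls                     # boosted weight for d_cls
    # s(c, x) = (1/|D(c)|) * sum_d w(d) * phi(d, x)
    return float(np.mean(w * sims))
```

With per-class scores computed this way, the predicted class is simply the argmax over all candidate classes (restricted to the top-k classname-prompt retrievals, per the setup above).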