Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs
Authors: Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N. Balasubramanian
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we comprehensively evaluate the classification performance of NeaR for the VF-FGVR task. We begin by describing the datasets, metrics and benchmark methods we compare against. ... The results are shown in Table 3, with all numbers reported for 3-shot training images. ... We conducted a thorough ablation to evaluate the contribution of each component in our pipeline in Table 5. |
| Researcher Affiliation | Collaboration | Hari Chandana Kuchibhotla (EMAIL), Indian Institute of Technology Hyderabad, India; Sai Srinivas Kancheti (EMAIL), Indian Institute of Technology Hyderabad, India; Abbavaram Gowtham Reddy (EMAIL), CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; Vineeth N Balasubramanian (EMAIL & EMAIL), Microsoft Research India & Indian Institute of Technology Hyderabad, India |
| Pseudocode | Yes | An overview of our methodology is presented in Figure 1, and the pseudocode is detailed in Algorithm 1 in the appendix. We begin by discussing the necessary preliminaries. ... Algorithm 1 NeaR algorithm: Training |
| Open Source Code | Yes | Our code is available at https://github.com/NeaR. |
| Open Datasets | Yes | We perform experiments on five benchmark fine-grained datasets: Caltech UCSD Bird-200 (Wah et al., 2011), Stanford Car-196 (Krause et al., 2013), Stanford Dog-120 (Khosla et al., 2011), Flower-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pet-37 (Parkhi et al., 2012). |
| Dataset Splits | Yes | Following (Liu et al., 2024a), for each dataset, NeaR and other baselines only have access to m unlabeled training images per class. Unless specified otherwise, we assume m = 3. Results for 1 ≤ m ≤ 10 are shown in Figure 2. ... Table A13: Train and test set sizes of the datasets used in this paper. The number of shots is denoted by m, with m = 3 used as the default in our experiments unless otherwise specified. |
| Hardware Specification | Yes | We run all our experiments on a single Nvidia Tesla V100-32GB GPU with an Nvidia driver version of 525.85.12. |
| Software Dependencies | Yes | We use PyTorch 2.4.0 and CUDA 12.0. We utilize the publicly available meta-llama/Llama-3.2-11B-Vision-Instruct model and Qwen/Qwen2-VL-2B-Instruct model from Hugging Face. |
| Experiment Setup | Yes | For both the CoOp baseline and our method, we introduce 16 trainable context vectors. The same set of prompts is optimized during the warmup stage and the subsequent training stage. We use the SGD optimizer with a learning rate of 0.002 and train for 50 epochs, including 10 warmup epochs, using a constant learning-rate schedule followed by cosine annealing. We use a temperature of 2 in the sharpening function. Our batch size is 32. The training hyperparameters are the same for CoOp and NeaR. |
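The optimization setup reported in the table (SGD at lr 0.002, 50 epochs with a 10-epoch warmup, a constant learning rate followed by cosine annealing, 16 trainable context vectors) can be sketched in PyTorch as below. This is a minimal illustration of the reported hyperparameters only: the context-vector dimensionality (512) and the placeholder training step are assumptions, not details from the report.

```python
import torch

# 16 trainable context vectors, as in CoOp-style prompt tuning.
# The embedding dimension (512) is an assumed placeholder.
ctx_vectors = torch.nn.Parameter(torch.randn(16, 512) * 0.02)

# SGD optimizer with the reported learning rate.
optimizer = torch.optim.SGD([ctx_vectors], lr=0.002)

# 50 total epochs, the first 10 of which are warmup.
warmup_epochs, total_epochs = 10, 50

# Constant learning rate during warmup, then cosine annealing,
# chained sequentially as the report describes.
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.ConstantLR(
            optimizer, factor=1.0, total_iters=warmup_epochs
        ),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_epochs - warmup_epochs
        ),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # Placeholder for the actual training step (forward, loss, backward).
    optimizer.step()
    scheduler.step()
```

With `factor=1.0` the warmup phase simply holds the learning rate at 0.002; after epoch 10 the cosine schedule decays it toward zero over the remaining 40 epochs.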