Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
Authors: Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carefully design quantitative and human evaluations of the discovered concepts on nine diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. Code and models are publicly released. |
| Researcher Affiliation | Collaboration | Yuan Zang (Department of Computer Science, Brown University); Tian Yun (Department of Computer Science, Brown University); Hao Tan (Adobe Research); Trung Bui (Adobe Research); Chen Sun (Department of Computer Science, Brown University) |
| Pseudocode | No | The paper describes methods in Section 3.3 'Visual Concept Discovery and Learning' but does not present them in a structured pseudocode or algorithm block format. The descriptions are narrative. |
| Open Source Code | Yes | Code and models are publicly released. Project page and code: https://conceptdiscovery.github.io |
| Open Datasets | Yes | We conduct experiments on several challenging fine-grained image classification datasets, including ImageNet (Deng et al., 2009), Food (Bossard et al., 2014), CIFAR-100 (Krizhevsky et al., 2009), CIFAR-10, CUB (Wah et al., 2011), Flowers (Nilsback & Zisserman, 2008), Stanford Cars (Krause et al., 2015), Aircrafts (Maji et al., 2013), and Oxford Pets (Parkhi et al., 2012). |
| Dataset Splits | Yes | The statistics and splits of the datasets are shown in the appendix. (Table A1: Statistical details of datasets. #Class denotes the number of classes; #Train, #Valid, and #Test denote the instance counts of each split. For ImageNet, we randomly select 10% of the training set as the validation set and regard the validation set as the test set.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, or cloud computing resources) for running its experiments. |
| Software Dependencies | Yes | We use the same LLM, GPT-3 (text-davinci-002), to obtain descriptors as previous works. We also use the same CLIP backbone (ViT-L/14) to compare with baseline models. Following Yun et al. (2023), we use logistic regression to train the concept bottleneck models. We observe that the performance of CBM is robust to the choice of hyperparameters and use the default Scikit-learn hyperparameters for all datasets. |
| Experiment Setup | Yes | For concept learning, we use the AdamW optimizer with a 5e-4 learning rate and 1e-4 weight decay to fine-tune the CLIP model, and we use the validation loss to select checkpoints. For α in Equation 2, we set it to 0.7 for the ImageNet dataset, 0.8 for the Food-101, CIFAR-100, CUB-200, and Flowers-102 datasets, and 0.9 for the CIFAR-10 dataset. |
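The concept-bottleneck training described above (logistic regression over concept scores with default Scikit-learn hyperparameters) can be sketched as follows. This is a minimal illustration, not the paper's released code: the synthetic `concept_scores` array stands in for the real per-concept CLIP activations, and the labels are fabricated for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: per-image concept activation scores
# (in the paper these would come from CLIP image-concept similarities).
rng = np.random.default_rng(0)
n_samples, n_concepts, n_classes = 200, 16, 4
concept_scores = rng.normal(size=(n_samples, n_concepts))
# Synthetic labels loosely tied to the first few concept dimensions.
labels = concept_scores[:, :n_classes].argmax(axis=1)

# Concept bottleneck classifier: a linear probe over concept scores,
# using default Scikit-learn hyperparameters as the paper reports.
cbm = LogisticRegression().fit(concept_scores, labels)
accuracy = cbm.score(concept_scores, labels)
```

The learned `cbm.coef_` matrix (one row per class, one column per concept) is what makes the bottleneck interpretable: each class prediction is a weighted sum of concept activations.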
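The fine-tuning configuration in the experiment-setup row (AdamW, learning rate 5e-4, weight decay 1e-4) maps directly onto a standard PyTorch optimizer setup. The sketch below uses a small `nn.Linear` module as a hypothetical stand-in for the CLIP model, and a placeholder loss rather than the paper's actual objective (which mixes terms weighted by α per Equation 2).

```python
import torch

# Hypothetical stand-in for the CLIP model being fine-tuned.
model = torch.nn.Linear(512, 512)

# AdamW with the hyperparameters reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)

# One illustrative training step. The real objective combines
# classification and concept losses weighted by alpha (0.7-0.9
# depending on the dataset); this placeholder loss only shows the
# optimizer mechanics.
x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Checkpoint selection by validation loss, as the paper describes, would wrap this step in an epoch loop that tracks the best validation value seen so far.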