Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
Authors: Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carefully design quantitative and human evaluations of the discovered concepts on nine diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. Code and models are publicly released. |
| Researcher Affiliation | Collaboration | Yuan Zang (Department of Computer Science, Brown University); Tian Yun (Department of Computer Science, Brown University); Hao Tan (Adobe Research); Trung Bui (Adobe Research); Chen Sun (Department of Computer Science, Brown University) |
| Pseudocode | No | The paper describes methods in Section 3.3 'Visual Concept Discovery and Learning' but does not present them in a structured pseudocode or algorithm block format. The descriptions are narrative. |
| Open Source Code | Yes | Code and models are publicly released. Project page and code: https://conceptdiscovery.github.io |
| Open Datasets | Yes | We conduct experiments on several challenging fine-grained image classification datasets, including ImageNet (Deng et al., 2009), Food (Bossard et al., 2014), CIFAR-100 (Krizhevsky et al., 2009), CIFAR-10, CUB (Wah et al., 2011), Flowers (Nilsback & Zisserman, 2008), Stanford Cars (Krause et al., 2015), Aircrafts (Maji et al., 2013), and Oxford Pets (Parkhi et al., 2012). |
| Dataset Splits | Yes | The statistics and splits of the datasets are shown in the appendix. (Table A1: Statistical details of datasets. #Class denotes the number of classes; #Train, #Valid, and #Test denote the instance counts of each split. For ImageNet, we randomly select 10% of the training set as the validation set and regard the validation set as the test set.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, or cloud computing resources) for running its experiments. |
| Software Dependencies | Yes | We use the same LLM, GPT-3 (text-davinci-002), to obtain descriptors as previous works. We also use the same CLIP backbone (ViT-L/14) to compare with baseline models. Following Yun et al. (2023), we use logistic regression to train the concept bottleneck models. We observe that the performance of CBM is robust to the choice of hyperparameters and use the default Scikit-learn hyperparameters for all datasets. |
| Experiment Setup | Yes | For concept learning, we use the AdamW optimizer with a 5e-4 learning rate and 1e-4 weight decay to fine-tune the CLIP model, and we use the validation loss to select checkpoints. For α in Equation 2, we set it to 0.7 for the ImageNet dataset, 0.8 for the Food-101, CIFAR-100, CUB-200, and Flowers-102 datasets, and 0.9 for the CIFAR-10 dataset. |
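The concept-bottleneck training described above (logistic regression over concept scores with default Scikit-learn hyperparameters) can be sketched as follows. This is a minimal illustration, not the paper's released code: the synthetic `concept_scores` array stands in for the real per-concept CLIP activations, and the labels are fabricated for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: per-image concept activation scores
# (in the paper these would come from CLIP image-concept similarities).
rng = np.random.default_rng(0)
n_samples, n_concepts, n_classes = 200, 16, 4
concept_scores = rng.normal(size=(n_samples, n_concepts))
# Synthetic labels loosely tied to the first few concept dimensions.
labels = concept_scores[:, :n_classes].argmax(axis=1)

# Concept bottleneck classifier: a linear probe over concept scores,
# using default Scikit-learn hyperparameters as the paper reports.
cbm = LogisticRegression().fit(concept_scores, labels)
accuracy = cbm.score(concept_scores, labels)
```

The learned `cbm.coef_` matrix (one row per class, one column per concept) is what makes the bottleneck interpretable: each class prediction is a weighted sum of concept activations.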
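The fine-tuning configuration in the experiment-setup row (AdamW, learning rate 5e-4, weight decay 1e-4) maps directly onto a standard PyTorch optimizer setup. The sketch below uses a small `nn.Linear` module as a hypothetical stand-in for the CLIP model, and a placeholder loss rather than the paper's actual objective (which mixes terms weighted by α per Equation 2).

```python
import torch

# Hypothetical stand-in for the CLIP model being fine-tuned.
model = torch.nn.Linear(512, 512)

# AdamW with the hyperparameters reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)

# One illustrative training step. The real objective combines
# classification and concept losses weighted by alpha (0.7-0.9
# depending on the dataset); this placeholder loss only shows the
# optimizer mechanics.
x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Checkpoint selection by validation loss, as the paper describes, would wrap this step in an epoch loop that tracks the best validation value seen so far.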