Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Authors: Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on two image classification benchmarks where the target labels are composite concepts. The first dataset is MIT-States (Isola et al., 2015), which contains 53K images... The other dataset is Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011)... Tables 1, 2, 3, 4, 5, 6, A1, A2, A4, A5, A6, A7, A8 provide quantitative results from these experiments. |
| Researcher Affiliation | Academia | Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun — Department of Computer Science, Brown University |
| Pseudocode | No | The paper describes a two-step framework called Compositional Concept Mapping (Comp Map) in Section 3 and illustrates it in Figure 1, but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/tttyuntian/vlm_primitive_concepts |
| Open Datasets | Yes | We conduct experiments on two image classification benchmarks... The first dataset is MIT-States (Isola et al., 2015)... The other dataset is Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011). |
| Dataset Splits | Yes | For the MIT-States dataset, we follow the evaluation protocol from Purushwalkam et al. (2019)... We use the standard splits from Purushwalkam et al. (2019). The training split has 30K images of 1262 seen attribute-object compositions, the validation split has 10K images of 300 seen and 300 unseen compositions, and the test split has 13K images of 400 seen and unseen compositions. For the CUB dataset, we follow the standard practice in n-way k-shot evaluation (Snell et al., 2017) and report mean accuracy over the 600 sampled tasks run for each setup. In each task, the n classes and k examples are chosen at random. |
| Hardware Specification | No | The paper mentions specific visual encoders like "ViT-B/32" and "ViT-B/16" (which are model architectures), but does not specify the actual hardware (e.g., GPU models, CPU types) used to run the experiments or train the models. |
| Software Dependencies | No | For CUB, a logistic regression model is used to train the composition models... We used the default sklearn hyperparameters and L2 regularization of 1 for all experiments. The paper mentions 'sklearn' but does not provide a version number for it or any other software library. |
| Experiment Setup | Yes | For MIT-States, a contrastive objective is used to train the composition models. We apply two linear projection layers on the primitive concept activations and the text embeddings of target composite concepts respectively, to embed them into a shared space... For CUB, a logistic regression model is used to train the composition models since the number of target composite concepts is fixed. We used the default sklearn hyperparameters and L2 regularization of 1 for all experiments. |
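The CUB composition model described in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration, not the authors' code: the array shapes, random inputs, and the 312-dimensional concept vector are hypothetical stand-ins for the primitive concept activations produced by the vision-language model, but the classifier settings follow the quoted description (default sklearn hyperparameters, L2 regularization of 1, i.e. `C=1.0`).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for primitive concept activations:
# rows are images, columns are scored primitive concepts
# (e.g. 312 CUB attributes); labels are composite (species) classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 312))
y_train = rng.integers(0, 10, size=200)
X_test = rng.normal(size=(50, 312))

# sklearn's LogisticRegression defaults to an L2 penalty, and C=1.0
# matches the paper's "L2 regularization of 1".
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```

Under this setup, `preds` holds one composite-class prediction per test image; the paper's actual inputs would be concept activations rather than random features.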