Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Authors: Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on two image classification benchmarks where the target labels are composite concepts. The first dataset is MIT-States (Isola et al., 2015), which contains 53K images... The other dataset is Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011)... Tables 1, 2, 3, 4, 5, 6, A1, A2, A4, A5, A6, A7, A8 provide quantitative results from these experiments. |
| Researcher Affiliation | Academia | Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun — Department of Computer Science, Brown University |
| Pseudocode | No | The paper describes a two-step framework called Compositional Concept Mapping (Comp Map) in Section 3 and illustrates it in Figure 1, but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/tttyuntian/vlm_primitive_concepts |
| Open Datasets | Yes | We conduct experiments on two image classification benchmarks... The first dataset is MIT-States (Isola et al., 2015)... The other dataset is Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011). |
| Dataset Splits | Yes | For the MIT-States dataset, we follow the evaluation protocol from Purushwalkam et al. (2019)... We use the standard splits from Purushwalkam et al. (2019). The training split has 30K images of 1262 seen attribute-object compositions, the validation split has 10K images of 300 seen and 300 unseen compositions, and the test split has 13K images of 400 seen and unseen compositions. For the CUB dataset, we follow the standard practice in n-way k-shot evaluation (Snell et al., 2017) and report mean accuracy over the 600 sampled tasks run for each setup. In each task, the n classes and k examples are chosen at random. |
| Hardware Specification | No | The paper mentions specific visual encoders like "ViT-B/32" and "ViT-B/16" (which are model architectures), but does not specify the actual hardware (e.g., GPU models, CPU types) used to run the experiments or train the models. |
| Software Dependencies | No | For CUB, a logistic regression model is used to train the composition models... We used the default sklearn hyperparameters and L2 regularization of 1 for all experiments. The paper mentions 'sklearn' but does not provide a version number for it or any other software library. |
| Experiment Setup | Yes | For MIT-States, a contrastive objective is used to train the composition models. We apply two linear projection layers on the primitive concept activations and the text embeddings of target composite concepts respectively, to embed them into a shared space... For CUB, a logistic regression model is used to train the composition models since the number of target composite concepts is fixed. We used the default sklearn hyperparameters and L2 regularization of 1 for all experiments. |
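The CUB composition model described in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration, not the authors' code: the array shapes, random inputs, and the 312-dimensional concept vector are hypothetical stand-ins for the primitive concept activations produced by the vision-language model, but the classifier settings follow the quoted description (default sklearn hyperparameters, L2 regularization of 1, i.e. `C=1.0`).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for primitive concept activations:
# rows are images, columns are scored primitive concepts
# (e.g. 312 CUB attributes); labels are composite (species) classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 312))
y_train = rng.integers(0, 10, size=200)
X_test = rng.normal(size=(50, 312))

# sklearn's LogisticRegression defaults to an L2 penalty, and C=1.0
# matches the paper's "L2 regularization of 1".
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```

Under this setup, `preds` holds one composite-class prediction per test image; the paper's actual inputs would be concept activations rather than random features.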