CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Authors: Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We probe CLIPTeX and other pre-trained models on different downstream tasks and multiple datasets using classifier or regressor probes. This helps us understand whether training with hard pseudo-labels from experts can improve the effectiveness of CLIP's image representations across different vision tasks. Experiments with multiple probes on a variety of vision tasks and datasets (e.g., segmentation on PASCAL VOC and ADE20K, detection on COCO, depth estimation on NYU-v2, classification on ImageNet-1k and Places365, and surface normal estimation on NYU-v2) demonstrate the effectiveness of CLIPTeX.
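Probing, as described above, trains a small head on frozen features to measure representation quality. A minimal sketch of a linear (softmax-regression) probe on precomputed embeddings, assuming features have already been extracted; all names and hyperparameters here are illustrative, not the paper's:

```python
import numpy as np

def train_linear_probe(feats, labels, num_classes, lr=0.1, epochs=200, seed=0):
    """Fit a softmax linear classifier on frozen image features via gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy check: linearly separable 2-D "features" standing in for CLIP embeddings
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
W, b = train_linear_probe(X, y, num_classes=2)
preds = (X @ W + b).argmax(axis=1)
```

The same pattern extends to the regressor probes mentioned above (e.g., for depth) by swapping the softmax cross-entropy for a regression loss.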
Researcher Affiliation | Collaboration | Mohammadreza Salehi (University of Washington); Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel (Apple); Ali Farhadi (Allen Institute for Artificial Intelligence); Mohammad Rastegari, Sachin Mehta (Apple)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. The methodology is described in natural language and mathematical formulas.
Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We finetune pre-trained CLIP with and without pseudo-labels on CC3M (Sharma et al., 2018) for 30 epochs on 64 A100 GPUs. Semantic segmentation: we use PASCAL VOC (Everingham et al., 2010) with 20 classes and ADE20K (Zhou et al., 2019) with 150 classes. Object detection and instance segmentation: we use the COCO dataset. Monocular depth estimation: we use the NYU-v2 (Silberman et al., 2012) dataset as our depth estimation benchmark. Image classification: we evaluate on two standard image classification datasets, i.e., ImageNet (Russakovsky et al., 2015) and Places365 (Zhou et al., 2017). Retrieval: we evaluate on Flickr-30k (Young et al., 2014).
Dataset Splits | Yes | Following standard convention, we report accuracy on the validation sets of these datasets in terms of mean intersection over union (mIoU). Following standard convention, we evaluate accuracy on COCO's validation set in terms of mean average precision (mAP). We use absolute relative error as the metric for evaluation on the validation set, and evaluate on the official test set of NYU-v2. We use top-1 accuracy on the validation set as an evaluation metric.
Hardware Specification | Yes | Therefore, to show the efficacy of our approach, we finetune pre-trained CLIP with and without pseudo-labels on CC3M (Sharma et al., 2018) for 30 epochs on 64 A100 GPUs.
Software Dependencies | No | The paper mentions various models and architectures (e.g., Mask R-CNN, DPT, NLL-AngMF, DeepLabv3, PSPNet, SSD) but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | Hyperparameters used during training and probing CLIPTeX and other models are given in Table 8 and Table 9, respectively. For selecting λclip and λtask (where task ∈ {depth, seg, surface normal}) in Eq. (1), a linear search over the values 0.1, 1.0, and 10 was used for each task. We found that λclip = λtask = 1.0 worked well for all tasks except segmentation, where λseg = 0.1 delivered the best or close to the best performance. So we set these hyper-parameters in our experiments.
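The objective in Eq. (1) combines the CLIP loss with task-specific pseudo-label losses through the weights λclip and λtask. A minimal sketch of such a weighted sum, assuming a simple additive combination (function and argument names are illustrative; the default weights mirror the choices reported above):

```python
def combined_loss(loss_clip, task_losses, lambda_clip=1.0, lambda_task=None):
    """Weighted sum of the CLIP loss and per-task pseudo-label losses.

    `lambda_task` maps task name -> weight. The defaults reflect the
    reported hyperparameters: 1.0 for all tasks except segmentation (0.1).
    """
    if lambda_task is None:
        lambda_task = {"depth": 1.0, "seg": 0.1, "surface_normal": 1.0}
    total = lambda_clip * loss_clip
    for task, loss in task_losses.items():
        total += lambda_task[task] * loss
    return total

# Example with scalar stand-ins for batch-averaged losses
total = combined_loss(0.5, {"depth": 0.2, "seg": 1.0, "surface_normal": 0.3})
```

In a training loop the same expression would be applied to tensor-valued losses before the backward pass; only the weighting logic is shown here.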